Master Thesis: Evaluating The Impact of Curriculum Learning On The Training Process For An Intelligent Agent in A Videogame
A final project submitted in partial fulfillment of the requirements for the degree of:
Master in Systems and Computing Engineering
Supervised by:
Ph.D. Jorge Eliécer Camargo Mendoza
Research line:
Reinforcement Learning in Video Games
Abstract
We aim to measure the impact of the curriculum learning technique on the training time of an intelligent agent that is learning to play a video game using reinforcement learning. To this end, we designed several experiments with different curricula adapted to the video game selected as a case study and ran them on a selected game simulation platform, using two reinforcement learning algorithms and measuring their performance with the mean cumulative reward. The results suggest that using curriculum learning has a significant impact on the training process, in some cases lengthening training times and in other cases reducing them by up to 40%.
Keywords
Figure 1-1: An agent was trained in a video game with an action space consisting of four discrete actions, and then transferred to a robot with a different action space with a small amount of training for the robot [29] Karttunen et al. (2020).
Figure 1-2: (a) A schematic showing 4 affordance variables (lane_LL, lane_L, lane_R, lane_RR). (b) A schematic showing the other 4 affordance variables (angle, car_L, car_M, car_R). (c) An in-game screenshot showing lane_LL, lane_L, lane_R, and lane_RR. (d) An in-game screenshot showing the detection of the number of lanes in a road. (e) An in-game screenshot showing angle. (f) An in-game screenshot showing car_L, car_M, and car_R [30] Martinez et al. (2017).
Figure 2-1: Typical Reinforcement Learning training cycle. [19] Juliani (2017).
Figure 2-3: Influence diagram of relevant deep learning techniques applied to commonly used games for game AI research. [7] Justesen et al. (2017).
Figure 2-4: Typical network architecture used in deep reinforcement learning for game-playing. [7] Justesen et al. (2017).
Figure 2-5: Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last. [4] Juliani (2017).
Figure 2-6: A simplified visual representation of how a continuation method works, by defining a sequence of optimization problems of increasing complexity, where the first ones are easy to solve but only the last one corresponds to the actual problem of interest. [5] Gulcehre et al. (2019).
Figure 2-7: Different subgames in the game of Quick Chess, which are used to form a curriculum for learning the full game of Chess [39] Narvekar et al. (2020).
Figure 3-1: Typical deep reinforcement learning model [35] Shao et al. (2019).
Figure 3-2: Taxonomy of game simulation platforms based on the flexibility of environment specification according to [18] Juliani et al. (2020).
Figure 3-3: Diagram of high-level components inside the Unity Machine Learning Agents Toolkit [40] Juliani et al. (2020).
Figure 4-2: Pseudocode for Soft Actor-Critic [42] Haarnoja et al. (2018).
Figure 6-1: Screenshot of learning environment Soccer Twos [40] Juliani et al. (2020).
Figure 6-4: Objects that can be detected by the agent: top-left: Soccer ball, top-center: Agents of the same team as the agent in training, top-right: Agents of the opposite team, bottom-left: Walls of the soccer field, bottom-center: Agent's own goal, bottom-right: Opposite team's goal.
Figure 6-5: Possible movements that the agent can perform: left: lateral motion, center: frontal motion, right: rotation around its Y-axis.
Figure 6-6: Expected behavior of mean cumulative reward values over time for the blue agent in a successful training lesson.
Figure 7-1: Mean cumulative reward for Proximal Policy Optimization over 100 million matches.
Figure 7-2: Mean cumulative reward for Soft Actor-Critic over 100 million soccer matches.
Figure 7-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum A over 100 million soccer matches.
Figure 7-4: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum A over 100 million soccer matches.
Figure 7-5: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum B over 100 million soccer matches.
Figure 7-6: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum B over 100 million soccer matches.
Figure 7-7: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum A+B over 100 million soccer matches.
Figure 7-8: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic and Curriculum A+B over 100 million soccer matches.
Figure 7-9: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum D over 100 million soccer matches.
Figure 7-10: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum G over 100 million soccer matches.
Figure 7-11: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum H over 100 million soccer matches.
Figure 7-12: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum P over 100 million soccer matches.
Figure 7-13: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Q over 100 million soccer matches.
Figure 7-14: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum R over 100 million soccer matches.
Figure 7-15: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum U over 100 million soccer matches.
Figure 7-16: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum W over 100 million soccer matches.
Figure 7-17: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum X over 100 million soccer matches.
Figure 7-18: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Y over 100 million soccer matches.
Figure 7-19: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum Z over 100 million soccer matches.
Figure 7-20: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum C over 100 million soccer matches.
Figure 7-21: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum J over 100 million soccer matches.
Figure 7-22: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum M over 100 million soccer matches.
Figure 7-23: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum N over 100 million soccer matches.
Figure 7-24: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum O over 100 million soccer matches.
Figure 7-25: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum V over 100 million soccer matches.
Figure 7-26: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum E over 100 million soccer matches.
Figure 7-27: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum F over 100 million soccer matches.
Figure 7-28: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum K over 100 million soccer matches.
Figure 7-29: Mean cumulative reward for Proximal Policy Optimization control experiment vs Proximal Policy Optimization and Curriculum L over 100 million soccer matches.
Figure 8-1: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had lower performance than the control experiment at the end of 100 million soccer matches.
Figure 8-2: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had the same performance as the control experiment at the end of 100 million soccer matches.
Figure 8-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs all curriculum experiments that had higher performance than the control experiment at the end of 100 million soccer matches.
Figure 8-4: Mean cumulative reward for Proximal Policy Optimization control experiment vs the best 4 curriculum experiments compared over 100 million soccer matches.
Contents
1. Introduction
1.2. Motivation
1.4. Contributions
2. Background
2.1. Reinforcement Learning
2.2. Perception-Action-Learning
2.3. Deep Learning in Video Games
3. Game Simulation Platform
4. Reinforcement Learning Algorithms
5. Hyperparameters Description
6. Case Study: Toy Soccer Game
7. Experimental Results
8. Discussion
9. Conclusions
10. References
1. Introduction
In this document we present the results of several experiments with curriculum learning applied
to a game AI learning process to measure its effects on the learning time, specifically we trained
an agent using a reinforcement learning algorithm to play a video game running on a game
simulation platform, then we trained another agent under the same conditions but including a
training curriculum, which is a set of rules that modify the learning environment at specific times
to make it easier to master by the agent at the beginning, then we compared both results. Our
initial hypothesis is that in some cases using a training curriculum would allow the agent to learn
faster, reducing the training time required.
We describe in detail all the main elements of our work, including the choice of the game simulation platform used to run the training experiments, the review of the reinforcement learning algorithms used to train the agent, the description of the video game selected as the case study, the parameters used to design the training curriculums, and a discussion of the results obtained.
Our work is motivated by several open challenges in game AI research, including the following:
■ General video game playing: This challenge refers to creating general intelligent agents that can play not only a single game but an arbitrary number of known and unknown games. According to [23] Legg et al. (2007), being able to solve a single problem does not make an agent intelligent; to learn general intelligent behavior, it needs to train not on a single task but on many different tasks. [24] Schaul et al. (2011) suggest that video games are ideal environments for Artificial General Intelligence (AGI), in part because there are multiple video games that share common interface and reward conventions.
■ Computational resources: Training an agent that uses deep neural networks to learn how to play an open-world game like Grand Theft Auto V [32] Rockstar Games (2020) usually requires huge amounts of computational power. This issue becomes even more noticeable if we want to train several agents. According to [7] Justesen et al. (2017), it is not yet feasible to train deep networks in real time to allow agents to adapt instantly to changes in the game or to a particular playing style, which could be useful in the design of new types of games.
■ Games with very sparse rewards: Some games, such as Montezuma's Revenge, are characterized by very sparse rewards and still pose a challenge for most current deep reinforcement learning techniques. Approaches that might be useful for this kind of game include hierarchical reinforcement learning [31] Barto et al. (2003) and intrinsically motivated reinforcement learning [31] Singh et al. (2005).
■ Dealing with extremely large decision spaces: For well-known board games such as Chess the average branching factor is around 30, but for video games like Grand Theft Auto V [32] Rockstar Games (2020) or StarCraft the branching factor is several orders of magnitude larger. How to scale deep reinforcement learning to handle such levels of complexity is an important open challenge.
These challenges have no straightforward solution; rather, they require a combination of multiple techniques to mitigate their impact. We believe curriculum learning could be one of those techniques, since it could allow agents to learn in less time, which in turn could mean fewer computational resources required.
In the case of general video game playing, we believe curriculum learning could help to overcome this challenge in most scenarios: agents could be trained on easier 2D games first, learning how to explore levels and collect items that can later help them defeat enemies and overcome obstacles, and then be trained on harder and more complex 3D games, reusing the experience gained from the simpler games. Setting up the training process this way could make the agent learn faster than a training process in which games are presented to the agent in a random order of complexity.
Using less time for training provides additional side benefits, including less money spent on cloud servers to run the training, lower energy consumption, a smaller carbon footprint, and, in the case of commercial video games, faster product release times.
This leads to our research question: does an agent that uses reinforcement learning to learn how to play a video game learn faster if curriculum learning is used during training?
1.2. Motivation
Training intelligent agents to play video games has several applications in the real world, since a video game can be seen as a low-cost and low-risk playground for learning complex tasks. In most cases, direct agent interaction with the real world is either expensive or not feasible, since the real world is far too complex for the agent to perceive and understand, so it makes sense to simulate the interaction in a virtual learning environment that receives input and returns feedback on each decision made by the agent. Most of the knowledge about the environment can then be transferred to a physical agent, usually a robot; this approach is called Sim-to-Real (Simulation to Real World). In this scenario, faster learning times when training agents could benefit several real-world applications, a few of which are listed below:
■ [29] Karttunen et al. (2020) performed several Sim-to-Real experiments, training an agent with deep reinforcement learning to learn a navigation task in ViZDoom [12] Kempka et al. (2016); the learned policy was then transferred to a physical robot using transfer learning, freezing most of the pre-trained neural network parameters. This process is depicted in Figure 1-1.
Figure 1-1: An agent was trained in a video game with an action space consisting of four
discrete actions, and then transferred to a robot with a different action space with a small
amount of training for the robot [29] Karttunen et al. (2020).
■ [30] Martinez et al. (2017) used over 480,000 labeled images of highway driving generated in Grand Theft Auto V, a popular video game released in 2013 by [32] Rockstar Games (2020), to train a convolutional neural network to estimate the distance to cars and objects ahead, lane markings, and driving angle (angular heading relative to the lane centerline), all variables required for the development of an autonomous driving system. Figure 1-2 shows several schematics and game screenshots of the lanes and cars detected.
Figure 1-2: (a) A schematic showing 4 affordance variables (lane_LL, lane_L, lane_R,
lane_RR). (b) A schematic showing the other 4 affordance variables (angle, car_L, car_M,
car_R). (c) An in-game screenshot showing lane_LL, lane_L, lane_R, and lane_RR. (d) An
in-game screenshot showing the detection of the number of lanes in a road. (e) An in-game
screenshot showing angle. (f) An in-game screenshot showing car_L, car_M, and car_R [30]
Martinez et al. (2017).
In chapter 3 ‘Game Simulation Platform’, several game simulation platforms are listed, describing their main features, advantages, and restrictions. A taxonomy to classify the platforms is presented and used to support the selection of Unity and its machine learning library, the ML-Agents Toolkit, as the preferred platform to run the experiments; its main components are then listed and described, including a high-level architecture diagram.
In chapter 4 ‘Reinforcement Learning Algorithms’, we provide formal definitions for two policy-
based reinforcement learning algorithms chosen for our experiments, both included in the Unity
ML-Agents Toolkit: Proximal Policy Optimization (PPO), an on-policy method that collects a small
batch of experiences to update its decision-making policy, ensuring that the updated policy does
not change too much from the previous one, and Soft Actor-Critic (SAC), an off-policy method
that seeks to maximize the entropy of its stochastic policy and learns using past samples from
experience replay buffers.
In chapter 5 ‘Hyperparameters Description’, we list and describe all hyperparameters and training configurations offered by the Unity ML-Agents Toolkit for the Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) algorithms. The toolkit provides recommended values for most parameters, and we based our training setup on those recommendations.
In chapter 6 ‘Case Study: Toy Soccer Game’, we describe in detail the toy soccer video game chosen as the learning environment for our experiments, a game in which two agents compete against each other to win a soccer match by scoring goals. We review in detail how the agents can perceive their surroundings, what kind of actions they can perform, and how they are rewarded for their actions. We also describe how the mean cumulative reward can be used to measure the agent's performance and what the expected behavior of this value over time is in a successful training session. Finally, we describe the environment parameters used to control how difficult the game is for our agent, parameters that allowed us to design several training curriculums adapted for the case study.
In chapter 7 ‘Experimental Results’, we describe the hardware and software setup used to run the experiments. We then describe the experiments used as controls, which allow comparisons against training processes that use a curriculum, using the mean cumulative reward as the main metric. We conducted preliminary experiments with 3 curriculums, describing our initial findings, and then ran 20 additional experiments with curriculums classified by their number of lessons; their results are shown in this chapter.
In chapter 8 ‘Discussion’, we classify the curriculums according to their results when compared against the control experiment. We identify a few common patterns that could explain why some curriculums increase or decrease the agent's performance, affecting the training time.
1.4. Contributions
As a result of this work, we provide several resources that are available for academic use, including further research on curriculum learning applied to video games:
■ The video game used as the case study, which is based on a 3D learning environment
provided by the Unity ML-Agents Toolkit [40] Juliani et al. (2020).
■ The curriculums designed for the case study, specified in the format required by the Unity ML-
Agents Toolkit [40] Juliani et al. (2020).
■ The results of all experiments exported in .csv format, results that include performance metrics
we used in this work such as the mean cumulative reward.
■ The software setup required to run the experiments, specifically the set of Linux commands
required to install all the dependencies and the learning environment.
■ A paper that was submitted to the Iberoamerican Artificial Intelligence journal, presenting the
findings of this work.
2. Background
2.1. Reinforcement Learning
Machine learning is traditionally divided into three types of learning [20] Alpaydin (2010): supervised learning, unsupervised learning, and reinforcement learning. While supervised learning, in which intelligent agents are trained by example, has shown impressive results in a variety of domains, it requires a large amount of training data that often has to be curated by humans [7] Justesen et al. (2017), a condition that is not always met in the video game domain: sometimes there is no training data available (e.g., when playing an unknown game), or the available training data is insufficient and collecting more can be very labor-intensive and sometimes infeasible. In these cases, reinforcement learning methods are often applied [7] Justesen et al. (2017).
Reinforcement Learning (RL) [34] Bertsekas et al. (1996), [21] Sutton et al. (2018) is an attempt to formalize the idea of learning based on rewards and penalties [22] Wolfshaar (2017), in which an agent interacts with an environment, and its goal is to learn through this interaction a behavior policy that maximizes future rewards; this training cycle is depicted in Figure 2-1. How the environment reacts to a certain action is defined by a model that the agent usually does not know. The agent can be in one of a set of states ($s \in S$) and can take one of many actions ($a \in A$) to change from one state to another. Which state is reached is determined by transition probabilities between states ($P$); once an action is taken, the environment returns a reward ($r \in R$) as feedback. The model defines the reward function and the transition probabilities [43] Weng (2018).
Figure 2-1: Typical Reinforcement Learning training cycle. [19] Juliani (2017).
The policy $\pi(s)$ indicates the optimal action to take in a particular state to maximize the total reward, and each state is associated with a value function $V(s)$ that predicts the future rewards the agent can receive in that state. In RL the objective is to learn the policy $\pi(s)$ and value $V(s)$ functions. The interaction between agent and environment involves a sequence of actions and rewards in time $t = 1, 2, 3, \ldots, T$. Defining $S_t$, $A_t$, and $R_t$ as the state, action, and reward at time step $t$ respectively, an episode is defined as a sequence of states, actions, and rewards ending at a terminal state $S_T$: $S_1, A_1, R_2, S_2, A_2, \ldots, S_T$.
A transition is the act of going from the current state $s$ to the next state $s'$ to get a reward $r$, represented by a tuple $(s, a, s', r)$. The reward function $R$ predicts the next reward triggered by an action:

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$

The value function $V(s)$ indicates how rewarding a state is by predicting future rewards. The return is defined as the sum of discounted future rewards. The return $G_t$ starting from time $t$ is defined as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0, 1]$ penalizes future rewards because they may have higher uncertainty and do not provide immediate benefits.
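To make the return concrete, the following short Python sketch (ours, purely illustrative) computes $G_t$ for every time step of a finished episode, given its sequence of rewards and a discount factor:

from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Compute the return G_t = R_{t+1} + gamma*R_{t+2} + ... for each time step.

    rewards[k] is the reward received after the k-th action of the episode.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk the episode backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a sparse-reward episode where only the last step scores a goal (+1).
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))
# approximately [0.729, 0.81, 0.9, 1.0]: earlier steps see a discounted share of the final reward.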
The state-value of a state $s$ is defined as the expected return when being in that state at time $t$:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Similarly, the action-value of a state-action pair is the expected return when taking action $a$ in state $s$:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
The difference between the action-value and the state-value is the action advantage function (A-value):

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$

For the optimal policy $\pi_*$ we have $V_{\pi_*}(s) = V_*(s)$ and $Q_{\pi_*}(s, a) = Q_*(s, a)$. Memorizing $Q_*(s, a)$ values for all state-action pairs is usually computationally infeasible when the state and action spaces are large; in this case, function approximation (i.e., a machine learning model) is used: $Q(s, a; \theta)$ is a Q-value function with parameters $\theta$ used to approximate the Q-values. Deep Q-Networks (DQN) [44] Mnih et al. (2015) are a common function approximation technique that improves training results using two mechanisms:
■ Experience Replay: All steps $e_t = (S_t, A_t, R_t, S_{t+1})$ are stored in a replay memory $D_t = \{e_1, e_2, \ldots, e_t\}$. Samples are drawn at random from the replay memory, improving data efficiency, removing correlation in observation sequences, and smoothing changes in the data distribution.
■ Periodically Updated Target: The Q-network is cloned and kept frozen as the optimization target every $C$ steps, with $C$ as a hyperparameter. This makes training more resilient to short-term oscillations.
The parameters $\theta$ are optimized by minimizing the loss

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right]$$

where $U(D)$ is a uniform distribution over the replay memory $D$ and $\theta^{-}$ are the parameters of the frozen target Q-network [44] Mnih et al. (2015).
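The two DQN mechanisms can be summarized in a minimal Python sketch (our own illustration, not the implementation from [44] or from the toolkit): transitions are sampled uniformly from a replay memory, and the regression target is computed with a frozen copy of the Q-network.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

def dqn_targets(batch, q_target, gamma: float = 0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a sampled minibatch.

    q_target is the frozen copy of the Q-network (a callable returning a list of
    action values); in DQN it is refreshed from the online network every C steps.
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        y = reward if done else reward + gamma * max(q_target(next_state))
        targets.append(y)
    return targets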
In policy gradient methods, the policy itself is parameterized as $\pi(a \mid s, \theta)$ and the objective is to maximize

$$J(\theta) = \sum_{s \in S} d_{\pi_\theta}(s) V_{\pi_\theta}(s) = \sum_{s \in S} d_{\pi_\theta}(s) \sum_{a \in A} \pi(a \mid s, \theta) Q_\pi(s, a)$$

where $d_{\pi_\theta}(s)$ is the stationary distribution over states under $\pi_\theta$. Using gradient ascent, $\theta$ can be moved in the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the $\theta$ for which $\pi_\theta$ produces the highest return.
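As an illustration of how such a gradient can be estimated from sampled episodes, the sketch below (our own illustration, not the method used in this thesis) implements a REINFORCE-style estimator for a linear softmax policy over discrete actions:

import numpy as np

def softmax_policy(theta, state_features):
    """pi(a | s, theta) for a linear softmax policy; theta has shape (n_actions, n_features)."""
    logits = theta @ state_features
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_gradient(theta, episode, gamma: float = 0.99):
    """Estimate grad_theta J(theta) as sum_t G_t * grad log pi(a_t | s_t, theta).

    episode is a list of (state_features, action, reward) tuples from one rollout.
    """
    grad = np.zeros_like(theta)
    g = 0.0
    # Accumulate returns backwards, then weight the log-policy gradient of each step.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        probs = softmax_policy(theta, state)
        one_hot = np.zeros(len(probs))
        one_hot[action] = 1.0
        # grad log pi(a|s) for a linear softmax policy: outer(one_hot(a) - probs, state)
        grad += g * np.outer(one_hot - probs, state)
    return grad

# Gradient ascent step: theta <- theta + alpha * grad_J(theta)
# theta += 0.01 * reinforce_gradient(theta, episode)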
Getting good results with policy gradient methods is challenging because they are sensitive to the choice of step size: progress is slow if a small step size is chosen, and performance can collapse if a large step size is used [54] Schulman et al. (2017). Several policy gradient algorithms have been proposed in recent years, including the ones described in Chapter 4.
2.2. Perception-Action-Learning
A video game can easily be modeled as an environment in an RL setting, wherein agents have a finite set of actions that can be taken at each step, and their sequence of moves determines their success [7] Justesen et al. (2017). At any given moment the agent is in a certain state, from which it can take one of a set of actions. The value of a given state refers to how ultimately rewarding it is to be in that state. Taking an action in a state can bring the agent to a new state, provide a reward, or both; this is called the perception-action-learning loop [38] Arulkumaran et al. (2017), depicted in Figure 2-2.
At time $t$, the agent receives state $s_t$ from the environment. The agent uses its policy to choose an action $a_t$. Once the action is executed, the environment transitions one step, providing the next state $s_{t+1}$ as well as feedback in the form of a reward $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(s_t, a_t, s_{t+1}, r_{t+1})$, in order to learn and improve its policy [38] Arulkumaran et al. (2017).
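The loop can be summarized in a few lines of Python. The sketch below assumes a generic environment object with reset() and step() methods and an agent with act() and learn() methods; these names are illustrative and do not correspond to any specific API in the toolkit.

def perception_action_learning_loop(env, agent, num_episodes: int = 100):
    """Run the perception-action-learning cycle for a number of episodes."""
    for _ in range(num_episodes):
        state = env.reset()                              # initial observation s_0
        done = False
        while not done:
            action = agent.act(state)                    # policy: s_t -> a_t
            next_state, reward, done = env.step(action)  # environment transition
            agent.learn(state, action, next_state, reward)  # update from (s_t, a_t, s_{t+1}, r_{t+1})
            state = next_state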
The total cumulative reward is what every RL agent tries to maximize over time. Depending on the game mechanics, reward signals can be sent frequently, for instance every time an agent performs actions like killing an enemy, reaching a checkpoint, or picking up a health box, or they can be very sparse if the agent is rewarded only when it finds something valuable inside an extensive terrain or a complex maze.
2.3. Deep Learning in Video Games
[7] Justesen et al. (2017) list several deep learning techniques used for playing different video
game genres in a believable, entertaining, or human-like manner, focused on games that have
been used extensively as game AI research platforms.
Figure 2-3: Influence diagram of relevant deep learning techniques applied to commonly used
games for game AI research. [7] Justesen et al. (2017).
Figure 2-3 shows an influence diagram of these techniques, in which each node is an algorithm and its color represents the game platform. The distance from the center represents the date on which the original paper was published on arXiv.org. Each node points to all other nodes that used or modified that technique; the arrows represent how techniques are related, and arrows pointing to a particular algorithm show which algorithms influenced its design.
It is worth highlighting the use by [36] Wu et al. (2017) of Asynchronous Advantage Actor-Critic (A3C) [37] Mnih et al. (2016) plus Curriculum Learning (blue node at the middle right of the influence diagram) to train a vision-based agent called F1 to play First-Person Shooters (FPS), in particular Doom, to compete in the ViZDoom AI Competition [12] Kempka et al. (2016) hosted by the IEEE Conference on Computational Intelligence and Games. The agent was first trained on easier tasks (weak opponents, smaller maps) and gradually faced harder problems (stronger opponents, bigger maps); this agent won the 2016 edition of the competition [57] Wydmuch et al. (2018).
Some of these techniques assume full access to the game's data in the training phase; this scenario is called a Fully Observable Environment, which is useful for techniques that focus only on the learning process itself [25] Ortega et al. (2015). In other scenarios, the technique tries to infer a valid game model from raw pixel data and then learn how to play the inferred game model [26] Lample et al. (2016), [27] Adil et al. (2017); some even try to predict the current game frame from past frames and the actions performed by the agent [28] Wang et al. (2017). These scenarios are called Partially Observable Environments.
Figure 2-4 shows an example of a typical network architecture used in deep reinforcement learning for game-playing in partially observable environments. The input typically consists of a preprocessed screen image, or several concatenated images, followed by a couple of convolutional layers without pooling and a few fully connected layers [7] Justesen et al. (2017). Recurrent networks have a recurrent layer after the fully connected layers. The output layer usually consists of one unit for each unique combination of actions in the game, and for actor-critic methods such as A3C [37] Mnih et al. (2016) it also has one unit for the value function $V(s)$.
Figure 2-4: Typical network architecture used in deep reinforcement learning for game-playing.
[7] Justesen et al. (2017).
2.4. Curriculum Learning
An easy way to illustrate the idea behind curriculum learning is to think about the way arithmetic, algebra, and calculus are taught in a typical education system: arithmetic is taught before algebra and, likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide scaffolding for harder tasks in the future, according to [4] Juliani (2017). Figure 2-5 shows a schematic of a mathematics curriculum.
Figure 2-5: Example of a mathematics curriculum. Lessons progress from simpler topics to more
complex ones, with each building on the last. [4] Juliani (2017).
[2] Elman et al. (1993) state that the "starting small" strategy could make it possible for humans to learn what might otherwise prove to be unlearnable; however, according to [3] Harris (1991), some problems are best learned when the whole data set is available to the neural network from the beginning, otherwise it often fails to learn the correct generalization and remains stuck in a local minimum.
[1] Bengio et al. (2009) formalize the use of curriculums as a training strategy in the context of machine learning, calling it Curriculum Learning, and state that this strategy can have two effects: for convex criteria it can increase the speed of convergence of the training process to the minimum, and for non-convex criteria it can increase the quality of the local minima obtained.
[1] Bengio et al. (2009) hypothesize that a well-chosen curriculum strategy can act as a continuation method, which is a general strategy for global optimization of non-convex functions. This is especially useful for deep learning methods that attempt to learn feature hierarchies, where higher levels are formed by the composition of lower-level features, because training these deep architectures involves potentially intractable non-convex optimization problems.
[6] Allgower (2003) indicates how continuation methods address complex optimization problems by smoothing the original function, turning it into a different problem that is easier to optimize. By gradually reducing the amount of smoothing, it is possible to consider a sequence of optimization problems that converge to the optimization problem of interest. A visual representation of how this process works can be found in Figure 2-6.
Let $C_\lambda(\theta)$ be a single-parameter family of cost functions such that $C_0$ can be optimized easily (maybe convex in $\theta$), while $C_1$ is the criterion we actually want to minimize. $C_0(\theta)$ is minimized first, and then $\lambda$ is gradually increased while keeping $\theta$ at a local minimum of $C_\lambda(\theta)$. Typically, $C_0$ is a highly smoothed version of $C_1$, so that $\theta$ gradually moves into the basin of attraction of a dominant (if not global) minimum of $C_1$. Applying a continuation method to the problem of minimizing a training criterion involves a sequence of training criteria, starting from one that is easier to optimize and ending with the training criterion of interest.
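A toy numeric sketch of this idea (ours, purely illustrative): gradient descent is run on a sequence of progressively less-smoothed cost functions $C_\lambda$, reusing the minimizer found at each stage as the starting point of the next one.

import numpy as np

def nonconvex_cost(theta):
    """C_1: the actual non-convex criterion of interest."""
    return np.sin(5 * theta) + 0.5 * theta ** 2

def c_lambda(theta, lam):
    """Family C_lambda: a convex quadratic at lam = 0, the true criterion at lam = 1."""
    return (1 - lam) * 0.5 * theta ** 2 + lam * nonconvex_cost(theta)

def minimize(cost, theta0, lr=0.02, steps=500, eps=1e-4):
    """Plain finite-difference gradient descent on a 1-D cost function."""
    theta = theta0
    for _ in range(steps):
        grad = (cost(theta + eps) - cost(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

theta = 2.0                               # arbitrary starting point
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:   # easy problem first, true problem last
    theta = minimize(lambda th, l=lam: c_lambda(th, l), theta)  # warm-start each stage
print("continuation solution:", theta)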
Figure 2-6: A simplified visual representation of how a continuation method works, by defining a
sequence of optimization problems of increasing complexity, where the first ones are easy to
solve but only the last one corresponds to the actual problem of interest. [5] Gulcehre et al. (2019).
A curriculum can be seen as a sequence of training criteria, in which each criterion is associated with a different set of weights on the training examples or, more generally, with a reweighting of the training distribution. In the beginning the weights favor easier examples; each subsequent training criterion involves a slight change in the weighting of examples that increases the probability of sampling more difficult ones.
This idea can be formalized as follows: let $z$ be a random variable representing an example for the learner (for example, an $(x, y)$ pair in supervised learning). Let $P(z)$ be the target training distribution from which the learner should ultimately learn a function of interest.
Let $0 \le W_\lambda(z) \le 1$ be the weight applied to example $z$ at step $\lambda$ of the curriculum sequence, with $0 \le \lambda \le 1$ and $W_1(z) = 1$. The corresponding training distribution at step $\lambda$ is:

$$Q_\lambda(z) \propto W_\lambda(z) P(z) \quad \forall z$$

so that

$$Q_1(z) = P(z) \quad \forall z$$
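The reweighting can be sketched in a few lines of Python (our own illustration, using a hypothetical per-example difficulty score and the special case where $W_\lambda(z) \in \{0, 1\}$): at early curriculum steps only easy examples can be sampled, and at $\lambda = 1$ the sampling distribution matches the target distribution $P(z)$.

import random

def curriculum_weights(difficulties, lam):
    """W_lambda(z): keep only easy examples early on, admit everything when lam = 1."""
    return [1.0 if lam >= 1.0 or d <= lam else 0.0 for d in difficulties]

def sample_batch(examples, difficulties, lam, batch_size=4):
    """Draw a batch from Q_lambda(z) proportional to W_lambda(z) * P(z), with P uniform here."""
    weights = curriculum_weights(difficulties, lam)
    return random.choices(examples, weights=weights, k=batch_size)

examples = ["easy_1", "easy_2", "medium_1", "hard_1"]
difficulties = [0.1, 0.2, 0.5, 0.9]   # hypothetical difficulty scores in [0, 1]
print(sample_batch(examples, difficulties, lam=0.3))  # only easy examples can be drawn
print(sample_batch(examples, difficulties, lam=1.0))  # full target distribution P(z)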
[39] Narvekar et al. (2020) present Quick Chess as an example of how a curriculum can be used to teach a game using the "starting small" strategy. Quick Chess is a game designed to introduce children to the full game of chess through a sequence of progressively more difficult subgames. The first subgame is played on a 5x5 board with pawns only, which is useful for children to learn how pawns move, get promoted, and take other pieces. Figure 2-7 shows several subgames of Quick Chess.
Figure 2-7: Different subgames in the game of Quick Chess, which are used to form a curriculum
for learning the full game of Chess [39] Narvekar et al. (2020).
In the second subgame the King is added, which introduces a new objective: keeping the king alive. In each successive subgame, new elements are introduced (such as new pieces, a larger board, or different configurations) that require learning new skills and building upon the knowledge learned in previous subgames; the final subgame is the full game of chess [39] Narvekar et al. (2020).
3. Game Simulation Platform
According to [7] Justesen et al. (2017), the increasing use of deep learning methods in video games is due to the practice of comparing game-playing algorithm results on publicly available game simulation platforms, in which algorithms are ranked on their ability to score points or win games.
A diagram of a typical deep reinforcement learning model running on a game simulation platform is shown in Figure 3-1. Input is taken from the game environment and meaningful features are extracted automatically; the RL agent produces actions based on these features, making the environment transition to the next state [35] Shao et al. (2019).
Figure 3-1: Typical deep reinforcement learning model [35] Shao et al. (2019).
Several game simulation platforms listed by [18] Juliani et al. (2020), [35] Shao et al. (2019) and
[7] Justesen et al. (2017) are described below:
■ Arcade Learning Environment (ALE) [8] Bellemare et al. (2013) is a free, open-source software framework for interfacing with hundreds of games for the Atari 2600, a second-generation home video game console originally released in 1977 and sold for over a decade [9] Montfort et al. (2009). ALE allows the evaluation of domain-independent AI algorithms by providing an interface to game environments that present research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. ALE allows users to send joystick motions, send and receive RAM data, and provides a game-handling layer that transforms the running game into a standard reinforcement learning problem by identifying the accumulated score and indicating whether the game has ended [8] Bellemare et al. (2013).
■ Retro Learning Environment (RLE) [10] Bhonker et al. (2017) is an open-source software environment that allows intelligent agents to be trained to play games for the Super Nintendo Entertainment System (SNES), Nintendo 64 (N64), Game Boy, Sega Genesis, Sega Saturn, Dreamcast, and PlayStation. RLE supports multi-agent reinforcement learning (MARL) [11] Buşoniu et al. (2010) tasks, which are particularly useful for training and evaluating agents that compete against each other rather than against a pre-configured in-game AI. RLE separates the learning environment from the emulator through an interface called LibRetro that allows communication between front-end programs and game console emulators, allowing it to support over 15 game consoles, each containing hundreds of games, for an estimated total of 7,000 games [10] Bhonker et al. (2017).
■ OpenNERO [17] Karpov et al. (2008) is a general-purpose open-source game platform designed for research and education in game AI. The project is based on the Neuro-Evolving Robotic Operatives (NERO) game developed by graduate and undergraduate students at the Neural Networks Research Group and the Department of Computer Science at the University of Texas at Austin. OpenNERO has a client-server architecture; features 3D graphics, physics simulation, 3D audio rendering, and networking; and has been used for planning, natural language processing, multi-agent systems, reinforcement learning, and evolutionary computation. OpenNERO provides an API to implement machine learning tasks, environments, and agents, and includes an extensible collection of ready-to-use AI algorithms such as value function reinforcement learning, heuristic search, and neuroevolution [17] Karpov et al. (2008).
21
■ ViZDoom [12] Kempka et al. (2016) is an AI research platform for visual reinforcement learning based on the classic first-person shooter (FPS) video game Doom, a semi-realistic 3D world that can be observed from a first-person perspective. ViZDoom allows developing agents that play Doom using the screen buffer as input; these agents have to perceive, interpret, and learn the 3D world in order to make tactical and strategic decisions such as where to go and how to act. ViZDoom allows users to define custom training scenarios that differ by maps, environment elements, non-player characters, rewards, goals, and actions available to the agent.
■ DeepMind Lab [13] Beattie et al. (2016) is a first-person 3D game platform designed for research and development of general artificial intelligence and machine learning systems, which can be used to study how pixels-to-actions autonomous intelligent agents learn complex tasks in large, partially observed, and visually diverse worlds. DeepMind Lab was built on top of Quake III Arena, created by id Software in 1999, a game that features rich science-fiction-style 3D visuals and more naturalistic physics. DeepMind Lab allows agents to learn tasks including navigating mazes, traversing dangerous passages and avoiding falling off cliffs, bouncing through space using launch pads to move between platforms, collecting items, and learning and remembering randomly, procedurally generated environments [13] Beattie et al. (2016).
■ Project Malmo [14] Johnson et al. (2016) is an AI experimentation platform built on top of the
popular video game Minecraft, a platform designed to support fundamental research in
artificial general intelligence (AGI) and related areas including robotics, computer vision,
reinforcement learning, planning, and multi-agent systems, by providing a rich, structured and
dynamic 3D environment with complex dynamics. Project Malmo is heavily focused on the
research of flexible AI that can learn to perform well on a wide range of tasks, similar to the
kind of flexible learning seen in humans and other animals, in contrast to most AI approaches
that are mainly designed to perform narrow tasks [14] Johnson et al. (2016).
■ TorchCraft [15] Synnaeve et al. (2016) is a library that enables deep learning research on
real-time strategy games such as StarCraft: Brood War, a popular real-time strategy (RTS)
video game published in 1998 by Blizzard Entertainment. RTS games have been a domain of
interest for the planning and decision-making research communities [16] Silva et al. (2017),
since they aim to simulate the control of multiple units in a military setting at different scales
and levels of complexity, normally on a fixed-size 2D map. The agents being trained have to collect resources and create buildings and military units to fight and destroy their opponents. TorchCraft
serves as a low-level bridge between the game and Torch, a scientific computing framework
for LuaJIT, by dynamically injecting code to the game engine that hosts the game [15]
Synnaeve et al. (2016).
■ Unity [18] Juliani et al. (2020) is a 3D real-time cross-platform video game development
platform, featuring high-quality rendering and physics simulation, created by Unity
Technologies. Unity is not restricted to any specific genre of gameplay or simulation; this flexibility makes possible the creation of tasks ranging from simple 2D grid world problems to complex 3D strategy games, physics-based puzzles, and multi-agent competitive games. Unity
provides an open-source framework called Unity Machine Learning Agents Toolkit (Unity ML-
Agents Toolkit) [19] Juliani (2017), [18] Juliani et al. (2020), [40] Juliani et al. (2020), a game
AI framework that enables 2D, 3D, VR/AR games and simulations to serve as learning
environments for training intelligent agents. Agents can be trained using reinforcement
learning, curriculum learning, imitation learning, neuroevolution, and other machine learning
methods [33] Mattar et al. (2020).
Figure 3-2: Taxonomy of game simulation platforms based on the flexibility of environment
specification according to [18] Juliani et al. (2020).
This taxonomy classifies the platforms into four categories:
■ Single Environment: Platforms that act as a black box from an agent's perspective and usually run only a specific game.
■ Environment Suite: Several learning environments packaged together to benchmark the performance of a game AI technique across different games.
■ Domain-Specific Platform: Allows the creation of a set of tasks within a specific domain such as first-person navigation, car racing, or human locomotion.
■ General Platform: Platforms that allow creating custom learning environments with arbitrarily complex visuals, physical and social interactions, and configurable tasks.
According to [18] Juliani et al. (2020), Unity is the only platform that has the complexity, flexibility, and computational properties expected from a general game simulation platform for game AI. It is also the only one that provides out-of-the-box support for both reinforcement learning and curriculum learning, and it is not constrained to a specific game or learning environment. Taking all these reasons into consideration, we decided to choose Unity and its ML-Agents Toolkit [18] Juliani et al. (2020) as the simulation platform to run our experiments.
The main components of the Unity ML-Agents Toolkit are the following [40] Juliani et al. (2020):
■ Learning Environment: Unity scene that contains all game characters and the environment in
which agents can observe, act, and learn. Contains agents and behaviors.
■ Agents: Unity component that is attached to a Unity game object (any character within a Unity Scene); it generates the agent's observations, performs the actions it receives, and assigns a reward (positive or negative) when appropriate. Each agent is linked to one behavior.
■ Behaviors: Define specific attributes of the agent, such as the number of actions the agent can take. A behavior can be thought of as a function that receives observations and rewards from the Agent and returns actions. Behaviors can be of one of three types:
■ Learning Behavior: Behavior that is not, yet, defined but about to be trained.
■ Heuristic Behavior: Behavior defined by a hard-coded set of rules written in code.
■ Inference Behavior: Behavior that includes a trained Neural Network file (.nn), after a
learning behavior is trained, it becomes an inference behavior.
■ Python low-level API: Python API for interacting with and manipulating a learning environment. It is not part of Unity; it lives outside and communicates with Unity through the External Communicator.
■ External Communicator: Connects the learning environment with the Python low-level API.
■ Python Trainers: Contains all the machine learning algorithms that enable training agents.
Every learning environment has at least one agent, and each agent must be linked to one behavior; it is possible for agents that have similar observations and actions to share the same behavior [40] Juliani et al. (2020). A schematic showing the main components of the Unity ML-Agents Toolkit is shown in Figure 3-3.
Figure 3-3: High-level components diagram of Unity ML-Agents Toolkit [40] Juliani et al. (2020).
The Unity ML-Agents Toolkit [18] Juliani et al. (2020) provides different scenarios to train agents
[40] Juliani et al. (2020):
■ Single-Agent: This is the traditional way of training agents, used in most cases, only one
reward signal is used.
■ Simultaneous Single-Agent: A parallelized version of the first scenario, with several independent agents that have individual reward signals and the same behavior parameters. This scenario can speed up the training process.
■ Adversarial Self-Play: In this case there are two interacting agents with inverse reward signals, a scenario especially designed for two-player games.
■ Cooperative Multi-Agent: Multiple agents working together to accomplish a task that cannot
be done alone, all of them share the reward signal, and could have the same or different
behavior parameters. Tower defense video games fit into this scenario.
■ Competitive Multi-Agent: Multiple agents interacting and competing with each other to either win a competition or obtain a set of limited resources. Agents use inverse reward signals and could have the same or different behavior parameters. This scenario works well for team sports video games.
■ Ecosystem: Multiple interacting agents with independent reward signals; this scenario is adequate for open-world video games such as Grand Theft Auto V [32] Rockstar Games (2020), which have several kinds of characters, each one with its own goals. Autonomous driving simulation within an urban environment can also be represented by this scenario.
The training scenario we found most convenient for our experiments is the Simultaneous Single-Agent scenario, since it allows us to train multiple instances of the same agent to reduce the duration of the training process.
4. Reinforcement Learning Algorithms
The Unity ML-Agents Toolkit [18] Juliani et al. (2020) provides an implementation based on
TensorFlow of two state-of-the-art reinforcement learning algorithms that we found appropriate to
train our agents:
Proximal Policy Optimization (PPO) is an on-policy algorithm that has been shown to be more general-purpose and stable than many other reinforcement learning algorithms. Soft Actor-Critic (SAC), on the other hand, is off-policy, which means it can learn from any past experiences; these are collected in an experience replay buffer and randomly drawn during training. We provide a formal description of each algorithm below.
PPO updates the policy via $\theta_{k+1} = \arg\max_\theta \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right]$, typically taking multiple steps of stochastic gradient descent (usually on minibatches) to maximize the objective function. $L$ is given by:

$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ \operatorname{clip}\!\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\ 1 - \epsilon,\ 1 + \epsilon \right) A^{\pi_{\theta_k}}(s, a) \right)$$

where $\epsilon$ is a hyperparameter that represents how far away the new policy is allowed to go from the old policy, $\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}$ is the ratio of the probability of the action under the new and old policies, respectively, and $A^{\pi_{\theta_k}}$ is the estimated advantage. There is a simplified version of the equation above:

$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ g\!\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right) \right)$$

where:

$$g(\epsilon, A) = \begin{cases} (1 + \epsilon)\,A & A \geq 0 \\ (1 - \epsilon)\,A & A < 0 \end{cases}$$
Clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter $\epsilon$ corresponds to how far away the new policy can go from the old one while still profiting the objective.
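For illustration, a minimal NumPy sketch of the clipped surrogate objective (ours, not the toolkit's implementation; in practice the probability ratio is computed from log-probabilities produced by the policy network):

import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, epsilon: float = 0.2):
    """Mean clipped surrogate objective L over a minibatch.

    new_logp / old_logp: log pi_theta(a|s) and log pi_theta_k(a|s) for sampled (s, a).
    advantages: estimated advantages A^{pi_theta_k}(s, a).
    """
    ratio = np.exp(new_logp - old_logp)                 # pi_theta(a|s) / pi_theta_k(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # maximized with respect to theta in practice

# Example minibatch: the large ratio (1.5) is clipped to 1.2 before weighting its advantage.
print(ppo_clip_objective(np.log([1.5, 0.9]), np.log([1.0, 1.0]), np.array([1.0, -0.5])))  # 0.375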
SAC learns a policy $\pi_\theta$ and two Q-functions $Q_{\phi_1}$, $Q_{\phi_2}$. There are two main variants of SAC: one that uses a fixed entropy regularization coefficient $\alpha$, and another that enforces an entropy constraint by varying $\alpha$ over the course of training.
Let $x$ be a random variable with probability mass or density function $P$. The entropy $H$ of $x$ is computed from its distribution $P$ according to:

$$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)]$$

In entropy-regularized reinforcement learning the agent gets a bonus reward at each time step proportional to the entropy of the policy at that time step. This changes the RL problem to:

$$\pi^* = \arg\max_\pi\ \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot \mid s_t)\big) \right) \right]$$
where $\alpha > 0$ is the trade-off coefficient, assuming an infinite-horizon discounted setting. For this new problem, a new value function $V^\pi$ is defined to include the entropy bonus from every time step:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot \mid s_t)\big) \right) \,\middle|\, s_0 = s \right]$$

$Q^\pi$ is changed to include the entropy bonuses from every time step except the first:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\big(\pi(\cdot \mid s_t)\big) \,\middle|\, s_0 = s, a_0 = a \right]$$

The Q-functions are learned by minimizing the mean squared error against a shared target:

$$L(\phi_i, D) = \mathbb{E}_{(s, a, r, s', d) \sim D}\left[ \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2 \right]$$

$$y(r, s', d) = r + \gamma (1 - d) \left( \min_{j = 1, 2} Q_{\phi_{\text{targ}, j}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s') \right), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$
The entropy regularization coefficient 𝛼 explicitly controls the explore-exploit trade-off, with higher
𝛼 corresponding to more exploration and lower 𝛼 corresponding to more exploitation.
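To illustrate the target computation, a small Python sketch (ours; q_target_1, q_target_2, and policy are placeholders for the target Q-networks and the policy, not functions from the toolkit):

def sac_target(reward, next_state, done, q_target_1, q_target_2, policy,
               gamma: float = 0.99, alpha: float = 0.2):
    """Compute y(r, s', d) = r + gamma*(1 - d)*(min_j Q_targ_j(s', a') - alpha*log pi(a'|s')).

    policy(next_state) is assumed to return a sampled action a' and its log-probability.
    """
    if done:
        return reward                      # no bootstrapping past a terminal state
    next_action, log_prob = policy(next_state)
    min_q = min(q_target_1(next_state, next_action), q_target_2(next_state, next_action))
    return reward + gamma * (min_q - alpha * log_prob)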
An implementation in pseudocode of Soft Actor-Critic is shown in Figure 4-2.
Figure 4-2: Pseudocode for Soft Actor-Critic [42] Haarnoja et al. (2018)
5. Hyperparameters Description
To run experiments using the Unity ML-Agents Toolkit [18] Juliani et al. (2020) a
trainer_config.yaml configuration file must be provided containing all the hyperparameters
and training configurations. There are some parameters that are specific to either Proximal Policy
Optimization (PPO) or Soft Actor-Critic (SAC), while some others are common to both algorithms.
The toolkit gives recommended values for most of the parameters of the config file.
All configuration parameters provided by the toolkit are listed and described below:
■ sequence_length: Indicates how long the sequences of experiences must be while training. If this value is too small, the agent will not be able to remember things over longer periods of time; if it is too large, the neural network will take longer to train. Recommended values for PPO and SAC: between 4 and 128.
■ strength: Factor by which to multiply the rewards coming from the environment (extrinsic rewards). Recommended value: 1.0.
■ gamma: Discount factor for future rewards coming from the environment; it represents how far into the future the agent should care about possible rewards. Must be strictly smaller than 1. In situations where the agent should act in the present in order to prepare for rewards in the distant future, this value should be large; in cases where rewards are more immediate, it should be smaller. Recommended values for PPO and SAC: between 0.8 and 0.995.
Contents of trainer_config.yaml file for experiments using Proximal Policy Optimization:
# Proximal Policy Optimization
SoccerAcademy:
    # Common parameters
    trainer: ppo
    summary_freq: 10000
    batch_size: 5120
    buffer_size: 512000
    hidden_units: 512
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 100000000
    normalize: false
    num_layers: 3
    time_horizon: 1024
    use_recurrent: false
    memory_size: 128
    sequence_length: 128
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
    # PPO hyperparameters
    beta: 0.005
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
Contents of trainer_config.yaml file for experiments using Soft Actor-Critic:
# Soft Actor-Critic
SoccerAcademy:
    # Common parameters
    trainer: sac
    summary_freq: 10000
    batch_size: 512
    buffer_size: 512000
    hidden_units: 512
    learning_rate: 0.0003
    learning_rate_schedule: constant
    max_steps: 100000000
    normalize: false
    num_layers: 3
    time_horizon: 1024
    use_recurrent: false
    memory_size: 128
    sequence_length: 128
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
    # SAC hyperparameters
    buffer_init_steps: 5000
    init_entcoef: 0.5
    save_replay_buffer: false
    tau: 0.005
    steps_per_update: 5
6. Case Study: Toy Soccer Game
Figure 6-1: Screenshot of learning environment Soccer Twos [40] Juliani et al. (2020).
Since the environment is a simplification of real soccer, there are elements that are not present: for example, there are no other players (including goalkeepers), no team manager, and no referee. There are also soccer rules that do not apply, such as getting a yellow or red card for fouls, or special ways of restarting the game like penalty kicks or corner kicks.
We constrained the environment to 15,000 iterations, also called frames or simulation steps; if no agent is able to score a goal before the iteration limit is reached, a draw is declared and the match finishes. A modern six-core CPU usually executes between 60 and 120 frames per second, which means a match could last from 125 to 250 seconds if no agent scores a goal. The scenario includes a trained reinforcement learning model that can be used by both agents; in our case we used this model on the red agent only. All our training experiments were run only on the blue agent, with the red agent serving only as the opponent to defeat.
Figure 6-4 shows all the objects that can be detected by the agent.
Figure 6-4: Objects that can be detected by the agent: top-left: Soccer ball, top-center: Agents of
the same team as the agent in training, top-right: Agents of the opposite team, bottom-left: Walls
of the soccer field, bottom-center: Agent’s own goal, bottom-right: Opposite team’s goal.
For each type of object, the x, y, and z coordinates plus the distance to the object are detected. In total, the agent has an observation space vector of 336 variables per iteration. Since the game can run at 60 to 120 frames per second, each variable is updated 60 to 120 times per second. This is how the agent perceives its surroundings.
The agent controls its movement through three action values:
■ Frontal motion: If this value is greater than 0 the agent moves forward, if the value is less than
zero it moves backward, and if the value is equal to zero no frontal motion is done.
■ Lateral motion: If this value is greater than 0 the agent moves to its right, if the value is less
than zero it moves to its left, and if the value is equal to zero no lateral motion is done.
■ Rotation: If this value is greater than 0 the agent rotates around its Y-axis clockwise, if the
value is less than zero it rotates around its Y-axis counterclockwise, if the value is equal to
zero no rotation is done.
The maximum lateral or frontal speed is constrained to 2 meters per second. Figure 6-5 shows the 3 actions the agent can use to move inside the soccer field.
Figure 6-5: Possible movements that the agent can perform: left: lateral motion, center: frontal
motion, right: rotation around its Y-axis.
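As a rough illustration only (a hypothetical sketch, not the environment's actual implementation), the three action values could be interpreted as follows:

def interpret_action(frontal: float, lateral: float, rotation: float):
    """Map the sign of each of the three action values to a movement description."""
    moves = []
    if frontal > 0:
        moves.append("move forward")
    elif frontal < 0:
        moves.append("move backward")
    if lateral > 0:
        moves.append("strafe right")
    elif lateral < 0:
        moves.append("strafe left")
    if rotation > 0:
        moves.append("rotate clockwise around Y")
    elif rotation < 0:
        moves.append("rotate counterclockwise around Y")
    return moves or ["stay still"]

print(interpret_action(frontal=1.0, lateral=0.0, rotation=-1.0))
# ['move forward', 'rotate counterclockwise around Y']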
The main method of measuring the agent's performance over multiple soccer matches is the mean cumulative reward, which tells us, on average, whether the agent is scoring goals or not. A mean value close to 1 means that our agent is winning almost every match, outperforming its opponent; a value close to -1 means it is losing almost every time; and a value close to 0 means that most matches are ending in a draw, with neither agent scoring goals, which means both have a low game performance. If the mean value is close to 0.5, it means that both agents are able to score goals at similar rates.
We do not have information about the trained model that the red agent uses: we do not know which reinforcement learning technique was used, or which learning environment configuration and hyperparameters were set. However, it is safe to assume that it was trained using one of the state-of-the-art reinforcement learning algorithms included in the toolkit, with an optimal set of hyperparameters, since the chosen learning environment was designed to serve as a challenge for testing new ML algorithms and techniques.
We wanted to see how the mean cumulative reward behaves if we use the same trained model on both the blue and red agents and make them compete against each other. It turned out that the mean cumulative reward stays close to 0.5 over time, meaning both agents perform well and score goals at similar rates, each one having a 50% chance of winning the game.
In this context it is safe to assume that if one of our experiments ends with a mean cumulative reward tending to 0.5, the training was successful and the agent is performing at least as well as the red agent, which uses the trained model. Since the blue agent faces an opponent that is already trained to play in an optimal way, it is expected that at the beginning of the experiments the mean cumulative reward starts at zero, meaning the blue agent is being outperformed by the red agent. Figure 6-6 shows the expected behavior of the mean cumulative reward values over time for the blue agent in a successful training lesson.
Figure 6-6: Expected behavior of mean cumulative reward values over time for the blue agent in
a successful training lesson.
What we wanted to evaluate in this work is whether, by applying the curriculum learning technique to the training process, the mean cumulative reward gets closer to 0.5 faster than without a curriculum; in other words, whether this technique makes the agent learn faster.
The difficulty our agent faces at the beginning of the training is linked directly to the performance of the opponent agent: the weaker the opponent, the easier it is for our agent to learn. It therefore makes sense to design training curriculums that alter the opponent’s performance to give our agent an initial advantage over it.
We adapted the case study to support the variation of two environment parameters that alter the opponent’s behavior, and one that alters the reward signal:
■ opponent_exist: At the beginning of the training the blue agent needs to learn how to move inside the pitch and how to rotate and move towards the ball, but if the red agent is present this learning process is interrupted, because the red agent, already trained to do so efficiently, will score a goal almost immediately. In this context, we want to be able to remove the opponent from the field while our agent is mastering how to move efficiently towards the ball and push it towards the opponent's goal. After the agent has learned how to play alone in the field, we want to restore the opponent, so our agent can start learning how to defeat it.
■ opponent_speed: We believe modifying the movement speed of the red agent gives the blue agent a clear advantage over its opponent: the slower the red agent is, the more opportunities the blue agent has to learn how to score goals, since it can move towards the ball faster than the red agent. As soon as our agent learns how to defeat a slow version of its opponent, we want to increase the opponent’s speed so it becomes a harder player to overcome.
■ ball_touch_reward: We want to give an additional reward to our agent every single time it touches the ball, because we wonder if this incentivizes our agent to move toward the ball faster. This approach carries an important risk: it can teach the agent how to seek the ball very quickly, but not necessarily how to score goals. It also has an additional problem: the mean cumulative reward values are altered by the additional reward, which invalidates any performance comparison we may want to make against another agent’s cumulative reward values.
For the opponent_exist parameter, a value of 1 means the opponent exists and a value of 0 means it does not. For the opponent_speed parameter, a value of 0 means the opponent cannot move; since the maximum movement speed of all agents is 2 meters per second, any greater value of opponent_speed is rounded down to 2.
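A minimal Python sketch of how we interpret these two parameter values is shown below; the function name is hypothetical and this is not the environment’s actual code.

# Illustrative sketch only: translating the two opponent-related curriculum
# parameters into opponent settings, including the cap at 2 m/s.
MAX_AGENT_SPEED = 2.0  # meters per second

def apply_opponent_parameters(opponent_exist: float, opponent_speed: float):
    """Return (is_present, speed) for the red agent given the lesson's parameters."""
    is_present = opponent_exist >= 1.0                       # 1 -> on the field, 0 -> removed
    speed = min(max(opponent_speed, 0.0), MAX_AGENT_SPEED)   # values above 2 are capped at 2
    return is_present, speed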
For each curriculum we want to experiment with, a curriculum.yaml file must be provided, containing the definition of which of these parameters vary over time and the points at which these variations occur, called lesson thresholds. Below is an example of a curriculum file that illustrates how a curriculum is defined:
# Curriculum example
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1]
  parameters:
    opponent_speed: [0.0, 2.0]
In this example, the curriculum sets the opponent’s movement speed to 0 meters per second at the beginning of the training; then, when 10% of the total matches have been played, the speed is increased to 2 meters per second. It is worth mentioning that a curriculum can control one, two, or all three environment parameters at the same time. Also, the example has only one lesson threshold, but we can define an arbitrary number of thresholds if we want.
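The sketch below shows how we understand the thresholds to select the active lesson as training progresses. It ignores min_lesson_length and signal_smoothing and is not the toolkit’s implementation; the function name is hypothetical.

# Illustrative sketch only: mapping training progress to the parameter values
# of the currently active lesson (min_lesson_length and signal_smoothing ignored).
from typing import Dict, List

def current_parameters(progress: float,
                       thresholds: List[float],
                       parameters: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the environment parameter values for the lesson active at `progress`."""
    lesson = sum(1 for t in thresholds if progress >= t)   # thresholds already passed
    return {name: values[lesson] for name, values in parameters.items()}

# With the curriculum shown above:
current_parameters(0.05, [0.1], {"opponent_speed": [0.0, 2.0]})  # {'opponent_speed': 0.0}
current_parameters(0.25, [0.1], {"opponent_speed": [0.0, 2.0]})  # {'opponent_speed': 2.0}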
7. Experimental Results
7.1. Experimental Setup
We used Google Cloud Platform to execute all our experiments, specifically Google Compute Engine (GCE), an Infrastructure as a Service (IaaS) solution built on the same infrastructure that powers Google’s own applications, including Gmail and YouTube. GCE enables its users to launch Virtual Machines (VMs) on demand that can be accessed using Secure Shell (SSH). We activated 2 virtual machines running Fedora 10, a popular Linux distribution. Each virtual machine had a virtual 8-core Intel CPU, 16 GB of RAM, and 10 GB of disk space.
Scatter plots showing the mean cumulative reward over the 100 million matches for each algorithm are shown in Figures 7-1 and 7-2. A Gaussian filter was applied to reduce noise, and we drew a green dotted line at the 0.5 reward value to help visualize where the cumulative reward should be after running all the soccer matches for the training to be considered successful:
Figure 7-1: Mean cumulative reward for Proximal Policy Optimization over 100 million matches.
Figure 7-2: Mean cumulative reward for Soft Actor-Critic over 100 million soccer matches.
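The smoothing and the reference line used in these plots can be reproduced with a short Python script like the one below; the sigma value and the file name are assumptions, since the text only states that a Gaussian filter was applied to reduce noise.

# Illustrative sketch only: smoothing the raw reward series and drawing the
# 0.5 reference line. The sigma and the CSV file name are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d

raw = np.loadtxt("mean_cumulative_reward.csv", delimiter=",")  # hypothetical export
smoothed = gaussian_filter1d(raw, sigma=50)

plt.scatter(np.arange(len(smoothed)), smoothed, s=1)
plt.axhline(0.5, color="green", linestyle=":")   # target value for a successful training
plt.xlabel("soccer matches")
plt.ylabel("mean cumulative reward")
plt.show()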
In the experiment with Proximal Policy Optimization (PPO), we got a result similar to our initial hypothesis about how the mean cumulative reward values should vary over time in a successful training session. On the other hand, with Soft Actor-Critic (SAC) we got results that indicate poor agent performance, since the mean values tend to -1.
With curriculum A we wanted to test whether our agent learns faster by removing the opponent during the first 10% of the 100 million soccer matches. After that the opponent is present but with limited movement speed: first no speed at all, then speed increased linearly in increments of 0.25 meters per second every 10% of the total matches until reaching full speed at 90% of the training.
# Curriculum A
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds:
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
  parameters:
    opponent_speed:
      [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
    opponent_exist:
      [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
In Figures 7-3 and 7-4 the mean cumulative reward results for curriculum A are plotted in blue, alongside the PPO and SAC control experiments for comparison. The vertical dotted lines in blue represent the curriculum thresholds, where one or more environment parameters were changed; the dotted line is wider where the curriculum no longer has any effect, in this case after 90% of the matches.
Figure 7-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum A over 100 million soccer matches.
Figure 7-4: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum A over 100 million soccer matches.
In the scatter plot for Proximal Policy Optimization only vs Proximal Policy Optimization with Curriculum A, we can notice that our agent performs well while the opponent is absent or its movement speed is reduced, but starting from 50% of the matches performance drops every time the opponent’s movement speed is increased, and after 90% the agent is slightly outperformed by the control case. When we train the agent using PPO alone, we get a mean cumulative reward around 0.45 at the end of the training, but using the curriculum we get around 0.35, which means the curriculum is hurting performance.
In the case of Soft Actor-Critic only vs Soft Actor-Critic with Curriculum A, it is evident that the curriculum has a positive impact only on the first 10% of the training; the agent learns very quickly how to score goals when the opponent is absent, but after 20% it performs the same as, or even worse than, the control case. From 90% of the training onward there is no noticeable difference in the agent’s performance between the curriculum and the control; both perform very badly.
For curriculum B, in addition to the default reward signal established previously, we wanted to give our agent a bonus reward for touching the ball, a reward that is reduced over time. In the first 10% of the 100 million soccer matches we give a reward of 0.3, then we reduce the reward by 0.1 every 10% of the total matches until there is no additional reward from 30% of the training onward.
# Curriculum B
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1, 0.2, 0.3]
  parameters:
    ball_touch_reward: [0.3, 0.2, 0.1, 0.0]
Figures 7-5 and 7-6 show the mean cumulative reward results for curriculum B compared to the control experiments.
Figure 7-5: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum B over 100 million soccer matches.
Figure 7-6: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum B over 100 million soccer matches.
In the experiment using curriculum B with Soft Actor-Critic, we get results similar to those obtained with curriculum A: a peak of good performance in the first 10% of the soccer matches, followed by poor performance overall. On the other hand, with Proximal Policy Optimization we get better performance throughout, not only while curriculum B is active but even after 30% of the training when it no longer is. The blue line crosses the 0.5 cumulative reward limit at around 60% of training, clearly outperforming the agent trained in the control experiment, whose line does not cross the 0.5 limit, although it gets very close to it.
The importance of this experiment is that, in cases like this one, we can get the same or even better training results in fewer soccer matches: only 60 million matches are required to reach good performance using curriculum B, while we estimate around 120 million would be required to get the same results using PPO alone.
It is important to emphasize that the cumulative reward values before 30% of the training are altered by the additional reward we give the agent for touching the ball, so in the range from 0% to 30% the cumulative reward cannot be considered a valid benchmark of the agent’s performance and cannot be used for comparison.
After getting these results we decided to combine curriculums A and B into a single one, to test whether their effects on the training process combine in a way that yields even better results than each one separately. We named this combination curriculum A+B, and its results are shown in Figures 7-7 and 7-8:
# Curriculum A+B
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds:
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
  parameters:
    opponent_speed:
      [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
    opponent_exist:
      [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    ball_touch_reward:
      [0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Figure 7-7: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum A+B over 100 million soccer matches.
Figure 7-8: Mean cumulative reward for Soft Actor-Critic control experiment vs Soft Actor-Critic
and Curriculum A+B over 100 million soccer matches.
In the case of Proximal Policy Optimization, the results when using curriculum A+B are worse than the ones obtained with curriculums A and B independently: we get a mean cumulative reward under 0 at the end of the training, which means the agent has not been able to learn how to score goals in a consistent way. In the case of Soft Actor-Critic, the performance is very low, not only in the control experiment but also when using a curriculum, so at this point we decided not to continue using SAC.
Since we got good performance in one of the experiments that had 3 lessons, we decided to run more experiments, not only with 3 but also with 4 lessons, each one having a different combination of environment parameters and thresholds. Later we tested curriculums with 5, 6, 1, and 9 lessons.
In the following pages the curriculums we designed with 3 and 4 lessons are presented, using only Proximal Policy Optimization (PPO) as the reinforcement learning algorithm and the mean cumulative reward as the performance measure. All of them are compared against the PPO control experiment. We assigned a letter to each experiment in random order, so the letters have no special meaning, and the order in which we present the experiment results is not relevant either.
Contents of the curriculum.yaml file for curriculum D are shown below:
# Curriculum D
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1, 0.2, 0.3, 0.4]
  parameters:
    opponent_speed: [0.0, 0.5, 1.0, 1.5, 2.0]
Figure 7-9: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum D over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum G are shown below:
# Curriculum G
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1, 0.2, 0.3, 0.4]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-10: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum G over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum H are shown below:
# Curriculum H
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-11: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum H over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum P are shown below:
# Curriculum P
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.25, 1.25, 1.75, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-12: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum P over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum Q are shown below:
# Curriculum Q
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.06, 0.10, 0.22, 0.34]
  parameters:
    opponent_speed: [0.0, 0.25, 1.25, 1.75, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-13: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Q over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum R are shown below:
# Curriculum R
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.06, 0.14, 0.26, 0.40]
  parameters:
    opponent_speed: [0.0, 0.50, 1.25, 1.75, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-14: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum R over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum U are shown below:
# Curriculum U
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.15, 0.30, 0.45]
  parameters:
    ball_touch_reward: [0.5, 0.25, 0.1, 0.0]
Figure 7-15: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum U over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum W are shown below:
# Curriculum W
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
    ball_touch_reward: [0.3, 0.2, 0.1, 0.0, 0.0]
Figure 7-16: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum W over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum X are shown below:
# Curriculum X
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0]
    ball_touch_reward: [0.2, 0.1, 0.0, 0.0, 0.0]
Figure 7-17: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum X over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum Y are shown below:
# Curriculum Y
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    ball_touch_reward: [0.2, 0.1, 0.0, 0.0, 0.0]
Figure 7-18: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Y over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum Z are shown below:
# Curriculum Z
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.07, 0.15, 0.30, 0.45]
  parameters:
    opponent_speed: [0.0, 0.0, 1.0, 1.5, 2.0]
    ball_touch_reward: [0.3, 0.2, 0.1, 0.0, 0.0]
Figure 7-19: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum Z over 100 million soccer matches.
7.5. Curriculums with 5 and 6 Lessons
Contents of the curriculum.yaml file for curriculum C are shown below:
# Curriculum C
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1, 0.2, 0.3, 0.4, 0.5]
  parameters:
    opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-20: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum C over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum J are shown below:
# Curriculum J
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.10, 0.15, 0.25, 0.35, 0.45, 0.55]
  parameters:
    opponent_speed:
      [0.0, 0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
    opponent_exist:
      [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-21: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum J over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum M are shown below:
# Curriculum M
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.08, 0.16, 0.24, 0.32, 0.40]
  parameters:
    opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-22: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum M over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum N are shown below:
# Curriculum N
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.08, 0.12, 0.20, 0.28, 0.36]
  parameters:
    opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-23: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum N over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum O are shown below:
# Curriculum O
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.06, 0.10, 0.16, 0.24, 0.32]
  parameters:
    opponent_speed: [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
    opponent_exist: [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-24: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum O over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum V are shown below:
# Curriculum V
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1, 0.2, 0.3, 0.4, 0.5]
  parameters:
    ball_touch_reward: [0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
Figure 7-25: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum V over 100 million soccer matches.
7.6. Curriculums with 1 and 9 Lessons
Despite the fact that we got bad performance when we used a 9-lesson curriculum in our preliminary experiments, we considered it pertinent to run a few more 9-lesson experiments to see if it was possible to get better performance. We also wanted to test 1-lesson curriculums, an edge case that was not considered in our preliminary experiments. Contents of the curriculum.yaml file for curriculum E are shown below:
# Curriculum E
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1]
  parameters:
    opponent_speed: [0.0, 2.0]
    opponent_exist: [0.0, 1.0]
Figure 7-26: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum E over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum F are shown below:
# Curriculum F
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds: [0.1]
  parameters:
    opponent_speed: [0.0, 2.0]
Figure 7-27: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum F over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum K are shown below:
# Curriculum K
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds:
    [0.08, 0.14, 0.20, 0.26, 0.32, 0.38, 0.44, 0.50, 0.56]
  parameters:
    opponent_speed:
      [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
    opponent_exist:
      [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-28: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum K over 100 million soccer matches.
Contents of the curriculum.yaml file for curriculum L are shown below:
# Curriculum L
SoccerAcademy:
  measure: progress
  min_lesson_length: 100
  signal_smoothing: true
  thresholds:
    [0.08, 0.12, 0.16, 0.20, 0.24, 0.30, 0.36, 0.42, 0.48]
  parameters:
    opponent_speed:
      [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
    opponent_exist:
      [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Figure 7-29: Mean cumulative reward for Proximal Policy Optimization control experiment vs
Proximal Policy Optimization and Curriculum L over 100 million soccer matches.
8. Discussion
After completing all experiments, we grouped the curriculums according to their performance relative to the Proximal Policy Optimization (PPO) control experiment, so we can analyze each group as a cluster and identify common patterns:
Figure 8-1: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had lower performance than the control experiment at the end of 100
million soccer matches.
■ Curriculums F and E belong to this group; both have a single threshold at 10% of the training, at which the opponent’s movement speed jumps from 0 to 2 meters per second. It seems that this sudden change in velocity does not give the agent enough time to learn how to play.
■ Curriculums D and G have 4 thresholds, at 10%, 20%, 30%, and 40% of the training; at the last 3 thresholds the opponent’s movement speed is incremented linearly, adding 0.5 meters per second each time. This result suggests that a linear increment in velocity at the end of the curriculum does not provide good performance results.
■ Curriculums W and Z use both the ball_touch_reward and opponent_speed environment parameters to encourage the agent to learn, but this is not a good strategy: the agent seems to overfit, learning how to move towards the ball quickly while the opponent is slow, but not necessarily how to score goals efficiently.
■ Curriculums A and A+B have large gaps between thresholds, in both cases of 10% of the training. This scenario does not seem to be good for the agent, because it plays under the initial favorable environment conditions for too long, making it unable to adapt to new, harder environment conditions.
Most of the curriculums in this group do not have a consistent increase rate in the opponent’s movement speed, and the distribution of their thresholds does not follow a pattern. The results indicate that this is not a good strategy in terms of the agent’s performance.
In the case of curriculum V, we give an additional reward to our agent, starting at 0.5 at the beginning of the training and decreasing it by 0.1 every 10% of the training. The poor performance of this curriculum suggests that the agent learns how to reach the ball quickly, since the reward for this action is large, but is then unable to learn how to push it towards the opponent’s goal.
Figure 8-2: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had the same performance as the control experiment at the end of
100 million soccer matches.
■ Curriculums B and U suggest that giving an additional reward to the agent can be beneficial, but only if the reward decreases rapidly during the training, to prevent the agent from learning to rely solely on touching the ball to get a reward.
■ Several curriculums in this group suggest that it is important for the agent to have a strong advantage over its opponent, but only over a short period of time. In most of these curriculums the opponent is not present for the first 10% of the training, giving the agent the opportunity to learn how to move inside the pitch and how to move the soccer ball around; then the opponent’s movement speed is increased, not linearly but in a logarithmic-like fashion, which prevents the agent from overfitting to the initial conditions.
Figure 8-3: Mean cumulative reward for Proximal Policy Optimization control experiment vs all
curriculum experiments that had higher performance than the control experiment at the end of
100 million soccer matches.
Here we list the times at which the mean cumulative reward reaches the optimal value of 0.5:
Figure 8-4: Mean cumulative reward for Proximal Policy Optimization control experiment vs the
best 4 curriculum experiments compared over 100 million soccer matches.
9. Conclusions
Several experiments with curriculum learning were executed to measure its effects over the
training process of an agent that is learning to play a video game using reinforcement learning.
The initial hypothesis is that, in some cases, using a curriculum would allow the agent to learn
faster. Our results indicate that using curriculum learning could have a significant impact on the
learning process, in some cases helping the agent to learn faster, overperforming the control
experiment, in other cases having a negative impact, making the agent to learn slower.
In this work 24 curriculums were designed, each one having a different configuration of learning environment parameters and thresholds. We were able to infer several patterns from the training results that could indicate whether a curriculum will improve or hurt the agent’s performance, measured by the mean cumulative reward. In 12 experiments we got better performance than the control experiment, and the best curriculum saved up to 40% of the training time required to reach an optimal performance level.
In this work we used only two algorithms in our experiments, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC); it would be interesting to try more reinforcement learning algorithms, or even supervised learning approaches, to test our main hypothesis. In the case of Soft Actor-Critic, a process of hyperparameter optimization is needed to see whether it can work well with the case study we chose.
We used a learning environment included in the Unity ML-Agents Toolkit [18] Juliani et al. (2020) as the case study: Soccer Twos, a toy soccer video game. It would be interesting to test the curriculum learning technique on a wider variety of video games, including 2D and 3D, first-person and third-person games of multiple genres such as strategy, adventure, role-playing, and puzzle, to see whether this technique has a bigger impact on a particular genre.
10. References
[1] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning.
Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 41–48.
https://dl.acm.org/doi/10.1145/1553374.1553380
[2] Elman, J. L. (1993). Learning and development in neural networks: The importance of
starting small. Cognition, 48, 71–99. https://doi.org/10.1016/S0010-0277(02)00106-3
[3] Harris, C. (1991). Parallel distributed processing models and metaphors for language and
development. Ph.D. dissertation, University of California, San Diego.
https://elibrary.ru/item.asp?id=5839109
[4] Juliani, Arthur. (2017, December 8). Introducing ML-Agents Toolkit v0.2: Curriculum
Learning, new environments, and more. https://blogs.unity3d.com/2017/12/08/introducing-ml-
agents-v0-2-curriculum-learning-new-environments-and-more/
[5] Gulcehre, C., Moczulski, M., Visin, F., & Bengio, Y. (2019). Mollifying networks. 5th
International Conference on Learning Representations, ICLR 2017 - Conference Track
Proceedings. http://arxiv.org/abs/1608.04980
[6] Allgower, E. L., & Georg, K. (2003). Introduction to numerical continuation methods. In
Classics in Applied Mathematics (Vol. 45). Colorado State University.
https://doi.org/10.1137/1.9780898719154
[7] Justesen, N., Bontrager, P., Togelius, J., & Risi, S. (2017). Deep Learning for Video Game
Playing. IEEE Transactions on Games, 12(1), 1–20. https://doi.org/10.1109/tg.2019.2896986
[8] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning
environment: An evaluation platform for general agents. IJCAI International Joint Conference on
Artificial Intelligence, 2013, 4148–4152. https://doi.org/10.1613/jair.3912
[9] Montfort, N., & Bogost, I. (2009). Racing the beam: The Atari video computer system. MIT
Press, Cambridge Massachusetts.
https://pdfs.semanticscholar.org/2e91/086740f228934e05c3de97f01bc58368d313.pdf
[10] Bhonker, N., Rozenberg, S., & Hubara, I. (2017). Playing SNES in the Retro Learning
Environment. https://arxiv.org/pdf/1611.02205.pdf
[11] Buşoniu, L., Babuška, R., & De Schutter, B. (2010). Multi-agent reinforcement learning:
An overview. Studies in Computational Intelligence, 310, 183–221. https://doi.org/10.1007/978-3-
642-14435-6_7
[12] Kempka, M., Wydmuch, M., Runc, G., Toczek, J., & Jaskowski, W. (2016). ViZDoom: A
Doom-based AI research platform for visual reinforcement learning. IEEE Conference on
Computational Intelligence and Games, CIG, 0. https://doi.org/10.1109/CIG.2016.7860433
[13] Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq,
A., Green, S., Valdés, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A.,
Bolton, A., Gaffney, S., King, H., Hassabis, D., … Petersen, S. (2016). DeepMind Lab.
https://arxiv.org/pdf/1612.03801.pdf
[14] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The malmo platform for
artificial intelligence experimentation. Twenty-Fifth International Joint Conference on Artificial
Intelligence (IJCAI-16), 2016-January, 4246–4247. http://stella.sourceforge.net/
[15] Synnaeve, G., Nardelli, N., Auvolat, A., Chintala, S., Lacroix, T., Lin, Z., Richoux, F., &
Usunier, N. (2016). TorchCraft: a Library for Machine Learning Research on Real-Time Strategy
Games. https://arxiv.org/pdf/1611.00625.pdf
[16] Silva, V. do N., & Chaimowicz, L. (2017). MOBA: a New Arena for Game AI.
https://arxiv.org/pdf/1705.10443.pdf
[17] Karpov, I. V., Sheblak, J., & Miikkulainen, R. (2008). OpenNERO: A game platform for AI
research and education. Proceedings of the 4th Artificial Intelligence and Interactive Digital
Entertainment Conference, AIIDE 2008, 220–221.
https://www.aaai.org/Papers/AIIDE/2008/AIIDE08-038.pdf
[18] Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y.,
Henry, H., Mattar, M., & Lange, D. (2020). Unity: A General Platform for Intelligent Agents.
https://arxiv.org/pdf/1809.02627.pdf
[19] Juliani, A. (2017). Introducing: Unity Machine Learning Agents Toolkit.
https://blogs.unity3d.com/2017/09/19/introducing-unity-machine-learning-agents/
[20] Alpaydin, E. (2010). Introduction to Machine Learning. In Massachusetts Institute of
Technology (Second Edition). The MIT Press.
https://kkpatel7.files.wordpress.com/2015/04/alppaydin_machinelearning_2010
[21] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second
Edition). The MIT Press. http://incompleteideas.net/sutton/book/RLbook2018.pdf
[22] Wolfshaar, J. Van De. (2017). Deep Reinforcement Learning of Video Games [University
of Groningen, The Netherlands].
http://fse.studenttheses.ub.rug.nl/15851/1/Artificial_Intelligence_Deep_R_1.pdf
[23] Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence.
Minds and Machines, 17(4), 391–444. https://doi.org/10.1007/s11023-007-9079-x
[24] Schaul, T., Togelius, J., & Schmidhuber, J. (2011). Measuring Intelligence through
Games. https://arxiv.org/pdf/1109.1314.pdf
[25] Ortega, D. B., & Alonso, J. B. (2015). Machine Learning Applied to Pac-Man [Barcelona
School of Informatics]. https://upcommons.upc.edu/bitstream/handle/2099.1/26448/108745.pdf
[26] Lample, G., & Chaplot, D. S. (2016). Playing FPS Games with Deep Reinforcement
Learning. https://arxiv.org/pdf/1609.05521.pdf
[27] Adil, K., Jiang, F., Liu, S., Grigorev, A., Gupta, B. B., & Rho, S. (2017). Training an Agent
for FPS Doom Game using Visual Reinforcement Learning and VizDoom. In (IJACSA)
International Journal of Advanced Computer Science and Applications (Vol. 8, Issue 12).
https://pdfs.semanticscholar.org/74c3/5bb13e71cdd8b5a553a7e65d9ed125ce958e.pdf
[28] Wang, E., Kosson, A., & Mu, T. (2017). Deep Action Conditional Neural Network for Frame
Prediction in Atari Games. http://cs231n.stanford.edu/reports/2017/pdfs/602.pdf
[29] Karttunen, J., Kanervisto, A., Kyrki, V., & Hautamäki, V. (2020). From Video Game to Real
Robot: The Transfer between Action Spaces. 5. https://arxiv.org/pdf/1905.00741.pdf
[30] Martinez, M., Sitawarin, C., Finch, K., Meincke, L., Yablonski, A., & Kornhauser, A. (2017).
Beyond Grand Theft Auto V for Training, Testing and Enhancing Deep Learning in Self Driving
Cars [Princeton University]. https://arxiv.org/pdf/1712.01397.pdf
[31] Singh, S., Barto, A. G., & Chentanez, N. (2005). Intrinsically Motivated Reinforcement
Learning. http://www.cs.cornell.edu/~helou/IMRL.pdf
[32] Rockstar Games. (2020). https://www.rockstargames.com/
[33] Mattar, M., Shih, J., Berges, V.-P., Elion, C., & Goy, C. (2020). Announcing ML-Agents
Unity Package v1.0! Unity Blog. https://blogs.unity3d.com/2020/05/12/announcing-ml-agents-
unity-package-v1-0/
[34] Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. In Encyclopedia of
Optimization. Springer US. https://doi.org/10.1007/978-0-387-74759-0_440
[35] Shao, K., Tang, Z., Zhu, Y., Li, N., & Zhao, D. (2019). A Survey of Deep Reinforcement
Learning in Video Games. https://arxiv.org/pdf/1912.10944.pdf
[36] Wu, Y., & Tian, Y. (2017). Training agent for first-person shooter game with actor-critic
curriculum learning. ICLR 2017, 10. https://openreview.net/pdf?id=Hk3mPK5gg
[37] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., &
Kavukcuoglu, K. (2016, February 4). Asynchronous Methods for Deep Reinforcement Learning.
33rd International Conference on Machine Learning. https://arxiv.org/pdf/1602.01783.pdf
[38] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017, August 19). A
Brief Survey of Deep Reinforcement Learning. IEEE Signal Processing Magazine.
https://doi.org/10.1109/MSP.2017.2743240
[39] Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., & Stone, P. (2020).
Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey.
https://arxiv.org/pdf/2003.04960.pdf
[40] Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y.,
Henry, H., Mattar, M., & Lange, D. (2020). Unity ML-Agents Toolkit. https://github.com/Unity-
Technologies/ml-agents
[41] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy
Optimization Algorithms. https://arxiv.org/pdf/1707.06347.pdf
[42] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy
Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
https://arxiv.org/pdf/1801.01290.pdf
[43] Weng, L. (2018). A (Long) Peek into Reinforcement Learning. Lil Log.
https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
[44] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540), 529–533.
https://doi.org/10.1038/nature14236
[45] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014).
Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on
Machine Learning. https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
[46] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra,
D. (2016, September 9). Continuous control with deep reinforcement learning. ICLR 2016.
https://arxiv.org/pdf/1509.02971.pdf
[47] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Tb, D., Muldal,
A., Heess, N., & Lillicrap, T. (2018). Distributed distributional deterministic policy gradients. ICLR
2018. https://openreview.net/pdf?id=SyZipzbCb
[48] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015, February 19). Trust
Region Policy Optimization. Proceeding of the 31st International Conference on Machine
Learning. https://arxiv.org/pdf/1502.05477.pdf
[49] Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N.
(2017, November 3). Sample Efficient Actor-Critic with Experience Replay. ICLR 2017.
https://arxiv.org/pdf/1611.01224.pdf
[50] Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region method
for deep reinforcement learning using Kronecker-factored approximation.
https://arxiv.org/pdf/1708.05144.pdf
[51] Fujimoto, S., van Hoof, H., & Meger, D. (2018, February 26). Addressing Function
Approximation Error in Actor-Critic Methods. Proceedings of the 35th International Conference on
Machine Learning. https://arxiv.org/pdf/1802.09477.pdf
[52] Liu, Y., Ramachandran, P., Liu, Q., & Peng, J. (2017). Stein Variational Policy Gradient.
https://arxiv.org/pdf/1704.02399.pdf
[53] Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu,
V., Harley, T., Dunning, I., Legg, S., & Kavukcuoglu, K. (2018). IMPALA: Scalable Distributed
Deep-RL with Importance Weighted Actor-Learner Architectures.
https://arxiv.org/pdf/1802.01561.pdf
[54] Schulman, J., Klimov, O., Wolski, F., Dhariwal, P., & Radford, A. (2017). Proximal Policy
Optimization. https://openai.com/blog/openai-baselines-ppo/
[55] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H.,
Gupta, A., Abbeel, P., & Levine, S. (2019). Soft Actor-Critic Algorithms and Applications.
https://arxiv.org/pdf/1812.05905.pdf
[56] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation,
9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[57] Wydmuch, M., Kempka, M., & Jaskowski, W. (2018). ViZDoom Competitions: Playing
Doom from Pixels. IEEE Transactions on Games, 11(3), 248–259.
https://doi.org/10.1109/tg.2018.2877047