AC Project
Intelligent Systems
Laboratory activity 2018-2019
Contents
1 Introduction
1.1 Overview
1.2 Main functionalities
1.3 Installing the tool and running it
3 Project Details
3.1 What will the system do
3.2 Narrative description
3.3 Facts
3.4 Specifications
3.5 Top level design of the scenario
3.6 Knowledge acquisition
3.7 Related work
Chapter 1
Introduction
1.1 Overview
1. PyTorch is an open-source machine learning library for Python, based on Torch, used for applications such as natural language processing. It is primarily developed by Facebook's artificial-intelligence research group, and Uber's "Pyro" probabilistic programming language is built on it.
2. Using Anaconda, run the commands conda install pytorch -c soumith, pip install gym[all], and pip install numpy.
3. Our example outputs, at the end of each episode, the average reward for that episode; additionally, every episode is animated for a better understanding of the algorithm.
Chapter 2
(a) First the actor picks an action based on its policy and gets feedback from the critic; the chosen action influences the environment and thus the reward for that action. An episode consists of 10000 actions, and at the end of each action the policies of the actor and the critic are updated, unlike standard reinforcement learning methods, where the update occurs at the end of the episode.
(b) The longer the agent keeps the pendulum up, the more it is rewarded, so, as we can see from the episodes above, at the beginning the actor does not perform well, as expected, but after 500 episodes it manages to keep the pendulum up for quite some time.
(c) Pseudo code:
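The pseudo code itself did not survive extraction, so the per-action update described in (a) is sketched below. This is a minimal tabular actor-critic on a toy two-state problem, not the project's actual PyTorch implementation; the environment, learning rates, and state/action layout are all assumptions for illustration only.

```python
import random
import math

# Toy stand-in for the real environment (the project uses a gym pendulum).
# Two states, two actions; action 1 taken in state 0 yields reward 1.
def step(state, action):
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    next_state = (state + action) % 2
    return next_state, reward

N_STATES, N_ACTIONS = 2, 2
ALPHA_ACTOR, ALPHA_CRITIC, GAMMA = 0.1, 0.1, 0.9   # assumed hyperparameters

prefs = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor: action preferences
values = [0.0] * N_STATES                             # critic: state values

def policy(state):
    # Softmax over the actor's preferences for this state.
    exps = [math.exp(p) for p in prefs[state]]
    total = sum(exps)
    return [e / total for e in exps]

def pick_action(state):
    r, acc = random.random(), 0.0
    probs = policy(state)
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a
    return len(probs) - 1

random.seed(0)
state = 0
for _ in range(5000):
    action = pick_action(state)
    next_state, reward = step(state, action)
    # TD error from the critic: how much better/worse than expected.
    td_error = reward + GAMMA * values[next_state] - values[state]
    # Update after EVERY action, as described in (a), not at episode end.
    values[state] += ALPHA_CRITIC * td_error
    prefs[state][action] += ALPHA_ACTOR * td_error
    state = next_state
```

After the loop, the actor's policy should clearly prefer the rewarded action 1 in state 0, mirroring on a small scale what the pendulum agent learns over 500 episodes.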
2. As for structure, the program is split into five files (modules): main, buffer, train, models, and utils.
2.3 Example analysis
1. As stated above, the program's output is the reward for each episode and an animation of the decisions taken by the actor. For the analysis we will concentrate only on the reward and not on the animation.
2. Episodes up to 10:
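To make the reward-only analysis concrete, the per-episode averages can be computed as sketched below; the reward values here are made up for illustration and are not the program's actual output.

```python
def average_reward(rewards):
    # Mean reward over an episode's individual action rewards.
    return sum(rewards) / len(rewards)

# Hypothetical per-action rewards for three short episodes.
episodes = [
    [-1.0, -0.8, -0.9],   # early episode: pendulum mostly down
    [-0.5, -0.2, -0.4],
    [-0.1, 0.0, -0.1],    # later episode: pendulum mostly up
]

# The per-episode averages should increase as the agent learns.
averages = [average_reward(ep) for ep in episodes]
```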
Chapter 3
Project Details
3.3 Facts
The facts of the scenario are that we need to implement new policies for the actor and critic
methods and make the connection with the enviornment as in the current state, the program
crashes on start when we give it the racetrack.
3.4 Specifications
When starting the program, the user will first see the animated window in which the car will try
to move along the track, for the first few episodes the agent won’t be able to perform very well,
but as in the pendulum enviornment, given time, it will learn to control the car and eventually,
move really fast along the track.
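The intended agent-environment interaction can be sketched as below. Since the actual connection to the racetrack environment is exactly the part that still crashes, a dummy class with the usual gym-style reset/step interface stands in for it, and every name and the reward scheme here are assumptions.

```python
class DummyTrackEnv:
    """Placeholder with a gym-style interface; NOT the real racetrack env."""

    def __init__(self, track_length=10):
        self.track_length = track_length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position          # observation: distance along the track

    def step(self, action):
        # action: 1 = accelerate forward, 0 = do nothing (assumed encoding).
        self.position += action
        reward = 1.0 if action == 1 else -0.1   # reward forward progress
        done = self.position >= self.track_length
        return self.position, reward, done

def run_episode(env, pick_action):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(pick_action(obs))
        total += reward
    return total

# With a trivial always-accelerate policy, the episode reward equals
# the track length (one unit of reward per forward step).
total = run_episode(DummyTrackEnv(), lambda obs: 1)
```

Once the real environment no longer crashes, the dummy class would be replaced by it, and pick_action by the trained actor's policy.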
3.6 Knowledge acquisition
The project I chose requires me to take inspiration from other applications based on pytorch,
and to take into account on how those programs shape their policies. It also requires me to
understand how to connect with the tool and to understand each type of action it uses so the
reward is increased.