Artificial Intelligence With Python Cookbook
Proven recipes for applying AI algorithms and deep learning techniques using
TensorFlow 2.x and PyTorch 1.6
Ben Auffarth
BIRMINGHAM - MUMBAI
Artificial Intelligence with Python
Cookbook
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means,
without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the
information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its
dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by
the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78913-396-7
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and
videos, as well as industry leading tools to help you plan your personal
development and advance your career. For more information, please visit our
website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and
videos from over 4,000 industry professionals
Improve your learning with skill plans tailored especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.packt.com and, as a print book customer, you are entitled to a discount on
the eBook copy. Get in touch with us at customercare@packtpub.com for more
details.
At www.packt.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters, and receive exclusive discounts and offers on
Packt books and eBooks.
Contributors
About the author
Ben Auffarth is a full-stack data scientist with more than 15 years of work
experience. With a background and Ph.D. in computational and cognitive
neuroscience, he has designed and conducted wet lab experiments on cell
cultures, analyzed experiments with terabytes of data, run brain models on IBM
supercomputers with up to 64k cores, built production systems processing
hundreds of thousands of transactions per day, and trained neural networks on
millions of text documents. He resides in West London with his family, where
you might find him in a playground with his young son. He co-founded and is
the former president of Data Science Speakers, London.
I am deeply grateful to the editors at Packt, who provided practical help and competent advice, and to
everyone who has been close to me and supported me, especially my partner Diane. This book is dedicated
Every time we figure out a piece of it, it stops being magical; we say, "Oh, that's just a computation." We
(Cited from Kahn, Jennifer (March 2002). It's Alive, in Wired, 10(03):
https://www.wired.com/2002/03/everywhere/)
AI has made huge strides, especially over the last few years with the arrival of
powerful hardware, such as Graphics Processing Units (GPUs) and now
Tensor Processing Units (TPUs), that can facilitate more powerful models,
such as deep learning models with hundreds of thousands, millions, or even
billions of parameters. These models perform better and better on benchmarks,
often reaching human or even super-human levels. Excitingly for anyone
involved in the field, some of these models, trained for many thousands of hours
that would be worth hundreds of thousands of dollars if run on Amazon Web
Services (AWS), are available for download to play with and extend.
Game        Year AI reached top human level    Legal states (powers of 10)
Chess       1997                               46
Scrabble    2006
Shogi       2017                               71
Go          2016                               172
Please refer to the Wikipedia article Progress in Artificial Intelligence for more
information. You can see, for a series of games of varying complexity (as per the
third column, showing legal states in powers of 10), when AI reached the level
of top human players. More generally, you can find out more about state-of-the-
art performances in different disciplines on a dedicated website:
https://paperswithcode.com/sota.
It is therefore more timely than ever to look at and learn to use the state-of-the-
art methods in AI, and this is what this book is about. You'll find carefully
chosen recipes that will help you refresh your knowledge and bring you up to
date with cutting-edge algorithms.
If you are looking to build AI solutions for work or even for your hobby
projects, you will find this cookbook useful. With the help of easy-to-follow
recipes, this book will take you through the AI algorithms required to build
smart models for problem solving. By the end of this book, you'll be able to
identify an AI approach for solving applied problems, implement and test
algorithms, and deal with model versioning, reports, and monitoring.
Who this book is for
This AI machine learning book is for Python developers, data scientists, machine
learning engineers, and deep learning practitioners who want to learn how to
build artificial intelligence solutions with easy-to-follow recipes. You’ll also find
this book useful if you’re looking for state-of-the-art solutions to perform
different machine learning tasks in various use cases. Basic working knowledge
of the Python programming language and machine learning concepts will help
you to work with code effectively in this book.
What this book covers
Chapter 1, Getting Started with Artificial Intelligence in Python, describes a
basic setup with Python for data crunching and AI. We'll perform data loading in
pandas, plotting, and writing first models in scikit-learn and Keras. Since data
preparation is such a time-consuming activity, we will present state-of-the-art
techniques to facilitate this activity.
Chapter 8, Working with Moving Images, starts with image detection on a video
feed and then creates videos using a deep fake model.
Some of the software and libraries most prominently covered in this book are
listed in the following table:
If you are using the digital version of this book, we advise you to type the
code yourself or access the code via the GitHub repository (link available in
the next section). Doing so will help you avoid any potential errors related to
the copying and pasting of code.
You can download the example code files for this book from GitHub at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook. In case there's an update to the code, it will be updated on the
existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos
available at https://github.com/PacktPublishing/. Check them out!
Conventions used
CodeInText: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. Here is an example: "Here is the simplified code for
RangeTransformer."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
Bold: Indicates a new term, an important word, or words that you see on screen.
For example, words in menus or dialog boxes appear in the text like this. Here is
an example: "Select System info from the Administration panel."
Sections
In this book, you will find several headings that appear frequently (Getting
ready, How to do it..., How it works..., There's more..., and See also).
Getting ready
This section tells you what to expect in the recipe and describes how to set up
any software or any preliminary settings required for the recipe.
How to do it…
How it works…
There's more…
See also
This section provides helpful links to other useful information for the recipe.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention
the book title in the subject of your message and email us at
customercare@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we would
be grateful if you would report this to us. Please visit
www.packtpub.com/support/errata, select your book, click on the Errata
Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the
internet, we would be grateful if you would provide us with the location address
or website name. Please contact us at copyright@packt.com with a link to the
material.
If you are interested in becoming an author: If there is a topic that you have
expertise in, and you are interested in either writing or contributing to a book,
please visit authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a
review on the site that you purchased it from? Potential readers can then see and
use your unbiased opinion to make purchase decisions, we at Packt can
understand what you think about our products, and our authors can see your
feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Getting Started with Artificial
Intelligence in Python
In this chapter, we'll start by setting up a Jupyter environment to run our
experiments and algorithms in, we'll get into different nifty Python and Jupyter
hacks for artificial intelligence (AI), we'll do a toy example in scikit-learn,
Keras, and PyTorch, and then a slightly more elaborate example in Keras to
round things off. This chapter is largely introductory, and a lot of what we see in
this chapter will be built on in subsequent chapters as we get into more advanced
applications.
Technical requirements
You really should have a GPU available in order to run some of the recipes in
this book, or you would be better off using Google Colab. There are some extra
steps required to make sure you have the correct NVIDIA graphics drivers
installed, along with some additional libraries. Google provides up-to-date
instructions on the TensorFlow website at
https://www.tensorflow.org/install/gpu. Similarly, PyTorch versions have
minimum requirements for NVIDIA driver versions (which you'd have to check
manually for each PyTorch version). Let's see how to use dockerized
environments to help set this up.
You can find the recipes in this chapter in the GitHub repository of this book at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook.
Please note that, although we'll be focusing on Jupyter Notebook, or Google Colab,
which runs Jupyter notebooks in the cloud, there are a few functionally similar
alternatives, such as JupyterLab or an IDE connected to a remote interpreter. Jupyter
Notebook is still, however, the most popular (and probably the best supported) choice.
In this recipe, we will make sure we have a working Python environment with
the software libraries that we need throughout this book. We'll be dealing with
installing relevant Python libraries for working with AI, and we'll set up a
Jupyter Notebook server.
Getting ready
There are two main options:
- You use one of the services that host interactive notebooks, such as Google Colab.
- You install Python libraries on your own machine(s).
In Python, a module is a Python file that contains functions, variables, or classes. A package is a
collection of modules within the same path. A library is a collection of related functionality, often in the
form of different packages or modules. Informally, it's quite common to refer to a Python library as a
package as well.
How to do it...
In the first case, we won't need to set up anything on our server as we'll only be
installing a few additional libraries. In the second case, we'll be installing an
environment with the Anaconda distribution, and we'll be looking at setup
options for Jupyter.
In both cases, we'll have an interactive Python notebook available through which
we'll be running most of our experiments.
- Run Colab with local kernels. This means you use the Colab interface but the
  models compute on your own computer
  (https://research.google.com/colaboratory/local-runtimes.html).
- Install Jupyter Notebook yourself and don't use Google Colab.
For Google Colab, just go to https://colab.research.google.com/, and sign in with
your Google credentials. In the following section, we'll deal with hosting
notebooks on your own machine(s).
In Google Colab, you can save and re-load your models to and from the remote disk on
Google servers. From there you can either download the models to your own computer
or synchronize with Google Drive. The Colab GUI provides many useful code snippets
for these use cases. Here's how to download files from Colab:
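As a minimal sketch (the filename is hypothetical; the google.colab helper module only exists inside Colab):

```python
from google.colab import files

# download a file from the Colab runtime to your local machine
files.download('my_model.h5')
```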
Anaconda is a Python distribution that comes with its own package installer and
environment manager, called conda. This makes it easier to keep your libraries
up to date and it handles system dependency management as well as Python
dependency management. We'll mention a few alternatives to Anaconda/conda
later; for now, we will quickly go through instructions for a local install. In the
online material, you'll find instructions that will show how to serve similar
installations to other people across a team, for example, in a company using a
dockerized setup, which helps manage the setup of a machine or a set of
machines across a network with a Python environment for AI.
If you have your computer already set up, and you are familiar with conda and pip, feel free to skip ahead.
For the Anaconda installation, we will need to download an installer and then
choose a few settings:
For macOS and Windows, you also have the choice of a graphical
installer. This is all well explained in the Anaconda documentation;
however, we'll just quickly go through the terminal installation.
You need to read and confirm the license agreement. You can do this by
pressing the spacebar until you see the question asking you to agree. You
need to press Y and then Enter.
At the end, you can decide if you want to run the conda init routine.
This will set up the PATH variables on your terminal, so when you type
python, pip, conda, or jupyter, the conda versions will take precedence
over any other installed version on your computer.
If you see something like the following, then you know you are using the
right Python runtime:
If you don't see the correct path, you might have to run the following:
This will set up your environment variables, including PATH. On
Windows, you'd have to check your PATH variable.
You should see the Jupyter Notebook server starting up. As a part of this
information, a URL for login is printed to the screen.
If you run this from a server that you access over the network, make sure you use a
terminal multiplexer such as GNU screen or tmux so that your Jupyter Notebook
server keeps running after you disconnect.
We'll use many libraries in this book such as pandas, NumPy, scikit-
learn, TensorFlow, Keras, PyTorch, Dash, Matplotlib, and others, so
we'll be installing lots as we go through the recipes. This will often look
like the following:
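A representative example (the exact list of packages varies from recipe to recipe); run it from a terminal, or prefix it with ! inside a notebook cell:

```
pip install scikit-learn pandas numpy matplotlib tensorflow-gpu torch
```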
Please note that for the tensorflow-gpu library, you need to have a GPU
available and ready to use. If not, change this to tensorflow (that is,
without -gpu).
This should use the pip binary that comes with Anaconda and run it to
install the preceding libraries. Please note that Keras is part of the
TensorFlow library.
Well done! You've successfully set up your computer for working with
the many exciting recipes to come.
How it works...
Conda is an environment and package manager. Like many other libraries that
we will use throughout this book, and like the Python language itself, conda is
open source, so we can always find out exactly what an algorithm does and
easily modify it. Conda is also cross-platform and not only supports Python but
also R and other languages.
Package management can present many vexing challenges and, if you've been
around for some time, you will probably remember spending many hours on
issues such as conflicting dependencies or re-compiling packages and fixing
paths – and you might be lucky if it's only that.
There are hundreds of dedicated channels that you can use with conda. These are
sub-repositories that can contain hundreds or thousands of different packages.
Some of them are maintained by companies that develop specific libraries or
software.
For example, you can install the pytorch package from the PyTorch channel as
follows:
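For instance (a sketch; check the PyTorch website for the command that matches your CUDA version):

```
conda install -c pytorch pytorch torchvision
```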
It's tempting to enable many channels in order to get the bleeding edge technology for
everything. There's one catch, however, with this. If you enable many channels, or
channels that are very big, conda's dependency resolution can become very slow. So be
careful with using many additional channels, especially if they contain a lot of libraries.
There's more...
There are a number of Jupyter options you should probably be familiar with.
These are in the file at $HOME/.jupyter/jupyter_notebook_config.py. If you
don't have the file yet, you can create it using this command:
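The command (run from a terminal, or prefixed with ! in a notebook cell) is:

```
jupyter notebook --generate-config
```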
Then we create a random password and configure it. We disable the option to
have the browser open when we run Jupyter Notebook, and we then set the
default port to 8888.
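A sketch of what such a configuration could look like (the password hash is a placeholder; the configuration file itself is Python):

```python
# generate a password hash once, for example in a notebook cell:
from notebook.auth import passwd
passwd()   # prompts for a password and prints a hash such as 'sha1:...'

# then, in $HOME/.jupyter/jupyter_notebook_config.py:
c.NotebookApp.password = 'sha1:<paste-the-hash-here>'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
```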
Running a Jupyter Notebook server on a remote machine like this has several benefits:
- You can use the resources of a powerful server while simply accessing it
  through your browser.
- You can manage your packages in a contained environment on that server,
  while not affecting the server itself.
- You'll find yourself interacting with Jupyter Notebook's familiar REPL,
  which allows you to quickly test ideas and prototype projects.
If you are a single person, you don't need this; however, if you work in a team,
you can put each person into a contained environment using either Docker or
JupyterHub. Online, you'll find setup instructions for setting up a Jupyter
environment with Docker.
See also
You can read up more on conda, Docker, JupyterHub, and other related tools on
their respective documentation sites, as follows:
Let's look at some simple but very handy tricks to make working in notebooks
more comfortable and efficient. These are applicable whether you are relying on
a local or hosted Python environment.
In this recipe, we'll look at a lot of different things that can help you become
more productive when you are working in your notebook and writing Python
code for AI solutions. Some of the built-in or externally available magic
commands or extensions can also come in handy (see
https://iPython.readthedocs.io/en/stable/interactive/magics.html for more
details).
It's important to be aware of some of the Python efficiency hacks when it comes
to machine learning, especially when working with some of the bigger datasets
or more complex algorithms. Sometimes, your jobs can take a very long time to run,
but often there are ways around it. For example, one, often relatively easy, way
of finishing a job faster is to use parallelism.
Getting ready
If you are using your own installation, whether directly on your system or inside
a Docker environment, make sure that it's running. Then put the address of your
Colab or Jupyter Notebook instance into your browser and press Enter.
With that done, let's get to some efficiency hacks that make working in Jupyter
faster and more convenient.
How to do it...
The sub-recipes here are short and sweet, and all provide ways to be more
productive in Jupyter and Python.
If not indicated otherwise, all of the code needs to be run in a notebook, or, more
precisely, in a notebook cell.
There are lots of different ways to obtain the code in Jupyter cells
programmatically. Apart from these inputs, you can also look at the generated
outputs. We'll get to both, and we can use global variables for this purpose.
Execution history
The _ih list holds the code of executed cells. In order to get the complete
execution history and write it to a file, you can do the following:
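For example (a minimal sketch; the filename is arbitrary):

```python
with open('command_history.py', 'w') as file:
    for cell_input in _ih[:-1]:      # _ih holds the inputs of all executed cells
        file.write(cell_input + '\n')
```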
On Windows, to print the content of a file, you can use the type command.
Instead of _ih, we can use a shorthand for the content of the last three cells. _i
gives you the code of the cell that just executed, _ii is used for the code of the
cell executed before that, and _iii for the one before that.
Outputs
In order to get recent outputs, you can use _ (single underscore), __ (double
underscore), and ___ (triple underscore), respectively, for the most recent,
second, and third most recent outputs.
Auto-reloading packages
autoreload is a built-in extension that reloads the module when you make
changes to a module on disk. It will automagically reload the module once
you've saved it.
This can save a lot of time when you are developing (and testing) a library or
module.
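Enabling it takes two magic commands:

```python
%load_ext autoreload
%autoreload 2    # reload all modules before executing code
```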
Debugging
If you cannot spot an error and the traceback of the error is not enough to find
the problem, debugging can speed up the error-searching process a lot. Let's
have a quick look at the debug magic:
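A minimal illustration (the function is hypothetical; the book's example may differ):

```python
def broken_normalize(x, norm=0):
    return x / norm          # raises ZeroDivisionError when norm == 0

broken_normalize(10)         # running this cell raises an exception

# in the next cell, start a post-mortem debugging session:
%debug
# at the debugger prompt, type a to print the function arguments and q to quit
```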
3. Execute the cell by pressing Ctrl + Enter or Alt + Enter. You will get a debug
prompt:
We've used the argument command to print out the arguments of the executed
function, and then we quit the debugger with the quit command. You can find
more commands on The Python Debugger (pdb) documentation page at
https://docs.Python.org/3/library/pdb.html.
Once your code does what it's supposed to, you often get into squeezing every
bit of performance out of your models or algorithms. For this, you'll check
execution times and create benchmarks using them. Let's see how to time
executions.
There is a built-in magic command for timing cell execution – timeit. The
timeit functionality is part of the Python standard library
(https://docs.Python.org/3/library/timeit.html). It runs a statement many times
inside a loop (by default, the number of loop iterations is chosen automatically
and the measurement is repeated several times) and shows an average execution
time as a result:
We see the following output:
Please note that this syntax works for Colab, but not in standard Jupyter
Notebook. What always works to install libraries is using the pip or
conda magic commands, %pip and %conda, respectively. Also, you can
execute any shell command from the notebook if you start your line with
an exclamation mark, like this:
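For example (installing a library that we'll use in a moment):

```python
!pip install tqdm
```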
3. Test how long a simple list comprehension takes with the following
command:
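For instance (an illustrative command; the book's exact expression may differ):

```python
%timeit [i ** 2 for i in range(1000)]
```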
Hopefully, you can see how this can come in handy for comparing different
implementations. Especially in situations where you have a lot of data, or
complex processing, this can be very useful.
Even if your code is optimized, it's good to know if it's going to finish in
minutes, hours, or days. tqdm provides progress bars with time estimates. If you
aren't sure how long your job will run, it's just one letter away – in many cases,
it's just a matter of changing range for trange:
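For example (a sketch):

```python
from time import sleep
from tqdm import trange

for i in trange(10):   # trange is a drop-in replacement for range with a progress bar
    sleep(0.1)
```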
The tqdm pandas integration (optional) means that you can see progress bars for
pandas apply operations. Just swap apply for progress_apply.
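A minimal sketch (the DataFrame and the function are placeholders; note the tqdm.pandas() registration call):

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()                           # registers progress_apply on pandas objects
df = pd.DataFrame({'x': range(10000)})
df['y'] = df['x'].progress_apply(lambda value: value ** 2)
```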
For Python loops just wrap your loop with a tqdm function and voila, there'll be a
progress bar and time estimates for your loop completion!
Tqdm provides different ways to do this, and they all require minimal code
changes - sometimes as little as one letter, as you can see in the previous
example. The more general syntax is wrapping your loop iterator with tqdm like
this:
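For instance (the work inside the loop is a placeholder):

```python
from time import sleep
from tqdm import tqdm

results = []
for item in tqdm(range(100)):    # wrapping the iterator is all that's needed
    sleep(0.01)                  # stands in for real work
    results.append(item ** 2)
```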
So, next time you are just about to set off a long-running loop, and you are not
sure how long it will take, just remember this sub-recipe, and use tqdm.
Let's first look at Cython. Cython is an optimizing static compiler for Python,
and the programming language compiled by the Cython compiler. The main idea
is to write code in a language very similar to Python, and generate C code. This
C code can then be compiled as a binary Python extension. SciPy (and NumPy),
scikit-learn, and many other libraries have significant parts written in Cython for
speed up. You can find out more about Cython on its website at
https://cython.org/:
1. You can use the Cython extension for building cython functions in your
notebook:
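A sketch (the Fibonacci function is a hypothetical example, not the book's). First, load the extension:

```python
%load_ext Cython
```

Then, in a separate cell:

```
%%cython
def fib(int n):
    cdef int i
    cdef double a = 0.0, b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a
```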
3. We can call this function just like any Python function – with the added
benefit that it's already compiled:
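Continuing the hypothetical example from above:

```python
fib(20)    # returns 6765.0
```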
This is perhaps not the most useful example of compiling code. For such
a small function, the overhead of compilation is too big. You would
probably want to compile something that's a bit more complex.
Numba is a JIT compiler for Python (https://numba.pydata.org/). You can
often get a speed-up similar to C or Cython using numba and writing
idiomatic Python code like the following:
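A sketch of what such a function could look like (not the book's exact example):

```python
import numpy as np
from numba import njit

@njit
def sum_of_squares(arr):
    total = 0.0
    for i in range(arr.shape[0]):   # an explicit loop that numba compiles to machine code
        total += arr[i] ** 2
    return total

sum_of_squares(np.arange(1_000_000, dtype=np.float64))
```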
So there are different ways to get speed benefits from using JIT or ahead-of-time
compilation. We'll see some other ways of speeding up your code in the
following sections.
One of the most important libraries throughout this book will be pandas, a
library for tabular data that's useful for Extract, Transform, Load (ETL) jobs.
Pandas is a wonderful library; however, once you get to more demanding tasks,
you'll hit some of its limitations. Pandas is the go-to library for loading and
transforming data. One problem with data processing is that it can be slow, even
if you vectorize the function or if you use df.apply().
You can move further by parallelizing apply. Some libraries, such as swifter,
can help you by choosing backends for computations for you, or you can make
the choice yourself:
You can use Dask DataFrames instead of pandas if you want to run on
multiple cores of the same or several machines over a network.
You can use CuPy or cuDF if you want to run computations on the GPU
instead of the CPU. These have stable integrations with Dask, so you can run
both on multiple cores and multiple GPUs, and you can still rely on a pandas-
like syntax (see https://docs.dask.org/en/latest/gpu.html).
As we've mentioned, swifter can choose a backend for you with no change of
syntax. Here is a quick setup for using pandas with swifter:
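A sketch (the DataFrame and the function are placeholders):

```python
import pandas as pd
import swifter   # importing swifter registers the .swifter accessor on pandas objects

df = pd.DataFrame({'x': range(1_000_000)})
df['y'] = df['x'].swifter.apply(lambda value: value ** 2)
```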
You can further improve the speed of execution by using the underlying NumPy
arrays directly and applying NumPy functions to them (for example, via
df.values or df['column'].values). NumPy vectorization can be a breeze, really.
See the following example of applying NumPy vectorization to a pandas DataFrame
column:
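For example (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1_000_000)})
# a single vectorized NumPy call on the underlying array, instead of a per-row apply
df['log_x'] = np.log1p(df['x'].values)
```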
These are just two ways, but if you look at the next sub-recipe, you should be
able to write a parallel map function as yet another alternative.
One way to get something done more quickly is to do multiple things at once.
There are different ways to implement your routines or algorithms with
parallelism. Python has a lot of libraries that support this functionality. Let's see
a few examples with multiprocessing, Ray, joblib, and how to make use of
scikit-learn's parallelism.
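The joblib parallel-loop example referred to below looks like this (taken in essence from the joblib documentation):

```python
from math import sqrt
from joblib import Parallel, delayed

Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```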
This would give you [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
9.0]. We took this example from the joblib examples about parallel for loops,
available at https://joblib.readthedocs.io/en/latest/parallel.html.
When using scikit-learn, watch out for functions that have an n_jobs parameter.
This parameter is directly handed over to joblib.Parallel
(https://github.com/joblib/joblib/blob/master/joblib/parallel.py). None (the
default setting) means sequential execution, in other words, no parallelism. So if
you want to execute code in parallel, make sure to set this n_jobs parameter, for
example, to -1 in order to make full use of all your CPUs.
PyTorch and Keras both support multi-GPU and multi-CPU execution. Multi-
core parallelization is done by default. Multi-machine execution in Keras is
getting easier from release to release with TensorFlow as the default backend.
See also
While notebooks are convenient, they are often messy, not conducive to good
coding habits, and they cannot be versioned cleanly. Fastai has developed an
extension for literate code development in notebooks called nbdev
(https://github.com/fastai/nbdev), which provides tools for exporting and
documenting code.
There are a lot more useful extensions that you can find in different places:
Some other libraries used or mentioned in this recipe include the following:
Swifter: https://github.com/jmcarpenter2/swifter
Autoreload: https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
pdb: https://docs.Python.org/3/library/pdb.html
tqdm: https://github.com/tqdm/tqdm
JAX: https://jax.readthedocs.io/
Seaborn: https://seaborn.pydata.org/
Numba: https://numba.pydata.org/numba-doc/latest/index.html
Dask: https://ml.dask.org/
CuPy: https://cupy.chainer.org
cuDF: https://github.com/rapidsai/cudf
Ray: http://ray.readthedocs.io/en/latest/rllib.html
joblib: https://joblib.readthedocs.io/en/latest/
Throughout these recipes and several subsequent ones, we'll focus on covering
first the basics of the three most important libraries for AI in Python: scikit-
learn, Keras, and PyTorch. Through this, we will introduce basic and
intermediate techniques in supervised machine learning with deep neural
networks and other algorithms. This recipe will cover the basics of these three
main libraries in machine learning and deep learning.
These recipes are for introducing the basics of the three libraries. However, even
if you've already worked with all of them, you might still find something of
interest.
Getting ready
The Iris Flower dataset is one of the oldest machine learning datasets still in use.
It was published by Ronald Fisher in 1936 to illustrate linear discriminant
analysis. The problem is to classify one of three iris flower species based on
measurements of sepal and petal width and length.
This is a standard process template that we will have to apply to most of the
problems shown throughout this book. Typically, with industrial-scale problems,
Steps 1 and 2 can take much longer (sometimes estimated to take about 95
percent of the time) than for one of the already preprocessed datasets that you
will get for a Kaggle competition or at the UCI machine learning repository. We
will go into the complexities of each of these steps in later recipes and chapters.
We'll assume you've installed the three libraries earlier on and that you have
your Jupyter Notebook or Colab instance running. Additionally, we will use the
seaborn and scikit-plot libraries for visualization, so we'll install them as well:
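For example, from a notebook cell:

```python
!pip install seaborn scikit-plot
```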
The convenience of using a dataset so well known is that we can easily load it
from many packages, for example, like this:
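One option (a sketch using scikit-learn; seaborn and other libraries ship the same dataset):

```python
from sklearn import datasets

iris = datasets.load_iris()
```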
How to do it...
In this recipe, we'll go through the basic steps of data exploration. This is often
important to understand the complexity of the problem and any underlying
issues with the data:
1. Plot a pair-plot:
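A sketch of such a plot (the variable names are assumptions carried through the following snippets):

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[t] for t in iris.target]
sns.pairplot(df, hue='species')
```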
Here it comes (rendered in seaborn's pleasant spacing and coloring):
From this plot, especially if you look along the diagonal, we can see that
the virginica and versicolor species are not (linearly) separable. This is
something we are going to struggle with, and that we'll have to
overcome.
We only see setosa, since the flower species are ordered and listed one
after another:
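The cell and its output are not reproduced in this extract; a sketch of what it might have looked like (the book's exact code may differ):

```python
df.head()   # shows only setosa rows, because the dataset is ordered by species

# ordinal-encode the species labels (this is the "last line" referred to below)
df['species'] = df['species'].astype('category').cat.codes
```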
The last line converted the three strings corresponding to the three classes into
numbers – this is called an ordinal coding. A multiclass machine learning
algorithm can deal with this. For neural networks, we'll use another encoding, as
you'll see later.
After these basic steps, we are ready to start developing predictive models.
These are models that predict the flower class from the features. We'll see this in
turn for each of the three most important machine learning libraries in Python.
Let's start with scikit-learn.
Modeling in scikit-learn
In this recipe, we'll create a classifier in scikit-learn, and check its performance.
Please note that not all scikit-learn classifiers can do multiclass problems. All
classifiers can do binary classification, but not all can do more than two classes.
The random forest model can, fortunately. The random forest model
(sometimes referred to as random decision forest) is an algorithm that can be
applied to classification and regression tasks, and is an ensemble of decision
trees. The main idea is that we can increase precision by creating decision trees
on bootstrapped samples of the dataset, and average over these trees.
Some of the following lines of code should appear to you as boilerplate, and
we'll use them over and over:
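A sketch of this boilerplate (test size and random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df['species'],
    test_size=0.33, random_state=0,
)
```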
2. Define a model.
Hyperparameters are parameters that are not part of the learning process, but control
the learning. In the case of neural networks, this includes the learning rate and the model architecture.
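A sketch of the model definition (the hyperparameter values are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
```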
Here, we pass the training dataset to our model. During training, the
parameters of the model are being fit so that we obtain better results
(where better is defined by a function, called the cost function or loss
function).
For training we use the fit method, which is available for all sklearn-
compatible models:
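For example, continuing the sketch above:

```python
rf.fit(X_train, y_train)
```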
Since this is a normalized matrix, the numbers on the diagonal are also called the
hit rate or true positive rate. We can see that setosa was predicted as setosa
100% (1) of the time. By contrast, versicolor was predicted as versicolor 95% of
the time (0.95), while 5% of the time (0.053) it was predicted as virginica.
The performance is very good in terms of hit rate; however, as expected, we are
having a small problem distinguishing between versicolor and virginica.
Modeling in Keras
Keras is a high-level interface for (deep) neural network models that can use
TensorFlow as a backend, but also Microsoft Cognitive Toolkit (CNTK),
Theano, or PlaidML. Keras is an interface for developing AI models, rather than
a standalone framework itself. Keras has been integrated as part of TensorFlow,
so we import Keras from TensorFlow. Both TensorFlow and Keras are open
source and developed by Google.
Since Keras is tightly integrated with TensorFlow, Keras models can be saved as
TensorFlow models and then deployed in Google's deployment system,
TensorFlow Serving (see https://www.tensorflow.org/tfx/guide/serving), or used
from other programming languages, such as C++ or Java. Let's get into it:
1. Run the following code. If you are familiar with Keras, you'll recognize it as
boilerplate:
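A sketch consistent with the description that follows (the number of hidden units is an assumption):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation='selu', input_shape=(4,)),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
model.summary()
```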
This yields the following model construction:
We can visualize this model in different ways. We can use the built-in
Keras functionality as follows:
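For example (this requires the pydot and graphviz packages):

```python
keras.utils.plot_model(model, show_shapes=True)
```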
We use two dense layers, the intermediate layer with SELU activation
function, and the final layer with the softmax activation function. We'll
explain both of these in the How it works... section. As for the SELU
activation function, suffice it to say for now that it provides a necessary
nonlinearity so that the neural network can deal with data that is not
linearly separable, as in our case. In practice, it is rare to use a
linear (identity function) activation in the hidden layers.
Each unit (or neuron) in the final layer corresponds to one of the three
classes. The softmax function normalizes the output layer so that its
neural activations add up to 1. We train with categorical cross-entropy as
our loss function. Cross-entropy is typically used for classification
problems with neural networks. The binary cross-entropy loss is for two
classes, and categorical cross-entropy is for two or more classes (cross-
entropy will be explained in more detail in the How it works... section).
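The one-hot conversion of the integer labels can be done with a Keras utility, for example (a sketch):

```python
from tensorflow.keras.utils import to_categorical

y_categorical = to_categorical(df['species'], num_classes=3)
```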
Our y_categorical therefore has the shape (150, 3). This means that to
indicate class 0 as the label, instead of having a 0 (this would be
sometimes called label encoding or integer encoding), we have a vector
of [1.0, 0.0, 0.0]. This is called one-hot encoding. The sum of each
row is equal to 1.
For neural networks, our features should be normalized in a way that the
activation functions can deal with the whole range of inputs – often this
normalization is to the standard distribution, which has a mean of 0.0 and
standard deviation of 1.0:
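A sketch of the normalization and training steps (the TensorBoard callback is included here because the training curves are inspected in TensorBoard below):

```python
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

X_scaled = StandardScaler().fit_transform(df[iris.feature_names])

tensorboard = keras.callbacks.TensorBoard(log_dir='./logs')
history = model.fit(
    X_scaled, y_categorical,
    epochs=150,
    callbacks=[tensorboard],
)
```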
This runs our training. An epoch is an entire pass of the dataset through
the neural network. We use 150 here, which is a bit arbitrary. We could
have used a stopping criterion to stop training automatically when
validation and training errors start to diverge, or in other words, when
overfitting occurs.
In order to use plot_confusion_matrix() as before, for comparison,
we'd have to wrap the model in a class that implements the predict()
method, and has a list of classes_ and an attribute of _estimator_type
that is equal to 'classifier'. We will show that in the online material.
7. Check the charts from TensorBoard: the training progress and the model
graph. Here they are:
These plots show the accuracy and loss, respectively, over the entire
training. We also get another visualization of the network in
TensorBoard:
This shows all the network layers, the loss and metrics, the optimizer
(RMSprop), and the training routine, and how they are related. As for the
network architecture, we can see four dense layers (the presented input and
targets are not considered proper parts of the network, and are therefore colored
in white). The network consists of a dense hidden layer (being fed by the input),
and a dense output layer (being fed by the hidden layer). The loss function is
calculated between the output layer activation and the targets. The optimizer
works with all layers based on the loss. You can find a tutorial on TensorBoard
at https://www.tensorflow.org/tensorboard/get_started. The TensorBoard
documentation explains more about configuration and options.
So the classification accuracy is improving and the loss is decreasing over the
course of the training epochs. The final graph shows the network and training
architecture, including the two dense layers, the loss and metrics, and the
optimizer.
Modeling in PyTorch
In this recipe, we will describe a network equivalent to the previous one shown
in Keras, train it, and plot the performance.
1. Let's define the model architecture first. This looks very similar to Keras:
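A sketch mirroring the Keras model above (the hidden size is an assumption; with nn.CrossEntropyLoss, the softmax is applied inside the loss, so the network outputs raw logits):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.SELU(),
    nn.Linear(16, 3),   # raw logits; nn.CrossEntropyLoss applies softmax internally
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
```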
If you prefer an output similar to the summary() function in Keras, you can use a third-party package such as torchsummary.
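The training loop and the plot are not reproduced in this extract; a minimal sketch (reusing X_scaled and the integer labels from the previous snippets):

```python
import matplotlib.pyplot as plt
import numpy as np

X_t = torch.from_numpy(X_scaled.astype(np.float32))
y_t = torch.from_numpy(df['species'].values.astype(np.int64))

losses = []
for epoch in range(150):
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

plt.plot(losses)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
```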
Your plot might differ. Neural network learning is not deterministic, so you
could get better or worse numbers, or just different ones.
We can get better performance if we let this run longer. This is left as an
exercise for you.
How it works...
We'll first look at the intuitions behind neural network training, then we'll look a
bit more at some of the technical details that we will use in the PyTorch and
Keras recipes.
In the simplest terms, in a feed-forward neural network of one layer with linear
activations, the model predictions are given by the sum of the product of the
coefficients with the input in all of its dimensions:
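In symbols (a reconstruction, since the original formula is not shown in this extract), with weights w_i and inputs x_i:

\hat{y} = \sum_i w_i x_i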
We can also use the same very simple linear algebra to define the binary
classifier by thresholding as follows:
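For instance (again a reconstruction), for some threshold \theta:

\hat{y} = 1 \text{ if } \sum_i w_i x_i > \theta, \text{ else } 0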
This is still very simple linear algebra. This linear model with just one layer,
called a perceptron, has difficulty predicting any more complex relationships.
This led to deep concern about the limitations of neural networks following an
influential paper by Minsky and Papert in 1969. However, since the 1990s,
neural networks have been experiencing a resurgence in the shape of support
vector machines (SVMs) and the multilayer perceptron (MLP). The MLP is a
feed-forward neural network with at least one layer between the input and output
(hidden layer). Since a multilayer perceptron with many layers of linear
activations can be reduced to just one layer, from here on we'll be referring to
neural networks with hidden layers and nonlinear activation functions. These
types of models can approximate arbitrary functions and perform nonlinear
classification (according to the Universal Approximation Theorem). The
activation function on any layer can be any differentiable nonlinearity;
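The original code block is not reproduced in this extract; the following is a minimal NumPy sketch of what a construct_network() of this kind might look like (names and details are assumptions based on the surrounding description):

```python
import numpy as np

def construct_network(layer_sizes, n_inputs=4):
    """Initialize one weight matrix per layer."""
    sizes = [n_inputs] + list(layer_sizes)
    return [np.random.randn(n_in, n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def predict(params, x):
    """Feed-forward pass with a nonlinearity on the hidden layers."""
    for w in params[:-1]:
        x = np.tanh(x @ w)
    return x @ params[-1]
```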
If you look at this code, you'll see that we could have equally written this up
with operations in NumPy, TensorFlow, or PyTorch. You'll also note that the
construct_network() function takes a layer_sizes argument. This is one of
the hyperparameters of the network, something to decide on before learning. We
can choose just an output of [1] to get the perceptron, or [10, 1] to get a two-
layer perceptron. So this shows how to get a network as a set of parameters and
how to get a prediction from this network. We still haven't discussed how we
learn the parameters, and this brings us to errors.
There's an adage that says, "all models are wrong, but some are useful." We can
measure the error of our model, and this can help us to calculate the magnitude
and direction of changes that we can make to our parameters in order to reduce
the error.
Given a (differentiable) loss function (also called the cost function), L, such as
the mean squared error (MSE), we can calculate our error. In the case of the
MSE, the loss function is as follows:
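As a reconstruction of the missing formula, for n training points:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2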
Then in order to get the change to our weights, we'll use the derivative of the
loss over the points in training:
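That is (a reconstruction), each weight is updated as follows, where \alpha is the learning rate:

w_i \leftarrow w_i - \alpha \frac{\partial L}{\partial w_i}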
This means we are applying gradient descent, which means that over time, our
error will be reduced proportionally to the gradient (scaled by the learning rate \alpha).
Let's continue with our code:
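The book's code is not reproduced in this extract; as an illustration of how autograd can supply these gradients, here is a JAX-based sketch (not the author's exact code) that continues the NumPy example above:

```python
import jax.numpy as jnp
from jax import grad

def predict(params, x):
    x = jnp.asarray(x)
    for w in params[:-1]:
        x = jnp.tanh(x @ w)
    return x @ params[-1]

def mse_loss(params, x, y):
    return jnp.mean((predict(params, x).ravel() - y) ** 2)

grad_loss = grad(mse_loss)            # gradients with respect to params

def update(params, x, y, lr=0.01):
    grads = grad_loss(params, x, y)
    return [w - lr * g for w, g in zip(params, grads)]
```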
Both PyTorch and JAX have autograd functionality, which means that we can
automatically get derivatives (gradients) of a wide range of functions.
We'll encounter a lot of different activation and loss functions throughout this
book. In this chapter, we used the SELU activation function.
The scaled exponential linear unit (SELU) activation function was published
quite recently by Klambauer et al. in 2017 (http://papers.nips.cc/paper/6698-self-
normalizing-neural-networks.pdf):
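The definition (a reconstruction from the paper, with constants \lambda \approx 1.0507 and \alpha \approx 1.6733):

\mathrm{selu}(x) = \lambda x \text{ for } x > 0, \quad \mathrm{selu}(x) = \lambda(\alpha e^{x} - \alpha) \text{ for } x \le 0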
The SELU function is linear for positive values of x, a scaled exponential for
negative values, and 0 when x is 0; \lambda is a value greater than 1. You can find the
details in the original paper. The SELU function has been shown to have better
convergence properties than other functions. You can find a comparison of
activation functions in Padamonti (2018) at https://arxiv.org/pdf/1804.02763.pdf.
Softmax activation
As our activation function for the output layer in the neural networks, we use a
softmax function. This normalizes the neural activations of the output layer so
that they sum to 1.0. The output can therefore be interpreted as the
class probabilities. The softmax activation function is defined as follows:
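The definition (a reconstruction), for K output units:

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}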
Cross-entropy
In multiclass training with neural networks, it's common to train for cross-
entropy. The categorical cross-entropy for the multiclass case looks like the following:
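As a reconstruction, for M classes with one-hot targets y_c and predicted probabilities p_c, averaged over the samples:

L = -\sum_{c=1}^{M} y_c \log(p_c)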
See also
You can find out more details on the website of each of the libraries used in this
recipe:
Seaborn: https://seaborn.pydata.org/
Scikit-plot: https://scikit-plot.readthedocs.io/
Scikit-learn: https://github.com/scikit-learn/scikit-learn
Keras: https://github.com/keras-team/keras
TensorFlow: http://tensorflow.org/
TensorBoard: https://www.tensorflow.org/tensorboard
PyTorch: https://pytorch.org/
It should probably be noted that scikit-plot is not maintained anymore. For the
plotting of machine learning metrics and charts, mlxtend is a good option, at
http://rasbt.github.io/mlxtend/.
Some other libraries we used here and that we will encounter throughout this
book include the following:
Matplotlib: https://matplotlib.org/
NumPy: https://docs.scipy.org/doc/numpy
SciPy: https://docs.scipy.org/doc/scipy/reference
pandas: https://pandas.pydata.org/pandas-docs/stable
In the following recipe, we'll get to grips with a more realistic example in Keras.
Since we have a few categorical variables, we'll also deal with the encoding of
categorical variables.
Since this is still an introductory recipe, we'll go through this problem with a lot
of detail for illustration. We'll have the following parts:
Model training:
1. Creating the model
2. Writing a data generator
3. Training the model
4. Plotting the performance
5. Extracting performance metrics
6. Calculating feature importances
Getting ready
We'll need a few libraries for this recipe in addition to the libraries we installed
earlier:
Please note that when installing several libraries at once, some of the libraries might become incompatible,
creating a broken environment. We'd recommend using conda when a conda package of a library is available.
This dataset is already split into training and test. Let's download the dataset
from UCI as follows:
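For example, from a terminal (or prefixed with ! in a notebook cell); these are the standard UCI URLs for the Adult/census income dataset:

```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
```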
wget doesn't ship with macOS by default; we suggest installing wget using brew
(https://formulae.brew.sh/formula/wget). On Windows, you can visit the two
preceding URLs and download both via the File menu. Make sure you remember
the directory where you save the files, so you can find them later. There are a
few alternatives, however:
You can use the download script we provide in Chapter 2, Advanced Topics
in Supervised Machine Learning, in the Predicting house prices in PyTorch
recipe.
You can install the wget library and run import wget; wget.download(URL,
filepath).
We have the following information from the UCI dataset description page:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay,
Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th,
12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-
absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-
cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, and so on.
fnlwgt actually stands for the final weight; in other words, the total number of
people constituting the entry.
Please keep in mind that this dataset is a well-known dataset that has been used
many times in scientific publications and in machine learning tutorials. We are
using it here to go over some basics in Keras without having to focus on the
dataset.
How to do it...
As we've mentioned before, we'll first load the dataset, do some EDA, then
create a model in Keras, train it, and look at the performance.
We've split this recipe up into data loading and preprocessing, and secondly,
model training.
1. Loading the dataset: In order to load the dataset, we'll use pandas again. We
use pandas' read_csv() command as before:
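A sketch (the column names follow the UCI description; the book's exact names may differ – in particular, '50k' for the target column is an assumption based on the code mentioned later):

```python
import pandas as pd

cols = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', '50k',
]
train = pd.read_csv('adult.data', names=cols, header=None, skipinitialspace=True)
test = pd.read_csv('adult.test', names=cols, header=None, skiprows=1, skipinitialspace=True)
```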
Now let's look at the data!
2. Inspecting the data: We can see the beginning of the DataFrame with the
head() method:
3. Categorical encoding: Let's start with category encoding. For EDA, it's good
to use ordinal encoding. This means that for a categorical feature, we map
each value to a distinct number:
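One way to do this (a sketch; the book's exact code may differ):

```python
X = train.drop(columns='50k')    # the features, without the label
y = train['50k']                 # the target (<=50K / >50K)

# ordinal encoding: map each categorical value to a distinct integer
for col in X.select_dtypes('object').columns:
    X[col] = X[col].astype('category').cat.codes
```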
We are separating X, the features, and y, the targets, here. The features
don't contain the labels; that's the purpose of the drop() method – we
could have equally used del train['50k'].
When starting with a new task, it's best to do EDA. Let's plot some of
these variables.
Next, we'll look at a pair-plot again. We'll plot all numerical variables
against each other:
As discussed in the previous recipe, the diagonal in the pair-plot shows
us histograms of single variables – that is, the distribution of the variable
– with the hue defined by the classes. Here we have orange versus blue
(see the legend on the right of the following plot). The subplots off the diagonal
show scatter plots between pairs of variables:
If we look at the age variable on the diagonal (second row), we see that
the two classes have a different distribution, although they are still
overlapping. Therefore, age seems to be discriminative with respect to
our target class.
Since the MIC can take a while to compute, we'll take the parallelization
pattern we introduced earlier. Please note the creation of the thread pool
and the map operation:
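A sketch of this pattern (it assumes the minepy library for the MIC computation; the book's implementation may differ):

```python
from itertools import combinations
from multiprocessing.pool import ThreadPool

from minepy import MINE

def mic(column_pair):
    (name_a, col_a), (name_b, col_b) = column_pair
    mine = MINE()
    mine.compute_score(col_a.values, col_b.values)
    return name_a, name_b, mine.mic()

pairs = list(combinations(X.select_dtypes('number').items(), 2))
with ThreadPool(4) as pool:
    correlations = pool.map(mic, pairs)
```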
This can still take a while, but should be much faster than doing the
computations in sequence.
We can see in the correlation matrix heatmap that most pair correlations
are pretty low (most correlations are below 0.4), meaning that most
features are relatively uncorrelated; however, there is one pair of
variables that stands out, those of education-num and education:
The output is 0.9995095286140694.
The UCI description page mentions missing variables. Let's look for
missing variables now:
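For example:

```python
train.isnull().any()
```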
We only see False for each variable, so we cannot see any missing
values here.
In the following code block, we just made a choice and stuck with it:
Model training
We'll create the model, train it, plot performance, and then calculate the feature
importance.
1. To create the model, we use the Sequential model type again. Here's our
network architecture:
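A sketch of such an architecture (layer sizes are assumptions; X_train stands for the one-hot encoded and scaled training matrix):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```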
Here's the Keras model summary:
2. Now, let's write a data generator. To make this a bit more interesting, we will
use a generator this time to feed in our data in batches. This means that we
stream in our data instead of putting all of our training data into the fit()
function at once. This can be useful for very big datasets.
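A minimal sketch of such a generator (not the book's exact code; X and y are assumed to be NumPy arrays):

```python
import numpy as np

def data_generator(X, y, batch_size=32):
    """Yield shuffled mini-batches indefinitely."""
    n = len(X)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]
```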
3. Now that we have our data generator, we can train our model as follows:
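For example (with recent TensorFlow versions, fit() accepts the generator directly; the older fit_generator() mentioned in the How it works... section behaves the same way):

```python
batch_size = 32
history = model.fit(
    data_generator(X_train, y_train, batch_size),
    steps_per_epoch=len(X_train) // batch_size,
    epochs=10,
)
```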
We have the output from the training, such as loss and metrics, in our
history variable.
4. This time we will plot the training progress over epochs from the Keras
training history instead of using TensorBoard. We didn't do validation, so we
will only plot the training loss and training accuracy:
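A sketch of such a plot:

```python
import matplotlib.pyplot as plt
import pandas as pd

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.xlabel('epoch')
plt.show()
```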
5. Since we've already one-hot encoded and scaled our test data, we can directly
predict and calculate our performance. We will calculate the AUC (area-
under-the-curve) score using sklearn's built-in functions. The AUC score
comes from the receiver operating characteristics, which is a visualization of
the false positive rate (also called the false alarm rate) on the x axis, against
the true positive rate (also called the hit rate) on the y axis. The integral under
this curve, the AUC score, is a popular measure of classification performance
and is useful for understanding the trade-off between a high hit rate and any
false alarms:
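For example (X_test and y_test being the preprocessed test features and binary labels):

```python
from sklearn.metrics import roc_auc_score

y_pred = model.predict(X_test).ravel()
roc_auc_score(y_test, y_pred)
```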
We get 0.7579310072282265 as the AUC score. An AUC score of 76%
can be a good or bad score depending on the difficulty of the task. It's not
bad for this dataset, but we could probably improve the performance by
tweaking the model more. However, for now, we'll leave it as it is here.
6. Finally, we are going to check the feature importances. For this, we are going
to use the eli5 library for black-box permutation importance. Black-box
permutation importance encompasses a range of techniques that are model-
agnostic, and, roughly speaking, permute features in order to establish their
importance. You can read more about permutation importance in the How it
works... section.
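A sketch using eli5's model-agnostic helper (the names of the test arrays are assumptions):

```python
import numpy as np
from eli5.permutation_importance import get_score_importances
from sklearn.metrics import roc_auc_score

def score(X, y):
    return roc_auc_score(y, model.predict(X).ravel())

base_score, score_decreases = get_score_importances(score, X_test, y_test)
feature_importances = np.mean(score_decreases, axis=0)
```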
Your final list might differ from the list here. The neural network training is not
deterministic, although we could have tried to fix the random generator seed.
Here, as we've expected, age is a significant factor; however, some categories in
relationship status and marital status come up before age.
How it works...
We went through a typical process in machine learning: we loaded a dataset,
plotted and explored it, and did preprocessing with the encoding of categorical
variables and normalization. We then created and trained a neural network
model in Keras, and plotted the training and validation performance. Let's talk
about what we did in more detail.
There are many ways to calculate and plot correlation matrices, and we'll see
some more possibilities in the recipes to come. Here we've calculated
correlations based on the maximal information coefficient (MIC). The MIC
comes from the framework of maximal information-based nonparametric
exploration. This was published in Science Magazine in 2011, where it was
hailed as the correlation metric of the 21st century (the article can be found at
https://science.sciencemag.org/content/334/6062/1518.full).
Data generators
If you are familiar with Python generators, you won't need an explanation for
what this is, but maybe a few clarifying words are in order. Using a generator
gives the possibility of loading data on-demand or on-line, rather than at once.
This means that you can work with datasets much larger than your available
memory.
There are different ways to implement generators with Keras, such as the following:

- Using any Python generator function (one that uses yield)
- Subclassing keras.utils.Sequence

For the first option, we can use any generator really, but this uses a function with
yield. This means we're providing the steps_per_epoch parameter for the Keras
fit_generator() function.

For the second option, we subclass keras.utils.Sequence and implement the following methods:

- __len__(), in order for the fit_generator() function to know how much more
data is to come. This corresponds to steps_per_epoch and is the number of
samples divided by the batch size, rounded up.
- __getitem__(), for the fit_generator to ask for the next batch.
- on_epoch_end(), to do some shuffling or other things at the end of an epoch –
this is optional.
We'll see later that batch data loading using generators is often a part of online
learning, that is, the type of learning where we incrementally train a model on
more and more data as it comes in.
Permutation importance
The eli5 library can calculate permutation importance, which measures the
increase in the prediction error when features are not present. It's also called the
mean decrease accuracy (MDA). Instead of re-training the model in a leave-
one-feature-out fashion, the feature can be replaced by random noise. This noise
is drawn from the same distribution as the feature so as to avoid distortions.
Practically, the easiest way to do this is to randomly shuffle the feature values
between rows. You can find more details about permutation importance in
Breiman's Random Forests (2001), at
https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf.
See also
We'll cover a lot more about Keras, the underlying TensorFlow library, online
learning, and generators in the recipes to come. I'd recommend you get familiar
with layer types, data loaders and preprocessors, losses, metrics, and training
options. All this is transferable to other frameworks such as PyTorch, where the
application programming interface (API) differs; however, the essential
principles are the same.
For more datasets, the following three websites are your friends:
We'll be predicting partner choices with sklearn, where we'll implement a lot of
custom transformer steps and more complicated machine learning pipelines.
We'll then predict house prices in PyTorch and visualize feature and neuron
importance. After that, we will perform active learning to decide customer
values together with online learning in sklearn. In the well-known case of repeat
offender prediction, we'll build a model without racial bias. Last, but not least,
we'll forecast time series of CO2 levels.
Online learning in this context (as opposed to internet-based learning) refers to a model update strategy
that incorporates training data that comes in sequentially. This can be useful in cases where the dataset is
very big (often the case with images, videos, and texts) or where it's important to keep the model up to date
given the changing nature of the data.
In many of these recipes, we've shortened the description to the most salient
details in order to highlight particular concepts. For the full details, please refer
to the notebooks on GitHub.
Technical requirements
The code and notebooks for this chapter are available on GitHub at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/tree/master/chapter02.
Perhaps this recipe will be informative in more ways than one, and we'll learn
something useful about the mechanics of human mating choices.
This data was gathered from participants in experimental speed dating events from 2002-2004. During the
events, the attendees would have a four-minute first date with every other participant of the opposite sex.
At the end of their 4 minutes, participants were asked whether they would like to see their date again. They
were also asked to rate their date on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and
shared interests. The dataset also includes questionnaire data gathered from participants at different points
in the process. These fields include demographics, dating habits, self-perception across key attributes,
beliefs in terms of what others find valuable in a mate, and lifestyle information.
The problem is to predict mate choices from what we know about participants
and their matches. This dataset presents some challenges that can serve an
illustrative purpose:
On the way to solving this problem of predicting mate choices, we will build
custom encoders in scikit-learn and a pipeline comprising all features and their
preprocessing steps.
Getting ready
We'll need the following libraries for this recipe. They are as follows:
OpenML to download the dataset
openml_speed_dating_pipeline_steps to use our custom transformer
imbalanced-learn to work with imbalanced classes
shap to show us the importance of features
OpenML is an organization that intends to make data science and machine learning reproducible and
therefore more conducive to research. The OpenML website not only hosts datasets, but also allows the
uploading of machine learning results to public leaderboards under the condition that the implementation
relies solely on open source. These results and how they were obtained can be inspected in complete detail.
In order to retrieve the data, we will use the OpenML Python API. The
get_dataset() method will download the dataset; with get_data(), we can get
pandas DataFrames for features and target, and we'll conveniently get the
information on categorical and numerical feature types:
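A rough sketch of what this looks like with the OpenML Python API (the dataset ID shown here is an assumption, so double-check it on the OpenML website):

import openml

# assumed OpenML ID of the SpeedDating dataset
dataset = openml.datasets.get_dataset(40536)
X, y, categorical_indicator, feature_names = dataset.get_data(
    dataset_format='dataframe',
    target=dataset.default_target_attribute,
)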
In the original version of the dataset, as presented in the paper, there was a lot more work to do. However,
the version of the dataset on OpenML already has missing values represented as numpy.nan, which lets
us skip this conversion. You can see this preprocessor on GitHub if you are interested:
https://github.com/benman1/OpenML-Speed-Dating
Alternatively, you can use a download link from the OpenML dataset web page
at https://www.openml.org/data/get_csv/13153954/speeddating.arff.
With the dataset loaded and the libraries installed, we are ready to get
cracking.
How to do it...
A few things stand out pretty quickly looking at this dataset. We have a lot of
categorical features. So, for modeling, we will need to encode them numerically,
as in the Modeling and predicting in Keras recipe in Chapter 1, Getting Started
with Artificial Intelligence in Python.
Some of these are actually encoded ranges. This means these are ordinal, in
other words, they are categories that are ordered; for example, the
d_interests_correlate feature contains strings like these:
Please pay attention to how the fit() and transform() methods are used. We
don't need to do anything in the fit() method, because we always apply the
same static rule. The transform() method applies this rule: as in the examples
we've seen previously, it iterates over the columns. This transformer also shows
the use of the parallelization pattern typical of scikit-learn. Additionally, since
these ranges repeat a lot and there aren't many distinct ones, we'll use a cache so
that, instead of repeating costly string transformations, the range value can be
retrieved from memory once it has been processed.
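A minimal sketch of such a range-parsing transformer is shown here; it is simplified (no parallelization over columns) and is not the library's exact implementation:

import re
from functools import lru_cache

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

@lru_cache(maxsize=None)
def parse_range(value):
    # convert a string such as '[4-6]' to the mean of its bounds
    if pd.isnull(value):
        return np.nan
    bounds = [float(b) for b in re.findall(r'-?\d+(?:\.\d+)?', value)]
    return np.mean(bounds) if bounds else np.nan

class RangeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, range_features=None):
        self.range_features = range_features

    def fit(self, X, y=None):
        # nothing to learn: the parsing rule is always the same
        return self

    def transform(self, X, y=None):
        X = X.copy()
        for col in (self.range_features or X.columns):
            X[col] = X[col].apply(parse_range)
        return X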
Personal preferences
Self-assessment
Assessment of the other person
It seems clear that differences between any of these features could be significant,
such as the importance of sincerity versus how sincere someone assesses a
potential partner. Therefore, our next transformer is going to calculate the
differences between numerical features. This is supposed to help highlight these
differences.
These features are derived from other features, and combine information from
two (or potentially more) features. Let's see what the
NumericDifferenceTransformer feature looks like:
This is a custom transformer that calculates differences between numerical
features. Please refer to the full implementation in the repository of the
OpenML-Speed-Dating library at https://github.com/benman1/OpenML-Speed-
Dating.
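For orientation, a simplified sketch of the general shape of such a transformer (not the library's exact code) could look like this:

import operator

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class NumericDifferenceTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, op=operator.sub):
        self.features = features
        self.op = op  # any binary function, for example operator.sub or operator.truediv

    def fit(self, X, y=None):
        # only collect the names of the numerical columns
        self.numeric_features_ = self.features or list(
            X.select_dtypes(include='number').columns
        )
        return self

    def transform(self, X, y=None):
        result = pd.DataFrame(index=X.index)
        cols = self.numeric_features_
        for i, col_a in enumerate(cols):
            for col_b in cols[i + 1:]:
                result[f'{col_a}_minus_{col_b}'] = self.op(X[col_a], X[col_b])
        return result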
This gives us a prefix or functional syntax for standard operators. Since we can
pass functions as arguments, this gives us the flexibility to specify different
operators between columns.
The fit() method this time just collects the names of numerical columns, and
we'll use these names in the transform() method.
Combining transformations
Now we have columns that are ranges, columns that are categorical, and
columns that are numerical, and we can assign pipeline steps to them.
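A hedged sketch of how these steps could be tied together with scikit-learn's ColumnTransformer follows; the column lists are placeholders, and in the full notebook this preprocessing is combined with feature selection and a classifier before evaluation:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# range_cols, categorical_cols, and numerical_cols are assumed to have been
# collected from the dataset beforehand
preprocessing = ColumnTransformer(transformers=[
    ('ranges', Pipeline([
        ('parse', RangeTransformer()),
        ('impute', SimpleImputer(strategy='constant', fill_value=-1)),
    ]), range_cols),
    ('categorical', Pipeline([
        ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
    ('numerical', Pipeline([
        ('impute', SimpleImputer(strategy='constant', fill_value=-1)),
        ('scale', StandardScaler()),
    ]), numerical_cols),
])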
This is a very good performance, as you can see comparing it to the leaderboard
on OpenML.
How it works...
There are a few things to point out regarding our approach. As we said before,
we have missing values, so we have to impute (meaning replace) missing values
with other values. In this case, we replace missing values with -1. In the case of
categorical variables, this will be a new category, while in the case of numerical
variables, it will become a special value that the classifier will have to handle.
There's more...
You can see the complete example with the speed dating dataset, a few more
custom transformers, and an extended imputation class in the GitHub repository
of the openml_speed_dating_pipeline_steps library and notebook, on GitHub
at https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/blob/master/chapter02/Transforming%20Data%20in%20Scikit-
Learn.ipynb.
See also
In this recipe, we used ANOVA f-values for univariate feature selection, which
is relatively simple, yet effective. Univariate feature selection methods are
usually simple filters or statistical tests that measure the relevance of a feature
with regard to the target. There are, however, many different methods for feature
selection, and scikit-learn implements a lot of them: https://scikit-
learn.org/stable/modules/feature_selection.html.
As a little extra, we will also demonstrate neuron importance for the models
developed in PyTorch. You can try out different network architectures in
PyTorch or model types. The focus in this recipe is on the methodology, not an
exhaustive search for the best solution.
Getting ready
In order to prepare for the recipe, we need to do a few things. We'll download
the data as in the previous recipe, Transforming data in scikit-learn, and perform
some preprocessing by following these steps:
We'll use one more library, captum, which allows the inspection of PyTorch
models for feature and neuron importance:
There is one more thing. We'll assume you have a GPU available. If you don't
have a GPU in your computer, we'd recommend you try this recipe on Colab. In
Colab, you'll have to choose a runtime type with GPU.
After all these preparations, let's see how we can predict house prices.
How to do it...
The Ames Housing dataset is a small- to mid-sized dataset (1,460 rows) with 81
features, both categorical and numerical. There are no missing values.
In the Keras recipe previously, we've seen how to scale the variables. Scaling is
important here because all variables have different scales. Categorical variables
need to be converted to numerical types in order to feed them into our model.
We have the choice of one-hot encoding, where we create dummy variables for
each categorical factor, or ordinal encoding, where we number all factors and
replace the strings with these numbers. We could feed the dummy variables in
like any other float variable, while ordinal encoding would require the use of
embeddings, linear neural network projections that re-order the categories in a
multi-dimensional space.
Now we can split the data into training and test sets, as we did in previous
recipes. Here, we add stratification over the (binned) numerical target variable.
This makes sure that different sections of the price range (five of them) are
represented in equal measure in both the training and test sets:
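A minimal sketch of such a stratified split, assuming X and y hold the features and the sale price:

import pandas as pd
from sklearn.model_selection import train_test_split

# bin the continuous target into five sections so that each price range
# is represented in equal measure in both splits
price_bins = pd.qcut(y, q=5, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=price_bins, random_state=0
)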
Before going ahead, let's look at the importance of the features using a model-
independent technique.
Before we run anything, however, let's make sure we are running on the GPU:
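This is the standard PyTorch check, shown here as a sketch:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)  # should print 'cuda' if a GPU is available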
Let's build our PyTorch model, similar to the Classifying in scikit-learn, Keras,
and PyTorch recipe in Chapter 1, Getting Started with Artificial Intelligence in
Python.
We'll implement a neural network regression with batch inputs using PyTorch.
This will involve the following steps:
3. Next, define the loss criterion and optimizer. We take the mean square error
(MSE) as the loss and stochastic gradient descent as our optimization
algorithm:
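A sketch of this step might look as follows; the learning rate is an assumption, and house_model is the network defined earlier in the recipe:

import torch
from torch import nn

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(house_model.parameters(), lr=0.001)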
Since this seems so much more verbose than what we saw in Keras in the
Classifying in scikit-learn, Keras, and PyTorch recipe in Chapter 1,
Getting Started with Artificial Intelligence in Python, we commented this
code quite heavily. Basically, we have to loop over epochs, and within
each epoch an inference is performed, an error is calculated, and the
optimizer applies adjustments according to the error.
The training is performed in a loop over epochs and, within each epoch,
in an inner loop over all the batches of the training data. Condensed, this
looks as follows:
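The following is a condensed sketch of that loop; train_loader and the number of epochs are assumptions, and the validation pass is omitted for brevity:

from tqdm import trange

n_epochs = 200  # an assumption; tune as needed
for epoch in trange(n_epochs):
    house_model.train()
    train_losses = []
    for batch_x, batch_y in train_loader:  # a DataLoader over the training set
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()             # reset gradients from the previous step
        preds = house_model(batch_x)      # forward pass (inference)
        loss = criterion(preds, batch_y)  # error calculation
        loss.backward()                   # backpropagate the error
        optimizer.step()                  # the optimizer adjusts the weights
        train_losses.append(loss.item())
    if epoch % 10 == 0:
        print(f'epoch {epoch}: train loss {sum(train_losses) / len(train_losses):.4f}')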
This is the output we get. TQDM provides us with a helpful progress bar.
At every tenth epoch, we print an update to show training and validation
performance:
Let's plot how our model performs for training and validation datasets
during training:
The following diagram shows the resulting plot:
We stopped our training just in time before our validation loss stopped
decreasing.
We can also rank and bin our target variable and plot the predictions
against it in order to see how the model is performing across the whole
spectrum of house prices. This is to avoid the situation in regression,
especially with MSE as the loss, that you only predict well for a mid-
range of values, close to the mean, but don't do well for anything else.
You can find the code for this in the notebook on GitHub. This is called a
lift chart (here with 10 bins):
We can see that the model, in fact, predicts very closely across the whole
range of house prices. In fact, we get a Spearman rank correlation of
about 93% with very high significance, which confirms that this model
performs with high accuracy.
How it works...
SGD works the same way as gradient descent except that it operates on a single
example (or a small batch) at a time. The interesting part is that convergence is
similar to that of gradient descent, while being much easier on computer memory.
RMSProp works by adapting the learning rate according to a moving average of
the squared gradients. It is closely related to Rprop, whose simplest variant checks
the signs of the last two gradients and increases the learning rate by a fraction if
they agree, or decreases it by a fraction if they differ.
ADAM is one of the most popular optimizers. It's an adaptive learning algorithm
that changes the learning rate according to the first and second moments of the
gradients.
Captum is a tool that can help us understand the ins and outs of the neural
network model learned on the datasets. It can assist in learning the following:
Feature importance
Layer importance
Neuron importance
This is very important in learning interpretable neural networks. Here, integrated
gradients have been applied to understand feature importance. Later, neuron
importance is also demonstrated by using the layer conductance method.
There's more...
Given that we have our neural network defined and trained, let's find the
important features and neurons using the captum library:
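A minimal sketch with captum's integrated gradients; X_test_tensor is assumed to be a float tensor of the test features on the right device:

from captum.attr import IntegratedGradients

ig = IntegratedGradients(house_model)
attributions = ig.attribute(X_test_tensor)
feature_importances = attributions.detach().cpu().numpy().mean(axis=0)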
Now, we have a NumPy array of feature importances.
Layer and neuron importance can also be obtained using this tool. Let's look at
the neuron importances of our first layer. We can pass on house_model.act1,
which is the ReLU activation function on top of the first linear layer:
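Sketched with captum's layer conductance method (again an approximation of the general shape rather than the book's exact code):

from captum.attr import LayerConductance

layer_cond = LayerConductance(house_model, house_model.act1)
neuron_attributions = layer_cond.attribute(X_test_tensor)
neuron_importances = neuron_attributions.detach().cpu().numpy().mean(axis=0)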
The diagram shows the neuron importances. Apparently, one neuron is just not
important.
We can also see the most important variables by sorting the NumPy array we've
obtained earlier:
Often, feature importances can help us to both understand the model and prune
our model to become less complex (and hopefully less overfitted).
See also
The PyTorch documentation includes everything you need to know about layer
types, data loading, losses, metrics, and training:
https://pytorch.org/docs/stable/nn.html
In this recipe, we will approach this with active learning, a strategy where we
actively decide what to explore (and learn) next. Our model will help decide
whom to call. Because we will update our model after each query (phone call),
we will use online learning models.
Getting ready
We'll prepare for our recipe by downloading our dataset and installing a few
libraries.
To model the likelihood of customers signing up for our product, we will use the
scikit-multiflow package that specializes in online models. We will also use the
category_encoders package again:
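The installation is presumably as simple as this, using the package names as published on PyPI:

pip install scikit-multiflow category_encoders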
How to do it...
We can see that curious wins out after the first few examples. Exploitation is
actually the least successful scheme. By not updating the model, performance
deteriorates over time:
This is an ideal scenario for active learning because, much as in reinforcement
learning, uncertainty can serve as an additional criterion alongside the expected
value of a customer. Over time, this entropy-reduction-seeking behavior
diminishes as the model's understanding of customers improves.
How it works...
It's worth delving a bit more into a few of the concepts and strategies employed
in this recipe.
Active learning
Active learning means that we can actively query for more information; in other
words, exploration is part of our strategy. This can be useful in scenarios where
we have to actively decide what to learn, and where what we learn influences not
only how much our model learns and how well, but also how much return on an
investment we can get.
Hoeffding Tree
The Hoeffding Tree (also known as the Very Fast Decision Tree, VFDT for
short) was introduced in 2001 by Geoff Hulten and others (Mining time-
changing data streams). It is an incrementally growing decision tree for
streamed data. Tree nodes are expanded based on the Hoeffding bound (or
additive Chernoff bound). It was theoretically shown that, given sufficient
training data, a model learned by the Hoeffding tree converges very closely to
the one built by a non-incremental learner.
It's important to note that the Hoeffding Tree doesn't deal with data distributions
that change over time.
Class weighting
Since we are dealing with an imbalanced dataset, let's use class weights. This
basically means that we are upsampling the minority (signing up) class and
downsampling the majority class (not signing up).
See also
You can find more resources and ideas regarding active learning from a recent
review that concentrates on biomedical image processing (Samuel Budd and
others, A Survey on Active Learning and Human-in-the-Loop Deep Learning for
Medical Image Analysis, 2019; https://arxiv.org/abs/1910.02923).
Our approach is inspired by the modAL Python active learning package, which
you can find at https://modal-python.readthedocs.io/. We recommend you check
it out if you are interested in active learning approaches. A few more Python
packages are available, as follows:
Alipy: Active Learning in Python: http://parnec.nuaa.edu.cn/huangsj/alipy/
Active Learning: A Google repo about active learning:
https://github.com/google/active-learning
One of the main decisions in active learning is the trade-off between exploration
and exploitation. You can find out more about this in a paper called Exploration
versus exploitation in active learning: a Bayesian approach:
http://www.vincentlemaire-labs.fr/publis/ijcnn_2_2010_camera_ready.pdf
Discrimination presents a major problem for AI systems, and illustrates the importance of auditing your
model and the data you feed into your model. Models built on human decisions will amplify human biases if
this bias is ignored. Not just from a legal perspective, but also ethically, we want to build models that don't
disadvantage certain groups. This poses an interesting challenge for model building.
Generally, we would think that justice should be blind to gender or race. This means that court decisions
should not take these sensitive variables like race or gender into account. However, even if we omit them
from our model training, these sensitive variables might be correlated to some of the other variables, and
therefore they can still affect decisions, to the detriment of protected groups such as minorities or women.
In this section, we are going to work with the COMPAS modeling dataset as
provided by ProPublica. We are going to check for racial bias, and then create a
model to remove it. You can find the original analysis by ProPublica at
https://github.com/propublica/compas-analysis.
Getting ready
Before we can start, we'll first download the data, mention issues in
preprocessing, and install the libraries.
ProPublica compiled their dataset from different sources, which they matched up
according to the names of offenders:
1. The column race is a protected category. It should not be used as a feature for
model training, but as a control.
2. There are full names in the dataset, which will not be useful, or might even
give away the ethnicity of the inmates.
3. There are case numbers in the dataset. These will likely not be useful for
training a model, although they might have some target leakage in the sense
that increasing case numbers might give an indication of the time, and there
could be a drift effect in the targets over time.
4. There are missing values. We will need to carry out imputation.
5. There are date stamps. These will probably not be useful and might even
come with associated problems (see point 3). However, we can convert these
features into UNIX epochs, which indicates the number of seconds that have
elapsed since 1970, and then calculate time periods between date stamps, for
example, by repurposing NumericDifferenceTransformer that we saw in an
earlier recipe. We can then use these periods as model features rather than the
date stamps.
6. We have several categorical variables.
7. The charge description (c_charge_desc) might need to be cleaned up.
We will use a few libraries in this recipe, which can be installed as follows:
category-encoders is a library that provides functionality for categorical
encoding beyond what scikit-learn provides.
How to do it...
Let's get some basic terminology out of the way first. We need to come up with
metrics for fairness. But what does fairness (or, if we look at unfairness, bias)
mean?
The former is also called equal odds, while the latter refers to equal false positive
rates. While equal opportunity means that each group should be given the same
chance regardless of their group, the equal outcome strategy implies that the
underperforming group should be given more lenience or chances relative to the
other group(s).
We'll go with the idea of false positive rates, which intuitively appeals, and
which is enshrined in law in many jurisdictions in the case of equal employment
opportunities. We'll provide a few resources about these terms in the See also
section.
Therefore, the logic for the impact calculation is based on values in the
confusion matrix, most importantly, false positives, which we've just mentioned.
These cases are predicted positive even though they are actually negative; in our
case, people predicted as reoffenders, who are not reoffenders. Let's write a
function for this:
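A condensed sketch of such a function follows; it combines the per-group confusion matrices into one summary, assumes that both classes occur in every group, and is not the book's exact implementation:

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

def group_rates(y_true, y_pred, group):
    '''False positive and false negative rates per sensitive group.'''
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rows = {}
    for g in np.unique(group):
        mask = group == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask]).ravel()
        rows[g] = {
            'FPR': fp / (fp + tn),  # flagged as reoffender, but did not reoffend
            'FNR': fn / (fn + tp),  # reoffended, but was not flagged
        }
    return pd.DataFrame(rows).T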
We can now use this function in order to summarize the impact on particular
groups.
This first calculates the confusion matrix, including true and false positives and
negatives, and then computes the adverse impact ratio (AIR), also known in
statistics as the Relative Risk Ratio (RRR). Given any performance metric, we can
write the following:
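The ratio in question is simply:

AIR = metric(protected group) / metric(norm group)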
This expresses an expectation that the metric for the protected group (African-
Americans) should be the same as the metric for the norm group (Caucasians). In
this case, we'll get 1.0. If the metric of the protected group differs from that of the
norm group by more than 20 percent (that is, the ratio is lower than 0.8 or higher
than 1.2), we'll flag it as significant discrimination.
Norm group: a norm group, also known as a standardization sample or norming group, is a sample of
the dataset that represents the population to which the statistic is intended to be compared. In the context of
bias, its legal definition is the group with the highest success, but in some contexts, the entire dataset or the
most frequent group are taken as the baseline instead. Pragmatically, we take the white group, since they
are the biggest group, and the group for which the model works best.
In the preceding function, we calculate the false positive rates by sensitive
group. We can then check whether the false positive rates for African-Americans
versus Caucasians are disproportionate, or rather whether the false positive rates
for African-Americans are much higher. This would mean that African-
Americans get flagged much more often as repeat offenders than they should be.
We find that this is indeed the case:
The FPR and FNR columns together give an idea of the general quality of the
model: if both are high, the model simply doesn't perform well for that group.
The last two columns express the adverse impact ratios of the FPR and FNR,
respectively, which is what we'll mostly focus on. We need to
reduce the racial bias in the model by reducing the FPR of African-Americans to
a tolerable level.
In the end, we create a new variable for stratification in order to make sure that
we have similar proportions in the training and test datasets for both recidivism
(our target variable) and whether someone is African-American. This will help
us to calculate metrics to check for discrimination:
We do some data engineering, deriving variables to record how many days
someone has spent in jail, has waited for a trial, or has waited for an arrest.
We'll build a neural network model using jax similar to the one we've
encountered in the Classifying in scikit-learn, Keras, and PyTorch recipe in
Chapter 1, Getting Started with Artificial Intelligence in Python. This time, we'll
do a fully fledged implementation:
This is a scikit-learn wrapper of a JAX neural network. For scikit-learn
compatibility, we inherit from ClassifierMixin and implement fit() and
predict(). The most important part here is the penalized MSE method, which,
in addition to model predictions and targets, takes into account a sensitive
variable.
Let's train it and check the performance. Please note that we feed in X, y, and
sensitive_train, which we define as the indicator variable for African-
American for the training dataset:
We visualize the statistics as follows:
How it works...
The key to making this work is a custom objective or loss function. This is far
from straightforward in scikit-learn, although we will show an implementation in
the following section.
Generally, there are different possibilities for implementing your own cost or
loss functions.
LightGBM, Catboost, and XGBoost each provide an interface with many loss
functions and the ability to define custom loss functions.
PyTorch and Keras (TensorFlow) also provide interfaces for custom loss functions.
You can implement your model from scratch (this is what we've done in the
main recipe).
Scikit-learn generally does not provide a public API for defining your own loss
functions. For many algorithms, there is only a single choice, and sometimes
there are a couple of alternatives. The rationale in the case of split criteria with
trees is that loss functions have to be performant, and only Cython
implementations will guarantee this. This is only available in a non-public API,
which means it is more difficult to use.
In neural networks, as long as you provide a differentiable loss function, you can
plug in anything you want.
Basically, we were able to encode the adverse impact as a penalty term with the
Mean Squared Error (MSE) function. It is based on the MSE that we've
mentioned before, but has a penalty term for adverse impact. Let's look again at
the loss function:
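The following is a minimal sketch of such a loss in JAX; the real class wraps it in fit() and predict(), and the exact form in the book's repository may differ:

import jax.numpy as jnp

def penalized_mse(preds, targets, sensitive):
    # 1. overall mean squared error
    err = jnp.mean((preds - targets) ** 2)
    # 2. error restricted to the protected group (sensitive == 1)
    err_s = jnp.mean((preds * sensitive - targets * sensitive) ** 2)
    # 3. adverse impact ratio, clipped so that only values above 1 penalize
    air = jnp.clip(err_s / err, 1.0, 2.0)
    # 4. scale the overall error by the adverse impact ratio
    return err * air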
The first thing to note is that instead of two variables, we pass three variables.
sensitive is the variable relevant to the adverse impact, indicating if we have a
person from a protected group.
1. We calculate the MSE overall, err, from model predictions and targets.
2. We calculate the MSE for the protected group, err_s.
3. We take the ratio of the MSE for the protected group over the MSE overall
(AIR) and limit it to between 1.0 and 2.0. We don't want values lower than 1,
because we are only interested in the AIR if it's negatively affecting the
protected group.
4. We then multiply AIR by the overall MSE.
As for step 2, the MSE for the protected group can be calculated by multiplying
the predictions and the targets each by sensitive, which cancels out all points
where sensitive is equal to 0.
As for step 4, it might seem that this would merely rescale the overall error, but
we see that it actually works. We probably could have added the two terms
instead, to give both errors a similar importance.
In the following, we'll use the non-public scikit-learn API to implement a custom
split criterion for decision trees. We'll use this to train a random forest model
with the COMPAS dataset:
We can see that, although we came a long way, we didn't completely remove all
bias. 30% (DFP for African-Americans) would still be considered unacceptable.
We could try different refinements or sampling strategies to improve the result.
Unfortunately, we wouldn't be able to use this model in practice.
See also
You can read up more on algorithmic fairness in different places. There's a wide
variety of literature available on fairness:
A Science Magazine article about the COMPAS model (Julia Dressel and
Hany Farid, 2018, The accuracy, fairness, and limits of predicting
recidivism): https://advances.sciencemag.org/content/4/1/eaao5580
A comparative study of fairness-enhancing interventions in machine learning
(Sorelle Friedler and others, 2018): https://arxiv.org/pdf/1802.04422.pdf
A Survey on Bias and Fairness in Machine Learning (Mehrabi and others,
2019): https://arxiv.org/pdf/1908.09635.pdf
The effect of explaining fairness (Jonathan Dodge and others, 2019):
https://arxiv.org/pdf/1901.07694.pdf
Different Python libraries are available for tackling bias (or, inversely,
algorithmic fairness):
fairlearn: https://github.com/fairlearn/fairlearn
AIF360: https://github.com/IBM/AIF360
FairML: https://github.com/adebayoj/fairml
BlackBoxAuditing: https://github.com/algofairness/BlackBoxAuditing
Balanced Committee Election: https://github.com/huanglx12/Balanced-
Committee-Election
While you can find many datasets on recidivism by performing a Google dataset
search (https://toolbox.google.com/datasetsearch), there are many more
applications and corresponding datasets where fairness is important, such as
credit scoring, face recognition, recruitment, or predictive policing, to name just
a few.
There are different places to find out more about custom losses. The article
Custom loss versus custom scoring (https://kiwidamien.github.io/custom-loss-
vs-custom-scoring.html) affords a good overview. For implementations of
custom loss functions in gradient boosting, towardsdatascience
(https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-
f79c1b40466d) is a good place to start.
Getting ready
In order to prepare for this recipe, we'll install libraries and download a dataset.
We will analyze the CO2 concentration data in this recipe. You can see the data
loading in the notebook on GitHub accompanying this recipe, or in the scikit-
learn Gaussian process regression (GPR) example regarding Mauna Loa CO2
data: https://scikit-
learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html#sphx-glr-
auto-examples-gaussian-process-plot-gpr-co2-py
The dataset contains the average CO2 concentration measured at the Mauna Loa
Observatory in Hawaii from 1958 to 2001. We will model the CO2 concentration
as a function of time.
How to do it...
Now we'll get to forecasting our time series of CO2 data. We'll first explore the
dataset, and then we'll apply the ARIMA and SARIMA techniques.
The script here shows the time series seasonal decomposition of the CO2
data, showing a clear seasonal variation in the CO2 concentration, which
can be traced back to the biology:
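A sketch with statsmodels, where co2_series is assumed to be a monthly pandas Series of the CO2 values:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(co2_series, model='additive', period=12)
decomposition.plot()
plt.show()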
Here, we see the decomposition: the observed time series, its trend,
seasonal components, and what remains unexplained, the residual
element:
Now, let's analyze the time series.
We'll define our two models and apply them to each point in the test dataset.
Here, we iteratively fit the model on all the points and predict the next point, as a
one-step-ahead.
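Sketched with statsmodels' SARIMAX, where the orders are placeholders and an ARIMA model is simply the special case without a seasonal component:

from statsmodels.tsa.statespace.sarimax import SARIMAX

def one_step_ahead(train, test, order=(2, 1, 1), seasonal_order=(1, 1, 1, 12)):
    history = list(train)
    predictions = []
    for actual in test:
        model = SARIMAX(history, order=order, seasonal_order=seasonal_order)
        fitted = model.fit(disp=False)
        predictions.append(fitted.forecast(steps=1)[0])  # predict the next point
        history.append(actual)  # then reveal the true value and refit
    return predictions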
This leaves us with 468 samples for training and 53 for testing.
How it works...
Time series data is a collection of observations x(t), where each data point is
recorded at time t. In most cases, time is a discrete variable, that is, t = 1, 2, 3, ...
In order to explain the models that we've used, ARIMA and SARIMA, we'll
have to go step by step, and explain each in turn:
Autoregression (AR)
Moving Average (MA)
Autoregressive Moving Average (ARMA)
Autoregressive Integrated Moving Average (ARIMA) and
Seasonal Autoregressive Integrated Moving Average (SARIMA)
ARMA is a linear model, defined in two parts. First, the autoregressive linear
model:
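In the standard formulation, the autoregressive part models the current value as a linear combination of past values, and the moving-average part adds lagged error terms:

$x_t = c + \sum_{i=1}^{p} \varphi_i x_{t-i} + \varepsilon_t$

$x_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$

ARMA(p, q) combines both parts in a single equation.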
The fitting procedure is a bit involved, particularly because of the MA part. You
can read up on the Box-Jenkins method on Wikipedia if you are interested:
https://en.wikipedia.org/wiki/Box%E2%80%93Jenkins_method
There are a few limitations to note, however. The time series has to be stationary,
in the sense that its mean and variance do not change over time, and it must not
contain seasonality. There are different extensions of ARMA that address these
limitations, and that's where ARIMA and SARIMA come in.
The integration refers to differencing. In order to stabilize the mean, we can take
the difference between consecutive observations. This can also remove a trend or
eliminate seasonality. It can be written as follows:
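In the standard notation, first-order differencing is $y'_t = y_t - y_{t-1}$.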
This can be repeated several times, and this is what the parameter d describes
that ARIMA comes with. Please note that ARIMA can handle drifts and non-
stationary time series. However, it is still unable to handle seasonality.
There's more...
See also
Statsmodels: http://statsmodels.sourceforge.net/stable/
Prophet: https://facebook.github.io/prophet/
There are many more interesting libraries relating to time series, including the
following:
Time series modeling using state space models in statsmodels:
https://www.statsmodels.org/dev/statespace.html
GluonTS: Probabilistic Time Series Models in MXNet (Python):
https://gluon-ts.mxnet.io/
SkTime: A unified interface for time series modeling:
https://github.com/alan-turing-institute/sktime
Patterns, Outliers, and
Recommendations
In order to gain knowledge from data, it's important to understand the structure
behind a dataset. The way we represent a dataset can make it more intuitive to
work with in a certain way, and consequently, easier to draw insights from it.
The law of the instrument states that when holding a hammer, everything seems
like a nail (based on Abraham Maslow's The Psychology of Science, 1966) and is
about the tendency to adapt jobs to the available tools. There is no silver bullet,
however, as all methods have their drawbacks given the problem at hand. It's
therefore important to know the basic methods in the arsenal of available tools in
order to recognize the situations where we should use a hammer as opposed to a
screwdriver.
Getting ready
For this recipe, we'll be using a dataset of credit risk, usually referred to in full as
the German Credit Risk dataset. Each row describes a person who took a loan,
gives us a few attributes about the person, and tells us whether the person paid
the loan back (that is, whether the credit was a good or bad risk).
We'll need to download and load up the German credit data as follows:
For visualizations, we'll use the dython library. The dython library works
directly on categorical and numeric variables, and makes adjustments for
numeric-categorical or categorical-categorical comparisons. Please see the
documentation for details, at http://shakedzy.xyz/dython/. Let's install the library
as follows:
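The package is available on PyPI:

pip install dython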
We can now play with the German credit dataset, visualize it with dython, and
see how the people represented inside can be clustered together in different
groups.
How to do it...
We'll first visualize the dataset, do some preprocessing, and apply a clustering
algorithm. We'll try to make sense out of the clusters, and – with the new
insights – cluster again.
1. Visualizing correlations: In this recipe, we'll use the dython library. We can
calculate the correlations with dython's associations function, which calls
categorical, numerical (Pearson correlation), and mixed categorical-numerical
correlation functions depending on the variable types:
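A minimal sketch of the call, where data is the German credit DataFrame loaded earlier (the exact arguments used in the book may differ):

from dython.nominal import associations

associations(data)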
This call not only calculates correlations, but also cleans up the
correlation matrix by clustering variables together that are correlated.
The data is visualized as shown in the following screenshot:
We can't really see clear cluster demarcations; however, there seem to be
a few groups if you look along the diagonal.
Also, a few variables such as telephone and job stand out a bit from the
rest. In the notebook on GitHub, we've tried dimensionality reduction
methods to see if this would help our clustering. However,
dimensionality reduction didn't work that well, while clustering directly
worked better: https://github.com/PacktPublishing/Artificial-Intelligence-
with-Python-
Cookbook/blob/master/chapter03/Clustering%20market%20segments.ip
ynb.
As the first step for clustering, we'll convert some variables into dummy
variables; this means we will do a one-hot-encoding of the categorical
variables.
The inertia is the sum of squared distances from each data point to its closest
cluster center. A visual criterion for choosing the best number of clusters (the
hyperparameter k in the k-means clustering algorithm) is called the elbow
criterion.
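A sketch of the elbow inspection, where X_dummies is the one-hot-encoded data and the range of k is an assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X_dummies)
    inertias.append(km.inertia_)  # sum of squared distances to the closest centers

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()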
We see in this little excerpt that differences are largely due to differences
in credit amount. This brings us back to where we started out, namely
that we largely get out of the clustering what we put in. There's no trivial
way of resolving this problem, but we can select the variables we want to
focus on in our clusters.
We can now produce the overview table again in order to view the
cluster stats:
And here comes the new summary:
I would argue this is more useful than the previous clustering, because it clearly
shows us which customers can make us money, and highlights other differences
between them that are relevant to marketing.
How it works...
From this premise, we then tried different methods and evaluated them against
our business goal.
If you've paid attention when looking at the recipe, you might have noticed that
we don't standardize our output (z-scores). In standardization with the z-score, a
raw score x is converted into a standard score by subtracting the mean and
dividing by the standard deviation, so every standardized variable has a mean of
0 and a standard deviation of 1:
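In symbols, $z = (x - \mu) / \sigma$.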
In the next section, we'll go more into detail with the k-means algorithm.
There's more...
PCA was proposed in 1901 (by Karl Pearson, in On Lines and Planes of Closest
Fit to Systems of Points in Space) and k-means in 1967 (by James MacQueen, in
Some Methods for Classification and Analysis of Multivariate Observations).
While both methods had their place when data and computing resources were
hard to come by, today many alternatives exist that can work with more complex
relationships between data points and features. On a personal note, as the authors
of this book, we often find it frustrating to see methods that rely on normality or
a very limited kind of relationship between variables, such as classic methods
like PCA or K-means, especially when there are so many better methods.
Both PCA and k-means have serious shortcomings that affect their usefulness in
practice. Since PCA operates over the correlation matrix, it can only find linear
correlations between data points. This means that if variables were related, but
not linearly (as you would see in a scatter plot), then PCA would fail.
Furthermore, PCA is based on mean and variance, which are parameters for
Gaussian distribution. K-means, being a centroid-based clustering algorithm, can
only find spherical groups in Euclidean space – that is, it fails to uncover any
more complicated structures. More information on this can be found at
https://developers.google.com/machine-
learning/clustering/algorithm/advantages-disadvantages.
Other robust, nonlinear methods are available, for example, affinity propagation,
fuzzy c-means, agglomerative clustering, and others. However, it's important to
remember that, although these methods separate data points into groups, the
following statements are also true:
Let's look at the k-means algorithm in more detail. It's actually really simple and
can be written down from scratch in numpy or jax. This implementation is based
on the one in NapkinML (https://github.com/eriklindernoren/NapkinML):
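What follows is a compact NumPy sketch in the same spirit, not the NapkinML code itself; it also assumes that no cluster ever becomes empty:

import numpy as np

class KMeans:
    def __init__(self, k=3, n_iter=100):
        self.k = k
        self.n_iter = n_iter

    def fit(self, X):
        rng = np.random.default_rng(0)
        # start from k randomly chosen points as the initial centers
        self.centers = X[rng.choice(len(X), self.k, replace=False)]
        for _ in range(self.n_iter):
            # 1. distances between each point and each cluster center
            dists = np.linalg.norm(X[:, None, :] - self.centers[None, :, :], axis=2)
            # 2. assign each point to its closest cluster center
            self.labels_ = dists.argmin(axis=1)
            # 3. recompute the centers as the arithmetic mean of the assigned points
            self.centers = np.array(
                [X[self.labels_ == i].mean(axis=0) for i in range(self.k)]
            )
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centers[None, :, :], axis=2)
        return dists.argmin(axis=1)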
The main logic – as should be expected – is in the fit() method. It comes in
three steps that are iterated as follows:
1. Calculate the distances between each point and the centers of the clusters.
2. Each point gets assigned to the cluster of its closest cluster center.
3. The cluster centers are recalculated as the arithmetic mean.
It's surprising that such a simple idea can result in something that looks
meaningful to human observers. Here's an example of it being used. Let's try it
out with the Iris dataset that we already know from the Classifying in scikit-
learn, Keras, and PyTorch recipe in Chapter 1, Getting Started with Artificial
Intelligence in Python:
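For instance, using the from-scratch class sketched above:

from sklearn import datasets

iris = datasets.load_iris()
kmeans = KMeans(k=3).fit(iris.data)
print(kmeans.labels_[:10])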
See also
Discovering anomalies
An anomaly is anything that deviates from the expected or normal outcomes.
Detecting anomalies can be important in Industrial Process Monitoring (IPM),
where data-driven fault detection and diagnosis can help achieve higher levels of
safety, efficiency, and quality.
In this recipe, we'll look at methods for outlier detection. We'll go through an
example of outlier detection in a time series with Python Outlier Detection
(pyOD), a toolbox for outlier detection that implements many state-of-the-art
methods and visualizations. PyOD's documentation can be found at
https://pyod.readthedocs.io/en/latest/.
Getting ready
This recipe will focus on finding outliers. We'll demonstrate how to do this with
the pyOD library including an autoencoder approach. We'll also outline the
upsides and downsides to the different methods.
The streams of data are time series of key performance indicators (KPIs) of
website performance. This dataset is provided in the DONUT outlier detector
repository, available at https://github.com/haowen-xu/donut.
Please note that it's usually better to use the Keras version that ships with
TensorFlow.
Let's have a look at our dataset, and then apply different outlier detection
methods.
How to do it...
We'll cover different steps and methods in this section. They are as follows:
1. Visualizing
2. Benchmarking
3. Running an isolation forest
4. Running an autoencoder
This time series of KPIs is geared toward monitoring the operation and
maintenance of web services. They come with a label that indicates an
abnormality – in other words, an outlier – if a problem has occurred with
the service:
This is the resulting plot, where the dots represent outliers:
The following plot is the outlier distribution density, where the values of
the time series are on the x axis, and the two lines show what's
recognized as normal and what's recognized as an outlier, respectively –
0 indicates normal data points, and 1 indicates outliers:
Outliers (shown with the dotted line) are hardly distinguishable from
normal data points (the squares), so we won't be expecting perfect
performance.
Before we go on and test methods for outlier detection, let's set down a
process for comparing them, so we'll have a benchmark of the relative
performances of the tested methods.
Now let's write a testing function that we can use with different outlier
detection methods:
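A simplified sketch of such a testing function for pyOD-style detectors (the book's version produces richer output):

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score

def test_detector(model, X_train, X_test, y_test):
    model.fit(X_train)
    y_pred = model.predict(X_test)            # 0 = normal, 1 = outlier
    scores = model.decision_function(X_test)  # raw outlier scores
    print(classification_report(y_test, y_pred))
    print('AUC:', roc_auc_score(y_test, scores))
    plt.scatter(range(len(X_test)), X_test[:, 0], c=y_pred, s=3)
    plt.show()
    return y_pred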
This function tests an outlier detection method on the dataset. It trains a
model, gets performance metrics from the model, and plots a
visualization.
Now that this is done, let's test two methods for outlier detection: the
isolation forest and an autoencoder.
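Using the testing function sketched above, the two detectors can be tried out roughly like this (the hyperparameters are assumptions):

from pyod.models.auto_encoder import AutoEncoder
from pyod.models.iforest import IForest

test_detector(IForest(contamination=0.01), X_train, X_test, y_test)
test_detector(
    AutoEncoder(hidden_neurons=[16, 8, 8, 16], contamination=0.01),
    X_train, X_test, y_test,
)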
We can see from the following graph, however, that there are no 1s
(predicted outliers) in the lower range of the KPI spectrum. The model
misses out on outliers in the lower range:
This doesn't look too bad, actually – values in the mid-range are
classified as normal, while values on the outside of the spectrum are
classified as outliers.
How it works...
Outliers are extreme values that deviate from other observations on the data.
Outlier detection is important in many domains, including network security,
finance, traffic, social media, machine learning, the monitoring of machine
model performance, and surveillance. A host of algorithms have been proposed
for outlier detection in these domains. The most prominent algorithms include k-
Nearest Neighbors (kNN), Local Outlier Factors (LOF), and the isolation
forest, and more recently, autoencoders, Long Short-Term Memory (LSTM),
and Generative Adversarial Networks (GANs). We'll discover some of these
methods in later recipes. In this recipe, we've used kNN, an autoencoder, and the
isolation forest algorithm. Let's talk about these three methods briefly.
k-nearest neighbors
In the kNN approach to outlier detection, each point is scored by its distance to its
k nearest neighbors; points that lie far away from their neighbors receive high
outlier scores.
Isolation forest
The idea of the isolation forest is relatively simple: create random decision trees
(meaning each split uses a randomly chosen feature and a randomly chosen split
value) until each point is isolated in its own leaf. The shorter the average path
length through the trees to a point's terminal node, the more likely the point is to
be an outlier.
You can find out more details about the isolation forest in its original publication by Liu et al., Isolation
Forest. ICDM 2008: 413–422: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf.
Autoencoder
An autoencoder consists of two parts: the encoder and the decoder. What we are
really trying to learn is the transformation of the encoder, which gives us a code
or the representation of the data that we look for.
More formally, we can define the encoder as a function that maps an input to a
code or intermediate representation, and the decoder as a function that maps this
code back to a reconstruction of the input.
The autoencoder represents the data in an intermediate network layer, and the
more closely data points can be reconstructed from that intermediate
representation, the less of an outlier they are.
See also
A fantastic resource for material about outlier detection is the PyOD author's
dedicated repository, available at https://github.com/yzhao062/anomaly-
detection-resources.
We'll do the following: given a dataset of paired string matches, we'll try out
different functions for measuring string similarity, then a bag-of-characters
representation, and finally a Siamese neural network (also called a twin neural
network) dimensionality reduction of the string representation. We'll set up a
twin network approach for learning a latent similarity space of strings based on
character n-gram frequencies.
A Siamese neural network, also sometimes called twin neural network, is named as such using the
analogy of conjoined twins. It is a way to train a projection or a metric space. Two models are trained at
the same time, and the output of the two models is compared. The training takes the comparison output
rather than the models' outputs.
Getting ready
We'll use a dataset of paired strings, where they are either matched or not based
on whether they are similar:
The dataset includes pairs of strings that either correspond to each other or don't
correspond. It starts like this:
There's also a test dataset available from the same GitHub repo:
Finally, we'll use a few libraries in this recipe that we can install like this:
How to do it...
As mentioned before, we'll first calculate the baseline using standard string
comparison functions, then we'll use a bag-of-characters approach, and then
we'll learn a projection using a Siamese neural network approach. You can find
the corresponding notebook on the book's GitHub repo, at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/blob/master/chapter03/Representing%20for%20similarity%20search.
ipynb.
Now we can run over all string pairs and calculate the string distances based on
each of the three methods. For each of the three algorithms, we can calculate the
area under the curve (AUC) score to see how well it does at separating
matching strings from non-matching strings:
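As an illustration of the pattern, with a single standard-library similarity function and assumed column names (the recipe compares several such measures):

from difflib import SequenceMatcher
from sklearn.metrics import roc_auc_score

def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarities = [ratio(a, b) for a, b in zip(pairs['string1'], pairs['string2'])]
print('AUC:', roc_auc_score(pairs['matched'], similarities))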
The AUC scores for all algorithms are around 95%, which seems good. All three
distances perform quite well already. Let's try to beat this.
Bag-of-characters approach
As you can see in the AUC score of about 93%, this approach doesn't yet
perform quite as well overall, although the performance is not completely bad.
So let's try to tweak this.
Now we'll implement a Siamese network to learn a projection that represents the
similarities (or differences) between strings.
The Siamese network approach may seem a little daunting if you are not familiar
with it. We'll discuss it further in the How it works... section.
Next, we need to create the conjoined twins of the two models. For this, we
need a comparison function. We take the normalized Euclidean distance. This is
the Euclidean distance between the two L2-normalized projected vectors.
As we've mentioned before, the output of our combined network is the Euclidean
distance between the two outputs. This means we have to invert our target
(matched) column in order to change the meaning from similar to distant, so that
1 corresponds to different and 0 to the same. We can do this easily by
subtracting from 1.
How it works...
A Siamese network training is the situation where two (or more) neural networks
are trained against each other by comparing the output of the networks given a
pair (or tuple) of inputs and the knowledge of the difference between these
inputs. Often the Siamese network consists of the same network (that is, the
same weights). The comparison function between the two network outputs can
be metrics such as the Euclidean distance or the cosine similarity. Since we
know whether the two inputs are similar, and even how similar they are, we can
train against this knowledge as the target.
The following diagram illustrates the information flow and the different building
blocks that we'll be using:
Given the two strings that we want to compare, we'll use the same model to
create features from each one, resulting in two representations. We can then
compare these representations, and we hope that the comparison correlates with
the outcome, so that if our comparison shows a big difference, the strings will be
dissimilar, and if the comparison shows little difference, then the strings will be
similar.
We can actually directly train this complete model, given a string comparison
model and a dataset consisting of a pair of strings and a target. This training will
tune the string featurization model so the representation will be more useful.
Recommending products
In this recipe, we'll be building a recommendation system. A recommender is an
information-filtering system that predicts rankings or similarities by bringing
content and social connections together.
We'll download a dataset of book ratings that have been collected from the
Goodreads website, where users rank and review books that they've read. We'll
build different recommender models, and we'll suggest new books based on
known ratings.
Getting ready
To prepare for our recipe, we'll download the dataset and install the required
dependencies.
Let's get the dataset and install the two libraries we'll use here – spotlight and
lightfm are recommender libraries:
[It] contains (at a minimum) a pair of user-item interactions, but can also be enriched with ratings,
timestamps, and interaction weights.
For implicit feedback scenarios, user IDs and item IDs should only be provided for user-item pairs where
an interaction was observed. All pairs that are not provided are treated as missing observations, and often
interpreted as (implicit) negative signals.
For explicit feedback scenarios, user IDs, item IDs, and ratings should be provided for all user-item-
rating triplets that were observed in the dataset.
Next, we'll implement a function to get the book titles by id. This will be useful
for showing our recommendations later:
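A sketch of such a lookup, assuming a books DataFrame with book_id and title columns:

def get_book_title(book_id):
    match = books.loc[books['book_id'] == book_id, 'title']
    return match.iloc[0] if len(match) else None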
How to do it...
We'll first use a matrix factorization model, then a deep learning model. You can
find more examples in the Jupyter notebook available at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/blob/master/chapter03/Recommending_products.ipynb.
How it works...
Predictions based on the assumption that customers who have shown similar
tastes in previous purchases will buy similar items in the future (collaborative
filtering).
Predictions based on the idea that customers will have an interest in items
similar to the ones they've bought in the past (content-based filtering).
Predictions based on a combination of collaborative filtering, content-based
filtering, or other approaches (a hybrid recommender).
Both models we've tried are based on the idea that we can separate the influences
of users and items. We'll explain each model in turn, and how they combine
approaches, but first let's explain the metric we are using: precision at k.
Precision at k
Precision at k doesn't take into account the ordering within the top k results, nor
does it include how many of the really good results that we absolutely should
have captured are actually returned: that would be recall. That said, precision at
k is a sensible metric, and it's intuitive.
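A small sketch to make the definition concrete:

import numpy as np

def precision_at_k(recommended, relevant, k=10):
    '''Fraction of the top-k recommended items that are actually relevant.'''
    top_k = recommended[:k]
    return np.mean([item in set(relevant) for item in top_k])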
Matrix factorization
Matrix decomposition or matrix factorization is the factorization of a matrix into a product of matrices.
Many different such decompositions exist, serving a variety of purposes.
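In general form, the prediction in such a logistic matrix factorization model can be written as $\hat{r}_{ui} = \sigma(p_u \cdot q_i + b_u + b_i)$, where $p_u$ and $q_i$ are the latent vectors for user $u$ and item $i$, and $b_u$ and $b_i$ are the bias terms.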
In the preceding function, we have bias terms for users and items, and σ is the
sigmoid function.
The model training maximizes the likelihood of the data conditional on the
parameters expressed as follows:
There are different ways to measure how well recommenders are performing
and, as always, which one we choose to use depends on the goal we are trying to
achieve.
See also
Again, there are a lot of libraries around that make it easy to get up and running.
First of all, I'd like to highlight these two, which we've already used in this
recipe:
lightfm: https://github.com/lyst/lightfm
Spotlight: https://maciejkula.github.io/spotlight
You can find a demonstration of library functionality for item ranking with a
dataset at the following repo:
https://github.com/cheungdaven/DeepRec/blob/master/test/test_item_ranking.py.
Last, but not least, you might find the following reading list about recommender
systems useful:
https://github.com/DeepGraphLearning/RecommenderSystems/blob/master/read
ingList.md.
Getting ready
In order to get everything in place for the recipe, we'll install the required
libraries and we'll download a dataset.
Furthermore, we'll use SciPy, but this comes with the Anaconda distribution:
The Credit Card Fraud dataset contains transactions made by credit cards in September 2013 by European
cardholders. This dataset presents transactions that occurred over two days, with 492 fraudulent
transactions out of 284,807 transactions. The dataset is highly unbalanced: the positive class (fraud)
accounts for only 0.172% of all transactions.
Let's import the dataset, and then split it into training and test sets:
We are ready! Let's do the recipe!
How to do it...
We'll first create an adjacency matrix, then we can apply the community
detection methods to it, and lastly, we'll evaluate the quality of the generated
communities. The whole process has the added difficulty associated with a large
dataset, which means we can only apply certain algorithms.
First, we need to calculate the distances between all pairs of points. This is a real
problem with a dataset as large as this one. You can find several approaches online.
We use the annoy library from Spotify for this purpose, which is very fast and
memory-efficient:
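A sketch of building such an index, where X is the sampled feature matrix and the numbers of trees and neighbours are assumptions:

from annoy import AnnoyIndex

index = AnnoyIndex(X.shape[1], 'euclidean')
for i, row in enumerate(X):
    index.add_item(i, row)
index.build(10)  # 10 trees

# approximate nearest neighbours (and distances) for point 0
neighbours, distances = index.get_nns_by_item(0, 10, include_distances=True)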
We can then initialize our adjacency matrix with the distances as given by our
index:
We can now apply some community detection algorithms.
The size of our matrix leaves us with limited choice. We'll apply the following
two algorithms: strongly connected components (SCC) and the Louvain algorithm.
We can apply the SCC algorithm directly onto the adjacency matrix as follows:
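With SciPy, this looks roughly as follows, where adjacency is the sparse matrix built above:

from scipy.sparse.csgraph import connected_components

n_components, labels = connected_components(
    adjacency, directed=True, connection='strong'
)
print(n_components)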
For the second algorithm, we first need to convert the adjacency matrix to a
graph; this means that we treat each point in the matrix as an edge between
nodes. In order to save space, we use a simplified graph class for this:
Then we can apply the Louvain algorithm as follows:
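The book's notebook uses its own lightweight graph class to save memory; with enough memory available, the same algorithm is provided by the python-louvain package, roughly like this:

import networkx as nx
import community as community_louvain  # from the python-louvain package

G = nx.from_scipy_sparse_matrix(adjacency)  # from_scipy_sparse_array in networkx 3.x
partition = community_louvain.best_partition(G)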
Now we have two different partitions of our dataset. Let's find out if they are
worth anything!
In the ideal case, we'd expect that some communities have only fraudsters in
them, while others (most) have none at all. This purity is what we would be
looking for in a perfect community. However, since we also possibly want some
suggestions of who else might be a fraudster, we would anticipate some points to
be labeled as fraudsters in a majority-nonfraudster group and vice versa.
This shows that communities have a very high frequency of people who are not
fraudsters, and very few other values. But can we quantify how good this is?
We can then create appropriately chosen random experiments to see if any other
community assignments would have resulted in a better class entropy. If we
randomly shuffle the fraudsters and then calculate the entropy across
communities, we get an entropy distribution. This will give us a p-value, the
statistical significance, for the entropy of the Louvain communities.
The p-value is the probability that we get a distribution like this (or better) purely by chance.
You can find the implementation for the sampling in the notebook on GitHub.
We get a very low p-value, meaning that it is highly unlikely we would have
gotten anything like this by chance, which leads us to conclude that we have
found meaningful clusters in terms of identifying fraudsters.
How it works...
The hardest part of network analysis with a big dataset is constructing the
adjacency matrix. You can find different approaches in the notebook available
online. The two problems are runtime and memory use, both of which grow
quadratically with the number of data points.
Our dataset contains 284,807 points. This means a full connectivity matrix
between all points, at 4 bytes per entry, would take a few hundred gigabytes:
284,807² × 4 bytes ≈ 324 GB.
We use a sparse matrix in which connections are stored only if they don't exceed
a given distance threshold, while everything else is implicitly 0. We represent
each connection between points as a Boolean (1 bit), and we take a sample of
33% (93,986 points) rather than the full dataset.
Let's go through two graph community algorithms to get an idea of how they
work.
Louvain algorithm
We've used the Louvain algorithm in this recipe. The algorithm was published in
2008 by Blondel et al. (https://arxiv.org/abs/0803.0476). Since its time
complexity is O(n log n), the Louvain algorithm can be and has been used with big
datasets, including data from Twitter containing 2.4 million nodes and 38
million links.
For all vertices, assign each one to the community for which the gain in
modularity is the highest. This step can be repeated a few times until no further
improvement in modularity occurs.
All communities are treated as vertices. This means that edges are also
grouped together so that all edges that are part of the vertices that were
grouped together are now edges of the newly created vertex.
These two steps are iterated until no further improvement in modularity occurs.
Girvan–Newman algorithm
The Girvan–Newman algorithm progressively removes the edges with the highest
edge betweenness so that the network gradually falls apart into communities. The
result is a dendrogram that shows the arrangement of clusters by the steps of the
algorithm.
Information entropy
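For a discrete variable, the entropy is defined as $H(X) = -\sum_{x} p(x) \log_2 p(x)$, the expected amount of information (in bits) carried by its outcomes.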
If a variable is not discrete, we can apply binning (for example, via a histogram)
or use non-discrete versions of the formula.
There's more...
We could have also applied more traditional clustering algorithms. For example,
the affinity propagation algorithm takes an adjacency matrix, as follows:
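Roughly as follows; note that affinity='precomputed' expects similarities, so a distance-based adjacency matrix would need to be negated first:

from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation(affinity='precomputed', random_state=0)
labels = ap.fit_predict(similarity_matrix)  # a precomputed similarity matrix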
There are a host of other methods that we could apply. For some of them, we'd
have to convert the adjacency matrix to a distance matrix.
See also
You can find reading materials about graph classification and graph algorithms
on GitHub, collected by Benedek Rozemberczki, at
https://github.com/benedekrozemberczki/awesome-graph-classification.
There are some very nice graph libraries around for Python with many
implementations for community detection or graph analysis:
Cdlib: https://cdlib.readthedocs.io/en/latest/
Karateclub: https://karateclub.readthedocs.io/en/latest/
Networkx: https://networkx.github.io/
Label propagation: https://github.com/yamaguchiyuto/label_propagation
Snap.py: https://snap.stanford.edu/snappy/index.html
Python-louvain: https://github.com/taynaud/python-louvain
Graph-tool: https://graph-tool.skewed.de/
Cdlib also contains the BigClam algorithm, which works with big graphs.
Some graph databases such as neo4j, which comes with a Python interface,
implement community detection algorithms: https://neo4j.com/docs/graph-
algorithms/current/.
Probabilistic Modeling
This chapter is about uncertainty and probabilistic approaches. State-of-the-art
machine learning systems have two significant shortcomings. Firstly, they
typically deliver point predictions without quantifying how uncertain those
predictions are. Secondly, the more complex a machine learning system is, the
more data we need to fit our model, and the more severe the risk of overfitting.
In this chapter, we'll build a stock-price prediction model with different plug-in
methods for confidence estimation. We'll then cover estimating customer
lifetimes, a common problem for businesses that serve customers. We'll also look
at diagnosing a disease, and we'll quantify credit risk, taking into account
different types of uncertainty.
Technical requirements
In this chapter, we mainly use the following:
scikit-learn, as before
Keras, as before
Lifetimes (https://lifetimes.readthedocs.io/), a library for customer lifetime
value
tensorflow-probability (tfp; https://www.tensorflow.org/probability)
In this recipe, we'll build a simple stock prediction pipeline in scikit-learn, and
we'll produce probability estimates using different methods. We'll then evaluate
our different approaches.
Getting ready
How to do it...
In a practical setting, we'd want to answer the following question: given the level
of prices, are they going to rise or to fall, and how much?
In order to make progress toward this goal, we'll proceed with the following
steps:
1. Download stock prices.
2. Create a featurization function.
3. Write an evaluation function.
4. Train models to predict stocks and compare performance.
For confidence estimation, we'll compare the following methods:
Platt scaling
Naive Bayes
Isotonic regression
We'll discuss these methods and their background in the How it works... section.
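1. Download stock prices: one way to obtain them (the book may use a different data source or ticker; MSFT is an arbitrary choice here) is the yfinance library:
import yfinance as yf

ticker = yf.Ticker('MSFT')           # the ticker symbol is an arbitrary choice
hist = ticker.history(period='max')  # daily prices as a pandas DataFrame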
Now we have our stock prices available as the pandas DataFrame hist.
2. Create a featurization function: So, let's start with a function that will give
us a dataset for training and prediction given a window size and a shift;
basically, how many descriptors we want for each price and how far we look
into the future:
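A minimal sketch of such a featurization; the 'Close' column, the window of 30 returns, and the 7-step look-ahead are assumptions that may differ from the notebook's version:
import numpy as np

def featurize(prices, window_size=30, shift=7):
    """Each row of X holds window_size returns; y is the return shift steps ahead."""
    returns = np.diff(prices) / prices[:-1]  # percentage changes
    X, y = [], []
    for i in range(len(returns) - window_size - shift):
        X.append(returns[i:i + window_size])
        y.append(returns[i + window_size + shift - 1])
    return np.array(X), np.array(y)

X, y = featurize(hist['Close'].values)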
We can see that there's a skew to the left, in the sense that more values
are below zero (about 49%) than above (about 43%). This means that, in the
training data, prices go down more often than they go up.
We are not done with our dataset yet, however; we need to do one more
transformation. Our scenario is that we want to apply this model to help
us decide whether to buy a stock on the chance that prices are going up.
We are going to separate three different classes:
Prices go up by x.
Prices stay the same.
Prices go down by x.
After this, we have the thresholded y values for training and testing
(validation).
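One minimal way to produce these thresholded classes; the 1% threshold and the y_train/y_test variable names are purely illustrative:
import numpy as np

threshold = 0.01  # x: a 1% move
y_train_classes = np.digitize(y_train, bins=[-threshold, threshold])  # 0: down, 1: flat, 2: up
y_test_classes = np.digitize(y_test, bins=[-threshold, threshold])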
In the evaluation, we calculate and print the Area Under the Curve
(AUC) as the performance measure. We create a function,
measure_perf(), which measures performance and prints out relevant
metrics, given a model such as this:
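A sketch of such a function, assuming a fitted classifier that exposes predict_proba() and the thresholded validation labels:
from sklearn.metrics import roc_auc_score

def measure_perf(model, X_test, y_test):
    """Print a one-vs-rest AUC on the validation set."""
    probabilities = model.predict_proba(X_test)
    auc = roc_auc_score(y_test, probabilities, multi_class='ovr')
    print(f'AUC: {auc:.3f}')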
We can use our new method now to evaluate the performance after
training our models.
We find that neither Platt scaling (logistic regression) nor isotonic regression can
deal well with our dataset. The Naive Bayes classifier doesn't get much better than
50% AUC, which is nothing that we'd want to bet our money on, even if it's slightly
better than random choice. However, the complement Naive Bayes classifier
performs much better, at 59% AUC.
How it works...
We've seen that we can create a predictor for stock prices. We've broken this
down into creating data, and validating and training a model. In the end, we
found a method that would give us hope that we could actually use it in practice.
Let's go through our data generation first, and then over our different methods.
Featurization
This is central to any work in artificial intelligence. Before doing any work or
the first we look at our dataset, we should ask ourselves what we choose as the
unit of our observational units, and how are we going to describe our points in a
way that's meaningful and can be captured by an algorithm. This is something
that becomes automatic with experience.
There are more concerns that we have to address. We've already seen a few
methods for data treatment in time series, in the Forecasting CO2 time series
recipe in Chapter 2, Advanced Topics in Supervised Machine Learning. In
particular, stationarity and normalization are concerns that are shared in this
recipe as well (you might want to flip back and have a look at the explanation
there).
We'll look next at Platt scaling, which is one of the simplest ways of scaling
model predictions to get probabilistic outcomes.
Platt scaling
Platt scaling (John Platt, 1999, Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods) is the first method of
scaling model outcomes that we've used. Simply stated, it's applying logistic
regression on top of our classifier predictions. The logistic regression can be
expressed as follows (equation 1):
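In the context of Platt scaling, the logistic function is fitted on top of the raw classifier output f(x), with parameters A and B learned on a held-out set; schematically:
$P(y = 1 \mid x) = \dfrac{1}{1 + \exp\bigl(A \cdot f(x) + B\bigr)}$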
Isotonic regression
Isotonic regression (Zadrozny and Elkan, 2001, Learning and Making Decisions
When Costs and Probabilities are Both Unknown) is regression using an isotonic
function – that is, a function that is monotonically increasing or non-decreasing,
as a function approximation while minimizing the mean squared error.
We can express this as follows:
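Schematically, with the symbols defined just below, the fit can be written as follows:
$\hat{y} = m\bigl(f(x)\bigr) + \epsilon, \qquad m = \arg\min_{m'} \sum_i \bigl(y_i - m'(f(x_i))\bigr)^2 \ \text{ with } m' \text{ non-decreasing}$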
Here m is our isotonic function, x and y are features and target, and f is our
classifier.
Next, we'll look at one of the simplest probabilistic models, Naive Bayes.
Naive Bayes
It is called naive because it assumes that features are independent of each other,
so the numerator can be simplified as follows:
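Under the independence assumption, the numerator of Bayes' rule factorizes, giving:
$P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$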
In the next section, we'll look at additional material.
See also
For Platt scaling, refer to Probabilistic Outputs for Support Vector Machines
and Comparisons to Regularized Likelihood Methods by John Platt, (1999).
For isotonic regression, as in our application for probability estimates in
classification, please refer to Transforming classifier scores into accurate
multi-class probability estimates by Zadrozny, B. and Elkan, C., (2002).
For a comparison between the two, refer to Predicting Good Probabilities
with Supervised Learning by A. Niculescu-Mizil & R. Caruana, ICML,
(2005). Refer to Rennie, J. D. and others, Tackling the Poor Assumptions of
Naive Bayes Text Classifiers (2003), on the complement Naive Bayes
algorithm.
The scikit-learn documentation gives an overview of confidence calibration
(https://scikit-
learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-
glr-auto-examples-calibration-plot-calibration-curve-py).
For an approach applied to deep learning models, see the ICLR 2018 paper by
Lee and others, Training Confidence-Calibrated Classifiers for Detecting
Out-of-Distribution Samples (https://arxiv.org/abs/1711.09325). Their code is
available on GitHub at https://github.com/alinlab/Confident_classifier.
You can find more examples of probabilistic analyses of time series with
different frameworks at the following links:
Getting ready
We'll need the lifetimes package for this recipe. Let's install it as shown in the
following code:
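For example, from a terminal or notebook cell:
pip install lifetimes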
Now we can get started.
How to do it...
The Lifetimes models work on a summary of transactions that contains the following
quantities per customer (a sketch of how to derive them follows the list):
T: The transaction period; the elapsed time since the first purchase by the
customer
Frequency: The number of purchases by a customer within the observation
period
Monetary value: The average value of purchases
Recency: The age of the customer at the time of the last purchase
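A sketch of deriving this summary with the library's utility function; the raw transactions DataFrame and its column names are assumptions:
from lifetimes.utils import summary_data_from_transaction_data

summary = summary_data_from_transaction_data(
    transactions,                  # one row per purchase
    customer_id_col='customer_id',
    datetime_col='date',
    monetary_value_col='amount',
)
print(summary[['frequency', 'recency', 'T', 'monetary_value']].head())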
3. We can then combine the predictions of the model that predicts the number of
future transactions (bgf) and the model that predicts average purchase values
(ggf) using another of the Lifetimes library's methods. It includes a parameter
for discounting future values. We'll include a discount that corresponds to an
annualized 12.7%. We'll print five customers' lifetime values:
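Roughly, and without the book's exact parameter choices, the combination could look like this; a monthly discount_rate of 0.01 corresponds to about 12.7% annualized:
from lifetimes import BetaGeoFitter, GammaGammaFitter

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

returning = summary[summary['frequency'] > 0]  # Gamma-Gamma needs repeat customers
ggf = GammaGammaFitter(penalizer_coef=0.0)
ggf.fit(returning['frequency'], returning['monetary_value'])

clv = ggf.customer_lifetime_value(
    bgf,
    returning['frequency'], returning['recency'],
    returning['T'], returning['monetary_value'],
    time=12,             # months to project
    discount_rate=0.01,  # roughly 12.7% annualized
)
print(clv.sort_values(ascending=False).head(5))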
The output shows us the customer lifetime values:
Now we know who our best customers are, and therefore where to invest our
time and resources!
How it works...
Each customer has a value to the company. This is important for the marketing
budget – for example, in lead acquisition or ads spent based on customer
segments. The actual customer lifetime value is known after a customer has left
the company; however, we can instead build two different probabilistic
forecasting models for each customer:
The first model (in Lifetimes, the BG/NBD model, bgf) takes into account the purchasing
frequency of customers and the dropout probability of customers.
The second model (the Gamma-Gamma model, ggf) is used to estimate the mean transaction
value over the customer's lifetime, E(M), for which we have an imperfect estimate.
See also
This recipe was relatively short because of the excellent work that's been put into
the Lifetimes library, which makes a lot of the needed functionality plug-and-
play. An extended explanation of this analysis can be found in the Lifetimes
documentation (https://lifetimes.readthedocs.io/en/latest/Quickstart.html).
The Lifetimes library comes with a range of models (called fitters), which you
might want to look into. You can find more details about the two methods in this
recipe in Fader and others, Counting your Customers the Easy Way: An
Alternative to the Pareto/NBD Model, 2005, and Batislam and others, Empirical
validation and comparison of models for customer base analysis, 2007. You can
find the details of the Gamma-Gamma model in Fader and Hardi's report,
Gamma-Gamma Model of Monetary Value (2013).
The Google Cloud Platform GitHub repo shows a model comparison for
estimation of customer lifetime values
(https://github.com/GoogleCloudPlatform/tensorflow-lifetime-value) that
includes Lifetimes, a TensorFlow neural network, and AutoML. You can find a
very similar dataset of online retail in the UCI machine learning archive
(http://archive.ics.uci.edu/ml/datasets/Online+Retail).
Diagnosing a disease
For probabilistic modeling, experimental libraries abound. Running probabilistic
networks can be much slower than non-probabilistic (purely algorithmic) approaches,
which until not long ago rendered them impractical for anything but very small
datasets. In fact, most of the tutorials and examples relate to toy datasets.
However, this has changed in recent years due to faster hardware and variational
inference. With TensorFlow Probability, it is often straightforward to define
architectures, losses, and layers – even probabilistic sampling layers – with full GPU
support and state-of-the-art implementations that support fast training.
Getting ready
How to do it...
Instead of squashing the output with a 'sigmoid' or 'softmax' activation, as we would do in
binary classification tasks, we'll reduce the outputs to the number of parameters our
probability distribution needs – just one in the case of the Bernoulli distribution,
whose single parameter is the expected mean of the binary outcome.
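A minimal sketch of such a model head with tensorflow-probability; the layer sizes and optimizer settings are illustrative rather than the book's exact architecture:
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
negloglik = lambda y, distribution: -distribution.log_prob(y)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1),  # one output: the single Bernoulli parameter (as a logit)
    tfp.layers.DistributionLambda(lambda t: tfd.Bernoulli(logits=t)),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss=negloglik)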
3. Model training: Now, we can train our model. We'll plot our training loss in
tensorboard and we'll enable early stopping:
This will run for 2,000 epochs, and it might take a while to complete.
From TensorBoard, we can see the training loss over epochs:
4. Validating the model: We can now sample from the model. Each network
prediction gives us a mean and variance. We can have a look at a single
prediction. We've arbitrarily chosen prediction number 10:
This prediction looks as follows:
In the following code segment, we'll look in more detail at the results that we
get:
This visualizes our results in order to give us a better understanding of the
trade-off between precision and recall.
This curve visualizes the trade-off inherent in our model, between recall and
precision. Given different cutoffs on our confidence (or class probability), we
can make a call about whether someone is ill or not. If we want to find everyone
(recall=100%), precision drops down to below 40%. On the other hand, if we
want to be always right (precision=100%) when we diagnose someone as ill,
then we'd miss everyone (recall=0%).
It's now a question of the cost of, respectively, missing people or diagnosing too
many, to make a decision on a cutoff for saying someone is ill. Given the
importance of treating people, perhaps there's a sweet spot around 90% recall
and around 65% precision.
How it works...
Aleatoric uncertainty
TensorFlow Probability comes with layer types for modeling different types of
uncertainty. Aleatoric uncertainty refers to the stochastic variability of our
outcomes given the same input – in other words, we can learn the spread in our
data.
Negative log-likelihood
We use negative log-likelihood as our loss. This loss is often used in maximum
likelihood estimation.
Bernoulli distribution
Metrics
See also
In this recipe, you've seen how to use a probabilistic model for a health
application. There are many other datasets and many different ways of doing
probabilistic inference. Please see TensorFlow Probability as one of the
frameworks in probabilistic modeling that has the most traction
(https://www.tensorflow.org/probability). It comes with a wide range of
tutorials.
Getting ready
How to do it...
Let's get the dataset and preprocess it, then we create the model, train the model,
and validate it:
1. Download and prepare the dataset: The dataset that we'll use for this recipe
was published in 2009 (I-Cheng Yeh and Che-hui Lien, The comparisons of
data mining techniques for the predictive accuracy of probability of default of
credit card clients), and originally hosted on the UCI machine learning
repository at
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
We'll download the data with openml using scikit-learn's utility function:
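A sketch of the download and standard preprocessing; the OpenML dataset name and the target encoding are assumptions that may differ from the notebook:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

bunch = fetch_openml(name='default-of-credit-card-clients', as_frame=True)
X = bunch.data.select_dtypes('number')
y = (bunch.target.astype(int) == 1).astype(int)  # 1: default, 0: no default

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)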
We'll use a very standard process of preprocessing that we've seen many
times before in this book and that we'll largely breeze through. We could
have examined the features more, or done some more work on
transformations and feature engineering, but this is beside the point of
this recipe.
2. Create a model: First, we need the priors and posteriors. This is done by
directly following the online TensorFlow Probability tutorial
(http://www.github.com/tensorflow/probability/blob/master/tensorflow_proba
bility/examples/jupyter_notebooks/Probabilistic_Layers_Regression.ipynb),
and is appropriate for normal distributions:
Please note DenseVariational.
Now to the main model, where we'll use the priors and posteriors. You'll
recognize DistributionLambda. We've replaced the Bernoulli distribution from the
previous recipe, Diagnosing a disease, with Normal, which will give us
an estimate of the variance of predictions:
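A sketch that closely follows the linked tutorial; X_train refers to the preprocessed training matrix, and the scaling constants are the tutorial's defaults rather than the book's exact values:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.0))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n], scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior_trainable(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1.0),
            reinterpreted_batch_ndims=1)),
    ])

model = tf.keras.Sequential([
    tfp.layers.DenseVariational(
        2, posterior_mean_field, prior_trainable,
        kl_weight=1 / X_train.shape[0]),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.01 * t[..., 1:]))),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=lambda y, dist: -dist.log_prob(y))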
After fitting, we apply the model to our test dataset and obtain
predictions for it.
We get around 70% AUC. Since this summary figure often doesn't tell a
full story, let's look at the confusion matrix as well:
This concludes our recipe. It is left as an exercise for you to try better
preprocessing, tweak the model more, or switch the distributions.
How it works...
Models for credit scoring often use logistic regression models, which we've
encountered in the Predicting stock prices with confidence recipe in this chapter.
Alternatively, boosted models or interpretable decision trees are also in use.
Given the ability to do online learning and to represent residual uncertainties,
tensorflow-probability offers itself as another practical alternative.
In this recipe, we've created a probabilistic credit default prediction model that
works with epistemic uncertainty. It's time to explain what that means.
Epistemic uncertainty
See also
There are other routes to explore as well, such as libraries or additional material,
which we will list here.
As for tutorials, the Open Risk Manual offers open resources for credit scoring
in Python: https://www.openriskmanual.org/wiki/Credit_Scoring_with_Python.
We'll be dealing with various techniques in this chapter, including logic solvers,
graph embeddings, genetic algorithms (GA), particle swarm optimization
(PSO), SAT solvers, simulated annealing (SA), ant colony optimization, multi-
agent systems, and Monte Carlo tree search.
In this recipe, we'll go through two examples for each of these possibilities.
From Aristotle to Linnaeus to today's mathematicians and physicists, people have tried to put order into the
world by categorizing objects into a systematic order, called a taxonomy. Mathematically, taxonomies are
expressed as graphs, which represent information as tuples (s, o), in which a subject s is connected to an
object o, or as triplets (s, p, o), in which the subject s is related via a predicate p to the object o. A frequently used type of
taxonomy is the ISA taxonomy, where relationships are of the type is-a. For example, a car is a vehicle.
Getting ready
In this recipe, we'll use a logic solver interfaced from the nltk (natural
language toolkit) library from Python, and then use the graph libraries known as
networkx and karateclub.
The pip command you'll need to use to download these libraries is as follows:
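For example:
pip install nltk networkx karateclub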
For the second part of this recipe, we'll also need to download the zoo dataset
from Kaggle, which is available at https://www.kaggle.com/uciml/zoo-animal-
classification.
How to do it...
Logical reasoning
In this part of this recipe, we'll look at a simple example of logical reasoning
using libraries bundled with the nltk library. There are many other ways to
approach logical inference, some of which we'll look at in the See also... section
at the end of this recipe.
We'll use a very simple toy problem that you could find in any introductory
(101) logic textbook, though a more complex approach to such problems
could be taken.
Our problem is well-known: if all men are mortal, and Socrates is a man, is
Socrates mortal?
We can express this very naturally in nltk, as shown here:
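A minimal sketch with nltk's resolution prover; the predicate names are our own choice:
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read_expr = Expression.fromstring
premises = [
    read_expr('all x.(man(x) -> mortal(x))'),  # all men are mortal
    read_expr('man(socrates)'),                # Socrates is a man
]
goal = read_expr('mortal(socrates)')           # is Socrates mortal?
print(ResolutionProver().prove(goal, premises, verbose=True))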
The reasoning provided by the solver can also be read naturally, so we won't
explain this here. We'll learn how this works internally in the How it works...
section.
Knowledge embedding
In this part of this recipe, we'll try to make use of how information is interrelated
by embedding it into a multidimensional space that can serve as part of
featurization.
Here, we'll load the data, preprocess it, embed it, and then test our embedding by
classifying species, given their new features. Let's get started:
1. Dataset loading and preprocessing: First, we'll load the zoo dataset into
pandas, as we've done many times already. Then, we'll make sure that the
binary columns are represented as bool instead of int:
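A sketch of this loading step; the file name and the 'class_type' target column are assumptions based on the Kaggle dataset:
import pandas as pd

zoo = pd.read_csv('zoo.csv')  # the file downloaded from Kaggle
binary_columns = [
    column for column in zoo.columns
    if set(zoo[column].unique()) <= {0, 1} and column != 'class_type'
]
zoo[binary_columns] = zoo[binary_columns].astype(bool)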
The zoo dataset contains 101 animals, each with features describing
whether it, for example, has hair or produces milk. Here, the target class
is the biological class of the animal.
Now, we can load this dataset into a graph using the networkx API:
The Vocabulary class is a wrapper for the label2id and id2label
dictionaries. We need this because some graph embedding algorithms don't
accept string names for nodes or relationships. Here, we converted the
concept labels into IDs before storing them in the graph.
Now, we can embed the graph numerically with different algorithms. We'll
use Walklets here:
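A rough sketch with karateclub; the hyperparameters differ from the notebook's, and karateclub expects the graph nodes to be labeled with consecutive integers starting at 0:
from karateclub import Walklets

model = Walklets(dimensions=5, window_size=1)  # illustrative hyperparameters
model.fit(graph)                               # 'graph' is the networkx graph built above
embedding = model.get_embedding()
print(embedding.shape)                         # (number of nodes, embedding dimension)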
The preceding code shows that every concept in the graph will be
represented by a 5-dimensional vector.
Now, we can test whether these features are useful for predicting the target
(the animal's biological class):
The output looks as follows:
This looks quite good, though the technique only becomes really interesting if
we have a knowledge base that goes beyond our training set. It is hard to show
graph embedding without loading millions of triplets or huge graphs. We'll
mention a few large knowledge bases in the following section.
How it works...
In this section, we'll look at the basic concepts behind this recipe, as well as its
corresponding methods. First, we'll cover logical reasoning and logic provers,
before looking at knowledge embedding and graph embedding with Walklets.
Logical reasoning
An expert system is a reasoning system that emulates the decision-making abilities of human experts. Expert
systems are designed to solve complex problems by reasoning through bodies of knowledge, represented
mainly as if-then-else rules (this is called a knowledge base).
For example, the reasoning Socrates is a man; all men are mortal; therefore,
Socrates is mortal can be expressed as a logical statement in first-order
logic, as follows:
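Schematically (the exact notation varies), the syllogism reads:
$\forall x\,\bigl(\mathit{man}(x) \rightarrow \mathit{mortal}(x)\bigr),\;\; \mathit{man}(\mathit{socrates}) \;\vdash\; \mathit{mortal}(\mathit{socrates})$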
Automated theorem proving is a wide field that includes work based on logical
theorems and mathematical formulas. We've already looked at the problem of
proving a first-order logic equation that consists of logical equations. A search
algorithm is combined with logical equations so that the satisfiability of a
propositional formula can be decided on (see the Solving the n-queens problem
recipe in this chapter), as well as the validity of a sentence, given a set of
axioms. The Resolution Theorem Prover in nltk provides other functionality,
such as unification, subsumption, and question answering (QA):
http://www.nltk.org/howto/resolution.html.
Knowledge embedding
w(t) refers to the current word (or concept), while w(t-2), w(t-1), w(t+1), and
w(t+2) refer to two words before and after, respectively. We predict the word
context based on the current word. As we've already mentioned, the size of the
context (window size) is a hyperparameter of the skipgram algorithms.
The Walklet algorithm performs well on large graphs and – since it's a neural
network – can be trained online. You can find out more about Walklets in the
2017 paper by Brian Perozzi and others, Don't Walk, Skip! Online Learning of
Multi-scale Network Embeddings (https://arxiv.org/abs/1605.02115).
See also
The following are libraries that can be used for logical inference in Python:
SymPy: https://docs.sympy.org/latest/modules/logic.html
Kanren logic programming: https://github.com/logpy/logpy
PyDatalog: https://sites.google.com/site/pydatalog/
We've been following the inference guide in nltk for this recipe. You can find
more tools at the official nltk website:
http://www.nltk.org/howto/inference.html.
KarateClub: https://karateclub.readthedocs.io/en/latest/index.html
pykg2vec: https://github.com/Sujit-O/pykg2vec
PyTorch BigGraph (by Facebook Research):
https://github.com/facebookresearch/PyTorch-BigGraph
GraphVite: https://graphvite.io/
AmpliGraph (by Accenture): https://docs.ampligraph.org/
pyRDF2Vec: https://github.com/IBCNServices/pyRDF2Vec
Some resources for reasoning about the real world and/or with common sense
are as follows:
ActionCores: http://www.actioncores.org/apidoc.html#pracinference
KagNet: https://github.com/INK-USC/KagNet
Allen AI Commonsense Knowledge Graphs:
https://mosaic.allenai.org/projects/commonsense-knowledge-graphs
Commonsense Reasoning Problem Page at NYU CS:
http://commonsensereasoning.org/problem_page.html
There are several large real-world knowledge databases available, such as the
following:
Wikidata: https://www.wikidata.org/
Conceptnet5: https://github.com/commonsense/conceptnet5
The Open Multilingual Wordnet: http://compling.hss.ntu.edu.sg/omw/
Yago: https://github.com/yago-naga/yago3
Solving the n-queens problem
In mathematical logic, satisfiability is about whether a formula can be valid
under some interpretation (parameters). We say that a formula is unsatisfiable if
it can't be true under any interpretation. A Boolean satisfiability problem, or
SAT, is all about whether a Boolean formula is valid (satisfiable) under any of
the values of its parameters. Since many problems can be reduced to SAT
problems, and solvers and optimizations for it exist, it is an important class of
problems.
SAT problems have been proven to be NP-complete. NP (short for nondeterministic polynomial time) is the
class of problems whose solutions can be verified in polynomial time; NP-complete problems are the hardest
problems in this class. Note that this doesn't mean that a solution can be found quickly, only that a candidate
solution can be verified quickly. NP-complete problems are often approached with search heuristics and
approximation algorithms.
In this recipe, we'll address a SAT problem in various ways. We'll take a
relatively simple and well-studied case known as the n-queens problem, where
we try to place queens on a chessboard of n by n squares so that any column,
row, and diagonal can only take, at most, one queen.
First, we'll apply a GA, then PSO, and then a specialized SAT solver.
Getting ready
We'll be using the dd solver for one of the approaches in this recipe. To install it,
we also need the omega library. We can get both by using the pip command, as
follows:
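For example:
pip install dd omega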
We'll use the dd SAT solver libraries later, but first, we'll look at some other
algorithmic approaches.
How to do it...
Genetic algorithm
First, we'll define how a chromosome is represented and how it can mutate.
Then, we'll define a feedback loop for testing these chromosomes and changing
them. We'll explain the algorithm itself in the How it works... section, toward the
end of this recipe. Let's get started:
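1. Defining the chromosome: as a rough sketch (the book's actual class is richer), a chromosome for the n-queens problem can hold one column index per row, score itself, and mutate:
import random

class Chromosome:
    """One candidate solution: column positions of the queens, one per row."""
    def __init__(self, n, genes=None):
        self.n = n
        self.genes = genes or [random.randrange(n) for _ in range(n)]

    def fitness(self):
        """Fraction of non-attacking queen pairs; 1.0 means a valid solution."""
        pairs = self.n * (self.n - 1) // 2
        conflicts = 0
        for i in range(self.n):
            for j in range(i + 1, self.n):
                same_column = self.genes[i] == self.genes[j]
                same_diagonal = abs(self.genes[i] - self.genes[j]) == j - i
                if same_column or same_diagonal:
                    conflicts += 1
        return (pairs - conflicts) / pairs

    def mutate(self):
        """Randomly change one gene (move one queen to a random column)."""
        self.genes[random.randrange(self.n)] = random.randrange(self.n)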
2. Writing the main algorithm: The GA for the n-queens problem is as follows
(we've omitted the visualization here):
This class contains the population of chromosomes and can have
methods applied to it (population control, if you like), such as what
parents to use (get_parents()) and mating them (cross_over()). Take
note of the iterate() method, which is where the main logic is
implemented. We'll comment on the main decisions we've made here in
the How it works... section.
If we run the preceding code, we'll get a single run that looks like this
(yours may look different):
The following plot shows the fitness of the best chromosome at each
iteration of the algorithm:
Here, we can see that the fitness of the algorithm doesn't always improve; it
can also go down. We could have chosen to keep the best chromosome here.
In that case, we wouldn't have seen any decline (but the potential downside is
that we could have ended up in a local minimum).
In this part of this recipe, we'll be implementing a PSO algorithm for the n-
queens problem from scratch. Let's get started:
We are going to use the same cost function that we defined for the GA.
This cost function tells us how well our particles fit the given problem –
in other words, how good a property vector is.
We'll wrap our initialization and the main algorithm into a class:
The get_best_particle() method returns the best configuration and the
best score. Take note of the iterate() method, which updates our
particles and returns the best particle, along with its score. Details
regarding this update are provided in the How it works... section. The
optimization process itself is done using a few formulas that are
relatively simple.
We'll also want to display our solutions. The code for showing the board
positions is as follows:
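A minimal text-based version of such a display function (the notebook's version is graphical):
def show_board(positions):
    """Print a text chessboard; positions[i] is the column of the queen in row i."""
    n = len(positions)
    for row in range(n):
        print(''.join('Q ' if positions[row] == col else '. ' for col in range(n)))
    print()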
As we mentioned previously, we'll explain how all of this works in the How
it works... section.
You can view the output of the algorithm being run with n = 8 at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/blob/master/chapter05/solving-n-queens.md.
In the following plot, you can see the quality of the solutions over our
iterations:
Since all the particles maintain their own records of the best solution, the score
can never decline. At iteration 1,323, we reached a solution and the algorithm
stopped.
SAT solver
This is heavily based on the example that can be found in the dd library,
copyright of California Institute of Technology, at https://github.com/tulip-
control/dd/blob/0f6d16483cc13078edebac9e89d1d4b99d22991e/examples/quee
ns.py.
Basically, there's one formula that incorporates all the constraints. Once all the
constraints have been satisfied (or the conjunction of all the constraints), the
solution is found:
The following is our example solution for the eight queens problem:
This solver not only got all the solutions (we only visualized one of them) but
was also about twice as fast as the GA!
How it works...
In this section, we'll explain the different approaches we employed in this recipe,
starting with the GA.
Genetic algorithm
A chromosome can calculate its own fitness; here, we used the same cost
function that we used previously, but this time, we scaled it to be between 0 and
1, where 1 means we found a solution and anything in-between shows how close
we are to getting a solution. A chromosome can also mutate itself; that is, it can
randomly change one of its values.
We've expressed the last step very loosely here. Basically, we can decide when
the fitness is high enough and how many times we want to iterate. These are our
stopping criteria.
The main hyperparameters and major decisions that must be made for the GA
are as follows:
As we can see, the GA is quite flexible and very intuitive. In the next section,
we'll look at PSO.
PSO takes a few parameters, as follows (most of these were named in our
implementation; here, we're omitting the ones that are specific to our
n-queens problem):
In our PSO problem, there were two deltas, delta_p and delta_g, where p and g
stand for particle and global, respectively. This is because one of them is
calculated with respect to the particle's historic best, and the other is calculated
with respect to the global best across all particles.
Here, rp and rg are random numbers and phip and phig are the local and global
factors, respectively. They refer to either a unique particle or all the particles, as
shown in the delta_p and delta_g variables.
There's also another parameter, omega, that regulates the decay of the current
velocities. At each iteration, the new velocities are calculated according to the
following formula:
Note that the algorithm is sensitive to what's chosen for phip, phig, and omega.
Our cost function (or goodness function) calculates the score for a particle
according to a given configuration of queens. This configuration is represented
as a list of indexes in the range [0, N-1]. For each pair of queens, the function
checks whether they conflict diagonally, vertically, or horizontally.
Each non-conflicting check awards a point, so the maximal score is reached when no queen attacks another.
SAT solver
There are lots of different ways specialized satisfiability (SAT) solvers work. A
survey by Weiwei Gong and Xu Zhou (2017) provides a broad overview of the
different approaches: https://aip.scitation.org/doi/abs/10.1063/1.4981999.
The dd solver, which we used in our recipe, works using binary decision
diagrams (BDD), which were introduced by Randal Bryant (Graph-based
algorithms for Boolean function manipulation, 1986). Binary decision diagrams
(sometimes called branching programs) are constraints represented as Boolean
functions as opposed to other encodings, such as negation normal.
This means that we can represent problems as binary trees or, equivalently, as
truth tables.
To illustrate this, let's look at an example. We can enumerate all the states over
our binary variables (x1, x2, and x3) and then come up with a final state that's the
result of f. The following truth table summarizes the states of our variables, as
well as our function evaluation:
(The accompanying truth table lists every combination of x1, x2, and x3, together with the resulting value of f.)
See also
There are lots of other SAT solvers in Python, some of which are as follows:
A discussion of the SAT solver, when applied to Sudoku, can be found here:
https://codingnest.com/modern-sat-solvers-fast-neat-underused-part-1-of-n/.
An example of Z3 for the Knights and Knaves problem can be found here:
https://jamiecollinson.com/blog/solving-knights-and-knaves-with-z3/.
Getting ready
Apart from standard dependencies such as scipy and numpy, which we always
rely on, we'll be using the scikit-opt library, which implements many different
algorithms for swarm intelligence.
Swarm intelligence is the collective behavior of decentralized, self-organized systems that leads to the
emergence of apparent intelligence in the eyes of an observer. This concept is used in work based on
artificial intelligence. Natural systems, such as ant colonies, bird flocking, hawks hunting, animal herding,
and bacterial growth, display a certain level of intelligence at the global level, even though ants, birds, and
hawks typically exhibit relatively simple behavior. Swarm algorithms, which are inspired by biology,
include the genetic algorithm, particle swarm optimization, simulated annealing, and ant colony
optimization.
First, we need to create a list of coordinates (longitude, latitude) for bus stops.
The difficulty of the problem depends on the number of stops (N). Here, we've
set N to 15:
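A sketch with random stand-in coordinates (the notebook uses real bus-stop locations):
import numpy as np
from scipy.spatial import distance_matrix as pairwise_distance_matrix

np.random.seed(0)
N = 15
points = np.random.rand(N, 2)  # stand-in (longitude, latitude) pairs
distance_matrix = pairwise_distance_matrix(points, points)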
We can feed this distance matrix into the two algorithms to save time.
Simulated annealing
In this subsection, we'll write our algorithm for finding the shortest bus route.
This is based on Luke Mile's Python implementation of simulated annealing,
when applied to the traveling salesman problem:
https://gist.github.com/qpwo/a46274751cc5db2ab1d936980072a134. Let's get
started:
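A simplified sketch of the idea, loosely based on the gist linked above: it uses a simple linear cooling schedule and recomputes the full tour length, whereas the notebook cools logistically and only recalculates the affected segments:
import random
import numpy as np

def tour_length(points, tour):
    """Total length of a closed tour over the given points."""
    return sum(np.linalg.norm(points[tour[i]] - points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def simulated_annealing(points, n_iter=100_000):
    n = len(points)
    tour = list(range(n))
    random.shuffle(tour)
    cost = tour_length(points, tour)
    for step in range(n_iter):
        temperature = 4 * (1 - step / n_iter) + 1e-9  # cools from 4 towards 0
        i, j = sorted(random.sample(range(n), 2))
        tour[i], tour[j] = tour[j], tour[i]           # try swapping two stops
        new_cost = tour_length(points, tour)
        accept = new_cost < cost or random.random() < np.exp((cost - new_cost) / temperature)
        if accept:
            cost = new_cost
        else:
            tour[i], tour[j] = tour[j], tour[i]       # undo the swap
    return tour, cost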
We can also plot the internal distance measure of the algorithm. Please
note how this internal cost function goes down all the time until about
800,000 iterations:
Now, let's try out ant colony optimization.
Here, we're loading the implementation from a library. We'll explain the details
of this in the How it works... section:
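Roughly, using scikit-opt's ant colony implementation (the population size and iteration count are illustrative):
from sko.ACA import ACA_TSP

def total_distance(routine):
    """Length of the closed tour encoded by the index sequence 'routine'."""
    return sum(distance_matrix[routine[i], routine[(i + 1) % len(routine)]]
               for i in range(len(routine)))

aca = ACA_TSP(func=total_distance, n_dim=N,
              size_pop=50, max_iter=200,
              distance_matrix=distance_matrix)
best_path, best_distance = aca.run()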
We are using the distance calculations based on the point distances
(distance_matrix) we retrieved previously.
Again, we can plot the best path and the path distance over iterations, as follows:
Once again, we can see the final path, which is the result of our optimization (the
subplot on the left), as well as the distance as it goes down over iterations of the
algorithm (the subplot on the right).
How it works...
Our problem, the traveling salesman problem (TSP), is an instance of combinatorial optimization. Combinatorial, in this
case, means that there are a finite number of options. The intelligence part of combinatorial optimization
goes into either reducing the search space or accelerating the search. The traveling salesman problem, the
minimum spanning tree problem, the marriage problem, and the knapsack problem are all applications of
combinatorial optimization.
The TSP can be stated as follows: given a list of towns to visit, which is the
shortest path that traverses all of them and leads back to the point of origin? The
TSP has applications in domains such as planning, logistics, and microchip
design.
Now, let's take a look at simulated annealing and ant colony optimization in
more detail.
Simulated annealing
In this recipe, we randomly initialized our city tour and then iterated with
simulated annealing. The main idea of SA is that the rate of change depends on
a certain temperature. In our implementation, we decreased the temperature
logistically from 4 to 0. In each iteration, we tried swapping two random bus stops
(we could have tried other operations), indexes i and j in our path (tour), where
i < j, and then we calculated the sum of the affected distances: from i-1 to i, from i to i+1,
from j-1 to j, and from j to j+1 (see calc_distance). We also needed a distance
measure for calc_distance; we chose the Euclidean distance here, but we
could have chosen others.
The temperature gets factored in when we need to decide whether to accept the
swap. We calculate the exponential of the difference in path length before and
after the swap:
Then, we draw a random number. We accept the change if this random number
is lower than our expression; otherwise, we undo it.
As the name suggests, ant colony optimization is inspired by ant colonies. Let's
use pheromones, which are secreted by ants as they follow a path, as an analogy:
here, the agents have candidate solutions that are more attractive the closer they
get to the solution.
In general, ant number k moves from state x to state y with the following
probability:
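The standard form of this transition probability (our library hides it behind its API; the notation follows the common convention) is:
$p^{k}_{xy} = \dfrac{\tau_{xy}^{\alpha}\,\eta_{xy}^{\beta}}{\sum_{z \in \mathrm{allowed}} \tau_{xz}^{\alpha}\,\eta_{xz}^{\beta}}$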
Tau is the pheromone trail deposited between x and y, and its exponent alpha
controls the influence of the pheromone; eta to the power of beta expresses the
desirability of the state transition (for example, one over the cost of the transition). Pheromone
trails are updated according to how good the overall solution that included the
state transition was.
The scikit-opt function does the heavy lifting here. We only have to pass a
few parameters, such as the distance function, the number of points, the number
of ants in the population, the number of iterations, and the distance matrix,
before calling run().
See also
You can also solve this problem as a mixed-integer problem. The Python-MIP
library solves mixed-integer problems, and you can find an example for the TSP
at https://python-mip.readthedocs.io/en/latest/examples.html.
The TSP can be solved with a Hopfield Network as well, as explained in this
tutorial:
https://www.tutorialspoint.com/artificial_neural_network/artificial_neural_netw
ork_optimization_using_hopfield.htm. A cuckoo search approach is discussed
here: https://github.com/Ashwin-Surana/cuckoo-search.
Differential evolution
Genetic algorithm
Particle swarm optimization
Simulated annealing
Ant colony algorithm
Immune algorithm
Artificial fish swarm algorithm
As we mentioned in the introduction to this recipe, transport logistics has its own
application in the TSP, even in its purest form. A dataset of 30,000 public buses,
minibuses, and vans in Mexico is available at https://thelivinglib.org/mapaton-
cdmx/.
Regarding Covid-19, to libertarians, Sweden was, for some time, the poster child
for how you didn't need a lockdown, although secondary factors such as having a
high proportion of single-person households and a cultural tendency to social
distance weren't taken into account. Recently, fatalities in Sweden have been on
the rise, and its per capita death rate is one of the highest recorded
(https://www.worldometers.info/coronavirus/).
In the UK, the initial response was to rely on herd immunity, and the lockdown
was declared only weeks after other countries had already imposed theirs. The
National Health Service (NHS) was using makeshift beds and renting beds in
commercial hospitals because it didn't have the capacity to cope.
A multi-agent system (MAS) is a computer simulation consisting of participants known as agents. The
individual agents can respond heuristically or based on reinforcement learning. Collectively, the behavior
of these agents responding to each other and to the environment can be applied to study a range of topics.
In this recipe, a relatively simple, multi-agent simulation will show you how
different responses can cause a difference in the number of fatalities, and the
spread, of a pandemic.
Getting ready
We'll be using the mesa multi-agent modeling library to implement our multi-
agent simulation.
How to do it...
This simulation is based on work by Maple Rain Research Co., Ltd. For this
recipe, we've made a few changes regarding introducing factors such as hospital
beds and lockdown policies, and we've also changed how infections and active
cases are accounted for. You can find the complete code at
https://github.com/benman1/covid19-sim-mesa.
Disclaimer: This recipe's intent is not to provide medical advice, nor are we qualified medical practitioners
or specialists.
First, we are going to define our agents through the Person class:
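A skeletal version of such an agent; the attribute names are illustrative, and the repository's class carries more state and parameters:
from mesa import Agent

class Person(Agent):
    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)
        self.alive = True
        self.infected = False
        self.immune = False
        self.in_quarantine = False
        self.time_infected = 0  # cycles since infection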
This defines an agent as a person with a health and quarantine status.
We still need a few methods to change how other properties can change. We
won't go through all of them, just the ones that should suffice for you to gain an
understanding of how everything comes together. The core thing we need to
understand is what agents do while they're infected. Basically, while infected, we
need to understand whether the agents infect others, die from the infection, or
recover:
Here, we can see quite a few variables that are defined at the model level, such
as self.model.critical_rate, self.model.hospital_factor, and
self.model.recovery_period. We'll look at these model variables in more
detail later.
Now, we need a way for our agents to record their position on the model's grid,
which in mesa is a MultiGrid (a grid that can hold multiple agents per cell):
The entry method, which is called at every cycle (iteration), is the step()
method:
Agents move at every step if they are alive. Here's what happens when they
move:
This concludes the main logic of our agents; that is, Person. Now, let's look at
how everything comes together at the model level. This can be found in the
Simulation class inside model.py.
The preceding code creates as many agents as we need. Some of them will be
infected according to the start_infected parameter. We also add the agents to
a map of cells organized in a grid.
When called, the function iterates over the agents in the model and counts the
ones whose status is infected.
Again, just like for Person, the main logic of Simulation is in the step()
method, which advances the model by one cycle:
Let's see how different lockdown policies affect deaths and the spread of the
disease over time.
We'll use the same set of variables that we used previously in these simulations.
We've set them so that they roughly correspond to the UK, scaled down by a
factor of 1,000:
We'll explain the motivation for the grid in the How it works... section.
First, let's look at the data when no lockdown was introduced. We can create this
policy if our policy function always returns False:
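For example, with a hypothetical policy callback of the following shape (the exact signature in the repository may differ):
def no_lockdown(model):
    """Never declare a national lockdown, whatever the model state."""
    return False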
The resulting graph shows our five collected variables over time:
Let's compare this to a very cautious policy of declaring lockdown every time
the death rate rises or if the rate of infected in the population rises above 20%
within (roughly) 3 weeks:
With a single lockdown, we get the following graph, which shows about 600
deaths overall:
You can change these parameters or play with the logic to create more
sophisticated and/or realistic simulations.
How it works...
The simulation is quite simple: it's composed of agents and proceeds in iterations
(called cycles). Each agent represents a part of the population.
Here, a certain population is infected with the disease. At each cycle (which
corresponds to 1 hour), infected people can go to the hospital (if there's
capacity), die, or make progress toward recovery. They can also go into
quarantine. While alive, not recovered, and not in quarantine, they can infect
other people in spatial proximity to them. When recovering, agents can become
immune.
At each cycle, agents can move around. They move to a new position if they're
not in quarantine or the national lockdown has been declared; otherwise, they
stay in place. If a person is infected, they can die, go to the hospital, recover,
infect others, or go into quarantine.
We have a lot more parameters to take into account here, such as the following:
There's more...
Since the simulations can take a long time to run, it can be very slow to try out
parameters. Instead of having to do a full run, and only then see if we get the
desired effect or not, we can use the live plotting functionality of matplotlib.
In order to get faster feedback, let's live plot the simulation loop, as follows:
This will continuously (every 10 cycles) update our plot of the simulation
parameters. Instead of having to wait for a full simulation, we can abort it if it
doesn't work out.
See also
You can find out more about mesa's multi-agent-based modeling in Python at
https://mesa.readthedocs.io/en/master/. Some other multi-agent libraries are as
follows:
MAgent specializes in 2D environments with a very large number of agents
that learn through reinforcement learning: https://github.com/PettingZoo-
Team/MAgent.
osBrain and PADE are general-purpose multi-agent system libraries. They
can be found at https://osbrain.readthedocs.io/en/stable/ and
https://pade.readthedocs.io/en/latest/, respectively.
SimPy is a discrete event simulator that can be used for a broader range of
simulations: https://simpy.readthedocs.io/en/latest/.
Other simulators have also been released, most prominently the CovidSim
microsimulation model (https://github.com/mrc-ide/covid-sim), which was
developed by the MRC Centre for Global Infectious Disease Analysis, hosted at
Imperial College, London.
In this recipe, we'll use Monte Carlo tree search to create a basic chess engine.
Getting ready
We'll use the python-chess library for visualization, to get valid moves, and to
know if a state is terminal. We can install it with the pip command, as follows:
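For example:
pip install python-chess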
We'll be using this library for visualization, to generate valid moves at each
position, and to check if we've reached a final position.
How to do it...
First, we'll look at the code we'll be using to define our tree search class, and
then look at how the search works. After that, we'll learn how this can be
adapted to chess.
Tree search
We'll look at these variables in more detail in the How it works... section. We'll
be adding more methods to this class shortly.
The different steps in our tree search are performed in our do_rollout method:
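In outline (the helper method names are plausible conventions rather than the book's exact listing), a rollout chains the four classic MCTS steps:
def do_rollout(self, node):
    """One iteration of MCTS: select, expand, simulate, backpropagate."""
    path = self._select(node)        # walk down to an unexplored (or terminal) node
    leaf = path[-1]
    self._expand(leaf)               # register the leaf's children
    reward = self._simulate(leaf)    # play random moves until the game ends
    self._backpropagate(path, reward)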
2. The expansion step adds the children nodes – the nodes that can be reached
via valid moves, given a board position:
This function updates the children dictionary with the descendants (or
children) of the node. These nodes are any valid board positions that can
be reached from the node in a single move.
3. The simulation step runs a series of moves until the game is ended:
This function plays out the simulation until the end of the game.
4. The backpropagation step associates a reward with each step of the path:
Finally, we need a way to choose the best move, which can be as simple as
going through the Q and N dictionaries and choosing the descendent with the
maximum utility (reward):
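A sketch of such a choose() method, assuming Q holds cumulative rewards and N visit counts:
def choose(self, node):
    """Pick the child with the highest average reward."""
    def score(child):
        if self.N[child] == 0:
            return float('-inf')     # never pick an unseen move
        return self.Q[child] / self.N[child]
    return max(self.children[node], key=score)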
We set the score of any unseen node to -infinity, in order to avoid choosing an
unseen move.
Implementing a node
Now, let's learn how to use a node for our chess implementation.
Now that everything has been prepared, we can finally play chess.
Playing chess
The following is just a simple loop with a graphical prompt stating the board
position:
You should then be asked to enter a move to go to a certain position on the
chessboard. After each move, a board will appear, showing the current position
of the chess pieces. This can be seen in the following screenshot:
Note that moves have to be entered in UCI notation, which specifies moves square to square (for example, e2e4).
The playing strength being used here isn't very high, but it should still be easy to
see a few improvements that you can make while playing around with it. Note that
this implementation is not parallelized.
How it works...
In Monte Carlo tree search (MCTS), we apply the Monte Carlo method –
which is basically random sampling – in order to obtain an idea of the strength
of the moves that are made by the player. For each move, we play random
moves until the game finishes. If we do this often enough, we'll get a good
estimate.
The selection step, at its most basic, looks for a node (such as a board position)
that hasn't been explored yet.
The expansion step updates the children dictionary with the children of the
selected node.
As for implementing a node, nodes must be hashable and comparable since we'll
store them in the dictionaries we mentioned previously. So, here, we need to
implement the __hash__ and __eq__ methods. We omitted them previously
since we didn't need them to understand the algorithm itself, so we've added
them here for completeness:
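A minimal version for a node that wraps a python-chess board, using the FEN string as its identity:
import chess

class ChessGame:
    def __init__(self, board=None):
        self.board = board or chess.Board()

    def __hash__(self):
        return hash(self.board.fen())

    def __eq__(self, other):
        return isinstance(other, ChessGame) and self.board.fen() == other.board.fen()

    def __repr__(self):
        return f'ChessGame({self.board.fen()!r})'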
The __repr__() method can be quite useful when you are debugging.
For the main functionality of the ChessGame class, we also need the following
methods:
Please take a look at the implementation of ChessGame again to see this in action.
There's more...
One major extension of MCTS is Upper Confidence Trees (UCTs), which are
used to balance exploration and exploitation. The first Go programs to reach dan
level on a 9x9 board used MCTS with UCT.
To implement the UCT extension, we have to go back to our MCTS class and make
a couple of changes:
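A sketch of the UCT selection rule; in this version the exploration weight is stored on the class as self.c:
import math

def _uct_select(self, node):
    """Select a child of node, balancing exploration and exploitation (UCB1 applied to trees)."""
    log_n_parent = math.log(self.N[node])

    def uct(child):
        return (self.Q[child] / self.N[child]
                + self.c * math.sqrt(log_n_parent / self.N[child]))

    return max(self.children[node], key=uct)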
Here, c is a constant.
Next, we need to replace the last line of code so that it uses _uct_select()
instead of _select() for recursion. Here, we'll replace the last line of _select()
so that it states the following:
Making this change should increase the agent's playing strength further.
See also
To find out more about UCTs, take a look at the following article on MoGO
regarding the first computer Go program to reach dan level on a 9x9 board:
https://hal.inria.fr/file/index/docid/369786/filename/TCIAIG-2008-
0010_Accepted_.pdf. It also provides a description of MCTS in pseudocode.
In this chapter, we'll start with a relatively basic use case of reinforcement
learning for website optimization with multi-armed bandits, where we'll look at
an agent and an environment, and how they interact. Then we'll move on to a
simple demonstration of control, where it gets a bit more complex, and we'll get
to see an agent environment and a policy-based method, REINFORCE. Finally,
we'll learn how to play blackjack, where we'll use a deep Q network (DQN), a
value-based algorithm that was used in the wave-making AI that could play
Atari games created by DeepMind in 2015.
Technical requirements
The full notebooks are available online on GitHub:
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/tree/master/chapter06.
Optimizing a website
In this recipe, we'll deal with website optimization. Often, it is necessary to try
changes (or better, a single change) on a website to see the effect they will have.
In a typical scenario of what's called an A/B test, two versions of the website
will be compared systematically. An A/B test is conducted by showing versions
A and B of a web page to a pre-determined number of users. Later, statistical
significance or a confidence interval is calculated in order to quantify the
differences in click-through rates, with the goal of deciding which of the two
web page variants to keep.
This example use case of website optimization will help us to introduce the
notions of agent and environment, and show us the trade-off between exploration
and exploitation. We'll explain these concepts in the How it works... section.
How to do it...
Since we are only using standard Python, we don't need to install anything, and
we can delve right into implementing our recipe:
2. Now we need to interact with this environment. This is where our agent
comes in. The agent has to make decisions, and we'll give it a strategy to
make decisions. We'll include metrics collection as well. An abstract agent
looks like this:
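A rough outline of such an abstract agent; the exact attributes in the notebook differ:
class Agent:
    """Abstract agent that interacts with an environment and collects metrics."""
    def __init__(self, env, metrics=None):
        self.env = env
        self.metrics = metrics or []   # lookup list of metric functions
        self.history = []              # choices made so far
        self.collected = {metric.__name__: [] for metric in self.metrics}

    def run_metrics(self, i):
        for metric in self.metrics:
            self.collected[metric.__name__].append(metric(self, i))

    def choose(self):
        raise NotImplementedError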
Agents will contain a lookup list of metric functions and also inherit a
metric collection functionality. We can run the metrics collection through
the run_metrics(self, i) function.
The strategy that we use here is called UCB1. We'll explain this strategy in
the How it works... section:
Our UCB1 agent needs an environment (a bandit) to interact with, and a single
parameter alpha, which weighs the importance of exploring actions (versus
exploiting the best known action). The agent maintains its history of choices
over time, and a record of estimates for each possible choice.
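A sketch of such a UCB1 agent, building on the abstract agent outlined above; the environment interface (n_actions and step()) is an assumption for illustration:
import numpy as np

class UCB1(Agent):
    def __init__(self, env, alpha=2.0, **kwargs):
        super().__init__(env, **kwargs)
        self.alpha = alpha
        self.Q = np.zeros(env.n_actions)  # running mean reward per action
        self.N = np.zeros(env.n_actions)  # times each action was chosen

    def choose(self, t):
        untried = np.flatnonzero(self.N == 0)
        if len(untried) > 0:              # play every action once first
            return int(untried[0])
        ucb = self.Q + self.alpha * np.sqrt(np.log(t) / self.N)
        return int(np.argmax(ucb))

    def run(self, n_steps=5000):
        for t in range(1, n_steps + 1):
            action = self.choose(t)
            reward = self.env.step(action)
            self.N[action] += 1
            self.Q[action] += (reward - self.Q[action]) / self.N[action]
            self.history.append(action)
            self.run_metrics(t)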
The run() method runs a series of choices and gets back the feedback from the
environment.
Let's track two metrics: regret, which is the sum of expected losses incurred
because of suboptimal choices, and – as a measure of the convergence of the agent's
estimates to the actual configuration of the environment – the Spearman
rank correlation (stats.spearmanr()).
The Spearman rank correlation is equal to the Pearson correlation (often briefly called just the
correlation) applied to the ranked scores. The Pearson correlation between two variables, X and Y, can be expressed as follows:
$\rho_{X,Y} = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$
The Spearman correlation, instead of operating on the raw scores, is calculated on the ranked scores. A
rank transformation means that a variable is sorted by value, and each entry is assigned its order. The
Spearman correlation ranges between -1 for perfectly negatively correlated variables and +1 for perfectly correlated variables;
0 means there's no correlation.
We can now track these metrics in order to compare the influence of the alpha
parameter (more or less exploration). We can then observe convergence and
cumulative regret over time:
So we have 20 different choices of web pages, and we collect regret and corr
as defined, and we run for 5000 iterations. If we plot this, we can get an idea of
how well this agent performed:
For the second run, we'll change alpha to 0.5, so we'll do less exploration:
We can see that the cumulative regret with alpha=0.5, less exploration, is much
lower than with alpha=2.0; however, the overall correlation of the estimates to
the environment parameters is lower.
So, with less exploration our agent models the true parameters of the
environment less well. This comes from the fact that with less exploration the
ordering of the lower ranked features has not converged. Even though they are
ranked as suboptimal, they haven't been chosen often enough to determine
whether they are worst or second worst, for example. This is what we see with
less exploration, and this could be fine since we might only care about knowing
which choice is best.
How it works...
We've used the Upper Confidence Bound version 1 (UCB1) algorithm (Auer
et al., Finite-time analysis of the multi-armed bandit problem, 2002), which is
easy to implement.
It works as follows:
Play each action once in order to get initial estimates for the mean rewards
(exploration phase).
For each round t, update Q(a) and N(a), and play the action a' according to
this formula:
$a' = \arg\max_a \left[ Q(a) + \alpha \sqrt{\dfrac{\ln t}{N(a)}} \right]$
where Q(a) is the lookup table for the mean reward of action a, N(a) is the number
of times action a has been played, and alpha is a parameter that weighs the exploration term.
The second term in the preceding equation quantifies the uncertainty. The lower
the uncertainty, the more we rely on Q(a). The uncertainty shrinks with the number
of times an action has been played and grows logarithmically with the total
number of rounds.
There are many variants of the bandit algorithm that address more complex
scenarios, for example, costs for switching between choices, or choices with
finite lifespans such as the secretary problem. The basic setting of the secretary
problem is that you want to hire a secretary from a finite pool of applicants. Each
applicant is interviewed in turn in random order, and a definite decision (to hire
or not) is to be made immediately after the interview. The secretary problem is
also called the marriage problem.
See also
Controlling a cartpole
The cartpole is a control task available in OpenAI Gym, and has been studied for
many years. Although it is relatively simple compared to others, it contains all
that we need in order to implement a reinforcement learning algorithm, and
everything that we develop here can be applied to other, more complex learning
tasks. It can also serve as an example of robotic manipulation in a simulated
environment. The advantage of taking one of the less demanding tasks is that
training and turnaround is quicker.
OpenAI Gym is an open source library that can help to develop reinforcement algorithms by standardizing
a broad range of environments for agents to interact with. OpenAI Gym comes with hundreds of
environments and integrations ranging from robotic control, and walking in 3D to computer games and
The cartpole task is depicted in the following screenshot of the OpenAI Gym
environment and consists of moving a cart to the left or right in order to balance
a pole in an upright position:
In this recipe, we'll implement the REINFORCE policy gradient method in
PyTorch to solve the cartpole task. Let's get to it.
Getting ready
There are many libraries that provide collections of test problems and
environments. One of the libraries with the most integrations is OpenAI Gym,
which we'll utilize in this recipe:
How to do it...
OpenAI Gym saves us work—we don't have to define the environment ourselves
and come up with reward signals, encode the environment, or state which actions
are allowed.
We'll first load the environment, define a deep learning policy for action
selection, define an agent that uses this policy to select actions to execute, and
finally we'll test how the agent performs in our task:
1. First, we'll load the environment. Every move that the pole doesn't fall over,
we get a reward. We have two available moves, left or right, and an
observation space that includes a representation of the cart position and
velocity and the pole angle and velocity, as in the following table:
The agent will create a policy network and use it to take decisions until
an end state is reached; then it will feed the cumulative rewards into the
network to learn. Let's start with the policy network.
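A small policy network sketch in PyTorch; the hidden size is arbitrary:
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps the 4-dimensional cartpole observation to probabilities over the 2 actions."""
    def __init__(self, n_inputs=4, n_hidden=64, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)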
4. Next, we'll test our agent. We'll start running our agent in the environment by
simulating interactions with the environment. In order to get a cleaner curve
for our learning rate, we'll set env._max_episode_steps to 10000. This
means the simulation stops after 10,000 steps. If we'd left it at 500, the default
value, the algorithm's performance would plateau once about 500 steps are
reached. Instead, we are trying to optimize a bit more:
We should see the following output:
While the simulations are going on, we see updates every 100
iterations with the average score since the last update. We stop once a score
of 1,000 is reached. This is our score over time:
We can see that our policy is continuously improving—the network is
learning successfully to manipulate the cartpole. Please note that your
results can vary. The network can learn more quickly or more slowly.
In the next section, we'll get into how this algorithm actually works.
How it works...
Policy gradient methods find a policy with a given gradient ascent that
maximizes cumulative rewards with respect to the policy parameters. We've
implemented a model-free policy-based method, the REINFORCE algorithm (R.
Williams, Simple statistical gradient-following algorithms for connectionist
reinforcement learning, 1992).
This is what we've done in our policy network, and this helps us to make our
action choice.
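As a rough sketch of the update at the heart of REINFORCE (not the book's exact implementation), each action's log-probability is weighted by the discounted return that followed it:
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors; rewards: list of scalar rewards."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    return -(torch.stack(log_probs) * returns).sum()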
You should be able to run our implementation on any Gym environment with
few to no changes. We've deliberately put in a few things (for example,
reshaping observations to a vector) to make it easier to reuse it; however, you
should make sure that your network architecture corresponds to the nature of
your observations. For example, you might want to use a 1D convolutional
network or a recurrent neural network for time series (such as in stock trading or
sounds) or 2D convolutions if your observations are images.
There's more...
There are a few more things that we can play around with. Firstly, we'd like to see the agent interacting with the pole, and secondly, instead of implementing an agent from scratch, we can use a library.
We can play many hundreds of games or try different control tasks. If we want
to actually watch our agent interact with the environment in a Jupyter notebook,
we can do it:
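The listing isn't reproduced in this excerpt; a minimal sketch for watching the agent inside a notebook, assuming the env and policy from the earlier sketches, might look as follows:

```python
import matplotlib.pyplot as plt
import torch
from IPython import display

obs, done = env.reset(), False
while not done:
    # render the current frame as an array and display it inline
    plt.imshow(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    # act greedily according to the learned policy
    action = policy(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
    obs, _, done, _ = env.step(action)
env.close()
```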
We should see our agent interacting with the environment now.
If you are on a remote connection (such as running on Google Colab), you might
have to do some extra work:
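On a headless machine such as Colab, the extra work typically means starting a virtual display; a hedged sketch, assuming the pyvirtualdisplay package (and the xvfb system package) is installed:

```python
from pyvirtualdisplay import Display

# start a virtual display so that env.render() has something to draw into
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```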
Instead of implementing the agent ourselves, we can also rely on a library such as RLlib, which is part of the Ray project. RLlib lets you train with either PyTorch or TensorFlow (in the version used here, via the 'torch': True option). A training run stores your agents in a local directory, so you can load them up later.
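A minimal, hedged sketch of such a run is shown below; the exact configuration keys, including the flag that switches to PyTorch, depend on the installed Ray/RLlib version:

```python
import ray
from ray import tune

ray.init()
tune.run(
    'PPO',                                     # a built-in RLlib algorithm
    stop={'episode_reward_mean': 150},         # stop once the task is roughly solved
    config={'env': 'CartPole-v0'},             # add the PyTorch flag here if desired
)
```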
See also
Please note that the installation of these libraries can take a while and might take
up gigabytes of your hard disk.
Playing blackjack
One of the benchmarks in reinforcement learning is gaming. Many different
environments related to gaming have been designed by researchers or
aficionados. A few of the milestones in gaming have been mentioned in Chapter
1, Getting Started with Artificial Intelligence in Python. The highlights for many
would certainly be beating the human champions in both chess and Go—chess
champion Garry Kasparov in 1997 and Go champion Lee Sedol in 2016—and
reaching super-human performance in Atari games in 2015.
In this recipe, we get started with one of the simplest game environments:
blackjack. Blackjack has an interesting property that it has in common with the
real world: indeterminism.
Blackjack is a card game where, in its simplest form, you play against a card dealer. You have a hand of cards in front of you, and you can hit, which means you get one more card, or stick, which means you stop taking cards and the dealer draws. In order to win, you want to get as close as possible to a card score of 21, but not surpass it.
In this recipe, we'll implement a model in Keras of the value of different actions
given a configuration of the environment, a value function. The variant we'll
implement is called the DQN, which was used in the 2015 Atari milestone
achievement. Let's get to it.
Getting ready
How to do it...
We need an agent that maintains a model of what effects its actions have. These
actions are played back from its memory for the purpose of learning. We will
start with a memory that records past experiences for learning:
1. Let's implement this memory. This memory is essentially a FIFO queue. In
Python, you could use a deque; however, we found the implementation of the
replay memory in the PyTorch examples very elegant, so this is based on
Adam Paszke's PyTorch design:
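The listing isn't reproduced in this excerpt; a bare-bones sketch of such a fixed-capacity FIFO memory (the recipe's version additionally reshapes the sampled data) might look like this:

```python
import random

class ReplayMemory:
    """A fixed-capacity FIFO memory for transitions."""
    def __init__(self, capacity=2000):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *transition):
        # overwrite the oldest entries once capacity is reached
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # draw a random subset of memories for learning
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```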
We only really need two methods:
We need to push new memories, overwriting old ones in the process if our
capacity is reached.
We need to sample memories for learning.
The latter point is worth stressing: instead of using all the memories for
learning, we only take a part of them.
In the sample() method, we made a few alterations to get our data in the
right shape.
We've omitted a method from the listing, which defines the neural
network model:
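A hedged Keras reconstruction matching the description that follows (the input dimension and optimizer settings are assumptions, not the recipe's exact values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(input_dim=3, learning_rate=0.001):
    # two hidden layers (100 and 2 units) with ReLU, and one output unit
    # estimating the value of the given state (and action)
    model = Sequential([
        Dense(100, activation='relu', input_dim=input_dim),
        Dense(2, activation='relu'),
        Dense(1),
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate))
    return model
```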
This is a three-layer neural network with two hidden layers, one with 100 neurons and the other with 2 neurons, both with ReLU activations, and an output layer with a single neuron.
3. Let's load the environment and initialize our agent. We initialize our agent
and the environment as follows:
This loads the Blackjack OpenAI Gym environment and our DQNAgent
as implemented in Step 2 of this recipe.
We can see the structure of this network (as shown by Keras's summary()
method):
For the simulation, one of our key questions is the value of the epsilon
parameter. If we set it too low, our agent won't learn anything; if we set it
too high, we'd lose money because the agent makes random moves.
4. Now let's play blackjack. We chose to steadily decrease epsilon in a linear
fashion, and then we exploit for a number of rounds. When epsilon reaches
0, we stop learning:
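The recipe's listing isn't reproduced here; the following sketch only illustrates the schedule, and the agent's method names (best_action, remember, learn) are assumptions standing in for the omitted DQNAgent from step 2:

```python
import numpy as np

# epsilon decays linearly to 0 over 4,000 rounds, then we exploit for 1,000 more
epsilons = np.hstack([np.linspace(1.0, 0.0, num=4000), np.zeros(1000)])

payouts = []
for episode, epsilon in enumerate(epsilons):
    state, done = env.reset(), False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = agent.best_action(state)    # exploit (hypothetical method)
        next_state, reward, done, _ = env.step(action)
        agent.remember(state, action, reward, next_state, done)  # hypothetical
        state = next_state
    payouts.append(reward)                       # the final reward is the payout
    if epsilon > 0:
        agent.learn()                            # replay-memory update (hypothetical)
```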
We do see an increase in the payouts over time; however, we are still below 0,
which means we lose money on average. This happens even if we stop learning
in the exploitation phase.
Our blackjack environment does not have a reward threshold at which it's
considered solved; however, a write-up lists 100 best episodes with an average
of 1.0, which is what we reach as well:
https://gym.openai.com/evaluations/eval_21dT2zxJTbKa1TJg9NB8eg/.
How it works...
In the simplest case, the action-value function Q(s, a) can be a lookup table with an entry for every state-action pair.
Consequently, the best policy can be determined by picking, in each state, the action with the highest value: π(s) = argmax_a Q(s, a).
The DQN (Mnih et al., Playing Atari with Deep Reinforcement Learning, 2013) builds on NFQ, but introduces a few changes. These include updating
parameters only in mini-batches every few iterations, based on random samples
from a replay memory. Since, in the original paper, the algorithm learned from
pixel values on the screen, the first layers of the network are convolutional (we'll
introduce these in Chapter 7, Advanced Image Applications).
See also
Here is the website for Sutton and Barto's seminal book Reinforcement
Learning: An Introduction: http://incompleteideas.net/book/the-book-2nd.html.
They've described a simple agent for blackjack in there. If you are looking for
other card games, you can have a look at neuron-poker, an OpenAI poker
environment; they've implemented DQN and other algorithms:
https://github.com/dickreuter/neuron_poker.
For more details about the DQNs and how to use it, we recommend reading
Mnih et al.'s article, Playing Atari with Deep Reinforcement Learning:
https://arxiv.org/abs/1312.5602.
Finally, the DQN and its successors, the Double DQN and Dueling DQN, form the basis for AlphaGo, which was published as Mastering the game of Go
without human knowledge (Silver and others, 2017) in Nature:
https://www.nature.com/articles/nature24270.
Advanced Image Applications
The applications of artificial intelligence in computer vision include robotics,
self-driving cars, facial recognition, recognizing diseases in biomedical images,
and quality control in manufacturing, among many others.
In this chapter, we'll start with image recognition (or image classification),
where we'll look into basic models and more advanced models. We'll then create
images using Generative Adversarial Networks (GANs).
Technical requirements
We'll use many standard libraries, such as NumPy, Keras, and PyTorch, but we'll
also see a few more libraries that we'll mention at the beginning of each recipe as
they become relevant.
You can find the notebooks for this chapter's recipes on GitHub at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/tree/master/chapter07.
Getting ready
Before we can start, we have to install a library. In this recipe, we'll use scikit-
image, a library for image transformations, so we'll quickly set this up:
How to do it...
We'll first load and prepare the dataset, then we'll learn models for classifying clothing items from the Fashion-MNIST dataset using different approaches. Let's start by loading the Fashion-MNIST dataset.
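A minimal sketch of loading the dataset and displaying the first training image:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
plt.imshow(x_train[0], cmap='gray')
plt.show()
```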
We should see the following image of a sneaker, the first image in the training
set:
We'll compare the following approaches:
DoG features
MLP
LeNet
Transfer learning with MobileNet
Let's start with DoG.
Difference of Gaussians
Let's write a function that extracts image features using a Gaussian pyramid:
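The recipe's function isn't reproduced in this excerpt; a hedged sketch that builds a Gaussian pyramid with scikit-image and takes differences between successive levels (a difference-of-Gaussians-style feature vector) might look as follows:

```python
import numpy as np
from skimage.transform import pyramid_gaussian, resize

def extract_features(img, max_layer=3):
    """Differences between successive Gaussian pyramid levels, flattened."""
    levels = list(pyramid_gaussian(img / 255.0, max_layer=max_layer, downscale=2))
    features = []
    for fine, coarse in zip(levels[:-1], levels[1:]):
        upsampled = resize(coarse, fine.shape)   # bring both levels to the same size
        features.append((fine - upsampled).ravel())
    return np.concatenate(features)
```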
We are nearly ready to start learning. We only need to iterate over all the images
and extract our Gaussian pyramid features. Let's create another function that
does that:
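A minimal sketch of such a helper (the name featurize is taken from the text that follows):

```python
import numpy as np

def featurize(images):
    """Apply the pyramid feature extraction to every image in the array."""
    return np.array([extract_features(img) for img in images])
```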
For training the model, we apply the featurize() function to our training dataset. We'll use a linear support vector machine as our model. Then, we'll apply this model to the features extracted from our test dataset. Please note that this might take a while to run:
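A hedged sketch of this step (hyperparameters are illustrative, not the recipe's exact settings):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# fit a linear SVM on the pyramid features and evaluate on the test set
svm = LinearSVC(C=1.0)
svm.fit(featurize(x_train), y_train)
predictions = svm.predict(featurize(x_test))
print(accuracy_score(y_test, predictions))
```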
We get 84% accuracy over the validation dataset from a linear support vector
machine using these features. With some more tuning of the filters, we could
have achieved higher performance, but that is beyond the scope of this recipe.
Before the publication of AlexNet in 2012, this method was one of the state-of-
the-art methods for image classification.
Another way to train a model is to flatten the images and feed the normalized
pixel values directly into a classifier, such as an MLP. That is what we'll try
now.
Multilayer perceptron
A relatively simple way to classify images is with an MLP. In the case of a two-layer MLP with 10 hidden neurons, you can think of the hidden layer as a feature extraction layer of 10 feature detectors.
We have seen examples of MLPs already a few times in this book, so we'll skip
over the details here; perhaps of interest is that we flatten images from 28x28 to
a vector of 784. As for the rest, suffice it to say that we train for categorical
cross-entropy and we'll monitor accuracy.
We'll use the following function to wrap our training set. It should be fairly self-
explanatory:
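The recipe's wrapper function isn't reproduced here; one possible sketch covering the preparation and training just described (layer sizes follow the text, the optimizer is an assumption) is:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.utils import to_categorical

def train_mlp(x_train, y_train, epochs=50):
    # flatten 28x28 images into 784-dimensional vectors, one hidden layer of 10 units
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(10, activation='relu'),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train / 255.0, to_categorical(y_train),
              epochs=epochs, validation_split=0.2)
    return model
```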
After 50 epochs, our accuracy in the validation set is 0.886.
The next model is the classic ConvNet proposed for MNIST, employing
convolutions, pooling, and fully connected layers.
LeNet5
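The recipe's LeNet code is not reproduced in this excerpt; a minimal Keras sketch in the spirit of LeNet5 (adapted to 28x28 single-channel inputs and 10 classes; the images would need a trailing channel dimension, and training proceeds as for the MLP above) could look like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

lenet = Sequential([
    Conv2D(6, kernel_size=5, activation='relu',
           padding='same', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=2),
    Conv2D(16, kernel_size=5, activation='relu'),
    MaxPooling2D(pool_size=2),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(84, activation='relu'),
    Dense(10, activation='softmax'),
])
lenet.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```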
We can also have a look at the confusion matrix to see how well we distinguish
particular pieces of clothing from others:
MobileNet can be downloaded with weights for transfer learning. This means
that we leave most or all of MobileNet's weights fixed. In most cases, we would
only add a new output projection in order to discriminate a new set of classes on
top of the MobileNet representation:
For our transfer model, we have to append a pooling layer, and then we can
append an output layer just as in the previous two neural networks:
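A hedged sketch of this transfer-learning setup: a frozen MobileNetV2 base, a pooling layer, and a new output layer. MobileNet expects 3-channel inputs of at least 32x32 pixels, so the Fashion-MNIST images would have to be resized and stacked to 3 channels beforehand (not shown here); the 96x96 input size is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights='imagenet')
base.trainable = False   # freeze the pretrained weights

transfer_model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax'),
])
transfer_model.compile(optimizer='adam', loss='categorical_crossentropy',
                       metrics=['accuracy'])
```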
Please note that we freeze or fix the weights in the MobileNet model, and only
learn the two layers that we add on top.
The validation accuracy for MobileNet transfer learning is very similar to LeNet
and our MLP: 0.893.
How it works...
Image classification consists of assigning a label to an image, and this was where
the deep learning revolution started.
The following graph, taken from paperswithcode.com, illustrates the performance increase on the ImageNet benchmark for image classification over time:
TOP 1 ACCURACY (also more simply called accuracy) on the y axis is a metric that measures the proportion of correct predictions over all predictions, or in other words, the ratio of how often an object was correctly identified. The state-of-the-art line on the graph has been continuously improving over time (the x axis), reaching an 87.4% accuracy rate with the NoisyStudent method (see here for details: https://paperswithcode.com/paper/self-training-with-noisy-student-improves).
In the following graph, you can see a timeline of deep learning in image
recognition, where you can see the increasing complexity (in terms of the
number of layers) and the decreasing error rate in the ImageNet Large-Scale
Visual Recognition Challenge (ILSVRC):
Difference of Gaussian
We used utility functions from skimage to extract the features, then we applied a
linear support vector machine on top as the classifier. We could have tried other
classifiers, such as random forest or gradient boosting, instead in order to
improve the performance.
LeNet5
You can see the LeNet architecture in the following diagram (created using the
NN-SVG tool at http://alexlenail.me/NN-SVG):
Convolutions are a very important transformation in image recognition and are among the most important building blocks of very deep neural networks for vision. Convolutions consist of feedforward connections, called filters or kernels, which are applied to rectangular patches of the image (the previous layer). Each resulting feature map is obtained by sliding the kernel over the whole image. These convolutional maps are usually followed by subsampling with pooling layers (in the case of our LeNet implementation, the maximum within each pooling window is extracted).
We can retrain (fine-tune) the model to improve performance for our application,
or we could use the model as is and put additional layers on top in order to
classify new classes.
If we wanted to fine-tune the model (with or without the top), we would leave
the base model (MobileNetV2) trainable. Obviously, the training could take
much longer that way since many more layers would need to be trained. That's
why we've frozen all of MobileNetV2's layers during training, setting its
trainable attribute to False.
See also
You can find a review of ConvNet, from LeNet over AlexNet to more recent
architectures, in A Survey of the Recent Architectures of Deep Convolutional
Neural Networks by Khan and others (2020), available from arXiv:
https://arxiv.org/pdf/1901.06032.pdf.
Generating images
Adversarial learning with GANs, introduced by Ian Goodfellow and others in 2014, is a framework for fitting the distribution of a dataset by pitting two networks against each other, in such a way that one model generates examples and the other discriminates whether they are real or not. This can help us to extend our
dataset with new training examples. Semi-supervised training with GANs can
help achieve higher performance in supervised tasks while using only small
amounts of labeled training examples.
We don't need any special libraries for this recipe. We'll use TensorFlow with
Keras, NumPy, and Matplotlib, all of which we've seen earlier. For saving
images, we'll use the Pillow library, which you can install or upgrade as follows:
How to do it...
For our approach with a GAN, we need a generator – a network that takes some
input, which could be noise – and a discriminator, an image classifier, such as
the one seen in the Recognizing clothing items recipe of this chapter.
Both the generator and discriminator are deep neural networks, and the two will
be paired together for training. After training, we'll see the training loss, example
images over epochs, and a composite image of the final epoch.
For training the network, we load the MNIST dataset and normalize it:
The images come in grayscale with pixel values of 0–255. We normalize them into the range of -1 to +1, and then reshape to add a singleton dimension at the end.
For the error to feed through to the generator, we chain the generator with the
discriminator, as follows:
As our optimizer, we'll use Keras stochastic gradient descent:
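The two listings just referenced aren't reproduced here; a hedged sketch of both steps (chaining the models and compiling them with SGD), assuming generator and discriminator are Keras models defined earlier, is:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD

# compile the discriminator for its own training first
discriminator.compile(loss='binary_crossentropy', optimizer=SGD())

# freeze the discriminator inside the chained model so that only the
# generator is updated by gradients flowing back through the chain
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(loss='binary_crossentropy', optimizer=SGD())
```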
Then, the discriminator learns when given fake and real images:
We concatenate the real images (labeled 1) and the fake images (labeled 0) as the input to the discriminator.
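A hedged sketch of one training step (batch size and latent dimension are illustrative assumptions):

```python
import numpy as np

def train_step(x_train, batch_size=128, latent_dim=100):
    # the discriminator sees real images (1) and generated images (0)
    idx = np.random.randint(0, x_train.shape[0], batch_size)
    real = x_train[idx]
    noise = np.random.normal(0, 1, (batch_size, latent_dim))
    fake = generator.predict(noise)

    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])
    d_loss = discriminator.train_on_batch(x, y)

    # the generator is updated through the frozen discriminator with flipped labels
    g_loss = gan.train_on_batch(noise, np.ones(batch_size))
    return d_loss, g_loss
```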
The images are not perfect, but most of them are recognizable as digits.
How it works...
Generative models can generate new data with the same statistics as the training
set, and this can be useful for semi-supervised and unsupervised learning. GANs
were introduced by Ian Goodfellow and others in 2014 (Generative Adversarial
Nets, in NIPS; https://papers.nips.cc/paper/5423-generative-adversarial-nets) and
DCGANs by Alec Radford and others in 2015 (Unsupervised Representation
Learning with Deep Convolutional Generative Adversarial Networks;
https://arxiv.org/abs/1511.06434). Since the original papers, many incremental
improvements have been proposed.
In the GAN technique, the generative network learns to map from a seed – for example, randomized input – to the target data distribution, while the discriminative network evaluates and discriminates data produced by the generator from the true data distribution.
In training, we feed random noise into our generator and then let the
discriminator learn how to classify generator output against genuine images. The
generator is then trained given the output of the discriminator, or rather the
inverse of it. The less likely the discriminator judges an image a fake, the better
for the generator, and vice versa.
See also
The original GAN paper, Generative Adversarial Networks (Ian Goodfellow and
others; 2014), is available from arXiv: https://arxiv.org/abs/1406.2661.
There are many more GAN architectures that are worth exploring. Erik Linder-
Norén implemented dozens of state-of-the-art architectures in both PyTorch and
Keras. You can find them in his GitHub repositories
(https://github.com/eriklindernoren/PyTorch-GAN and
https://github.com/eriklindernoren/Keras-GAN, respectively).
Getting ready
We'll need torchvision for this recipe. This will help us download our dataset.
We'll quickly install it:
For PyTorch, we'll need to get a few preliminaries out of the way, such as to
enable CUDA and set tensor type and device:
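The listing isn't reproduced in this excerpt; a minimal sketch of these preliminaries might be:

```python
import torch

cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')
if cuda:
    # make newly created tensors live on the GPU by default
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
```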
In a break from the style in other recipes, we'll also get the imports out of the
way:
Now, let's get to it.
How to do it...
We'll first get the imports out of the way. Then, we'll load our dataset, define the
model components, including the encoder, decoder, and discriminator, then we'll
do our training, and finally, we'll visualize the resulting representations.
We'll need to set a few global variables that will define training and the dataset.
Then, we load our dataset:
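A hedged sketch of the data loading; 0.1307 and 0.3081 are the commonly used mean and standard deviation of the MNIST training set, and the batch size is an illustrative choice:

```python
import torch
from torchvision import datasets, transforms

batch_size = 128
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transform),
    batch_size=batch_size, shuffle=True)
```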
The normalization transform corresponds to the mean and standard deviation of the MNIST dataset.
We'll show how to use the adversarial autoencoder with and without labels:
As you can see in the comment, we've broken out the training loop. The training
loop looks as follows:
We'll discuss in the How it works... section how the three different losses in this code segment are calculated and back-propagated. Please also note the supervised parameter, which defines whether we want to use supervised or unsupervised training.
The training and validation errors go consistently down, as we can see in the
following graph:
If you run this, compare the generator and discriminator losses – it's interesting
to see how the generator and discriminator losses drive each other.
In the supervised condition, the projections of the encoder space do not have much to do with the classes, as you can see in the following t-SNE plot:
This is a 2D visualization of the encoder's representation space of the digits. The
colors (or shades if you are looking at this in black and white), representing
different digits, are all grouped together, rather than being separated into
clusters. The encoder is not distinguishing between different digits at all.
What is encoded is something else entirely, which is style. In fact, we can vary
the input into the decoder on each of the two dimensions separately in order to
show this:
The first five rows correspond to a linear range of the first dimension with the
second dimension kept constant, then in the next five rows, the first dimension is
fixed and the second dimension is varied. We can appreciate that the first
dimension corresponds to the thickness and the second to inclination. This is
called style transfer.
How it works...
An autoencoder is a network of two parts – an encoder and a decoder – where
the encoder maps the input into a latent space and the decoder reconstructs the
input. Autoencoders can be trained to reconstruct the input by a reconstruction
loss, which is often the squared error between the original input and the restored
input.
Since adversarial autoencoders are GANs, and therefore rely on the competition
between generator and discriminator, the training is a bit more involved than it
would be for a vanilla autoencoder. We calculate three different types of errors:
In our case, we force the prior distribution and the decoder output to lie within the range of 0 to 1, and we can therefore use cross-entropy as the reconstruction error.
It might be helpful to highlight the code segments responsible for calculating the
different types of error.
There's an extra flag for feeding the labels into the decoder as supervised
training. We found that in the supervised setting, the encoder doesn't represent
the digits, but rather the style. We argue that this is the case since, in the
supervised setting, the reconstruction error doesn't depend on the labels
anymore.
Quite a few applications should come to mind when talking about video, such as object tracking, event detection (surveillance), deep fakes, 3D scene reconstruction, and navigation (self-driving cars).
A lot of them require many hours or days of computation. We'll try to strike a sensible compromise between what's possible and what's interesting. This compromise might be felt more here than in other chapters, where computations are not as demanding as they are for video. As part of this compromise, we'll work on videos
frame by frame, rather than across the temporal domain. Still, as always, we'll
try to work on problems by giving examples that are either representative of
practical real-world applications, or that are at least similar.
In this chapter, we'll start with image detection, where an algorithm applies an
image recognition model to different parts of an image in order to localize
objects. We'll then give examples of how to apply this to a video feed. We'll then
create videos using a deep fake model, and reference more related models for
both creating and detecting deep fakes.
Technical requirements
We'll use many standard libraries, including keras and opencv, but we'll see a few more libraries that we'll mention at the beginning of each recipe as they become relevant.
You can find the notebooks for this chapter's recipes on GitHub at
https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-
Cookbook/tree/master/chapter08.
Localizing objects
Object detection refers to identifying objects of particular classes in images and
videos. For example, in self-driving cars, pedestrians and trees have to be
identified in order to be avoided.
Getting ready
For this recipe, we'll need the Python bindings for the Open Computer Vision
Library (OpenCV) and scikit-image:
We'll use code based on the keras-yolo3 library, which was quick to set up with only a few changes. We can quickly download this as well:
Finally, we also need the weights for the YOLOv3 network, which we can
download from the darknet open source implementation:
You should now have the example image, the yolo3-keras Python script, and
the YOLOv3 network weights in your local directory from which you run your
notebook.
How to do it...
We'll import the keras-yolo3 library, load the pretrained weights, and
then perform object detection given images or the video feed from a
camera:
2. We can then load our network with the pretrained weights as follows. Please
note that the weight files are quite big – they'll occupy around 237 MB of
disk space:
Our model is now available as a Keras model.
We should see our example image annotated with labels for each
bounding box, as can be seen in the following screenshot:
We can extend this for videos using the OpenCV library. We can capture
images frame by frame from a camera attached to our computer, run the
object detection, and show the annotated image.
Please note that this implementation is not optimized and might run relatively slowly. For faster
implementations, please refer to the darknet implementation linked in the See also section.
When you run the following code, please know that you can stop the
camera by pressing q:
We capture our image as grayscale, but then have to convert it back to
RGB using scikit-image by stacking the image. Then we detect objects
and show the annotated frame.
How it works...
We've implemented an object detection algorithm with Keras. This came out of
the box with a standard library, but we connected it to a camera and applied it to
an example image.
One of the main requirements of object detection is speed – you don't want to
wait to hit the tree before recognizing it.
Fast R-CNN is an improvement over R-CNN by the same author (2014). Each
region of interest, a rectangular image patch defined by a bounding box, is scale
normalized by image pyramids. The convolutional network can then process
these object proposals (from a few thousand to many thousands)
through a single forward pass of a convolutional neural network. As an
implementation detail, Fast R-CNN compresses fully connected layers with
singular value decomposition for speed.
YOLO is a single network that proposes bounding boxes and classes directly from images in a single evaluation. It was much faster than other detection methods at the time; in their experiments, the authors ran different versions of YOLO at 45 frames per second and 155 frames per second.
The SSD is a single-stage model that does away with the need for a separate object proposal generation step, opting instead for a discrete set of bounding boxes
that are passed through a network. Predictions are then combined across
different resolutions and bounding box locations.
YOLOv4 introduces several new network features and exhibits fast processing speeds while maintaining a level of accuracy significantly superior to YOLOv3 (43.5% average precision (AP) on the MS COCO dataset at a real-time speed of about 65 frames per second on a Tesla V100 GPU).
There's more...
There are different ways of interacting with a web camera, and there are even
some mobile apps that allow you to stream your camera feed, meaning you can
plug it into applications that run on the cloud (for example, Colab notebooks) or
on a server.
One of the most common libraries is matplotlib, and it is also possible to live
update a matplotlib figure from the web camera, as shown in the following code
block:
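A hedged version of such a template, grabbing frames from the default camera with OpenCV and live-updating a matplotlib figure until the kernel is interrupted:

```python
import cv2
import matplotlib.pyplot as plt
from IPython import display

cap = cv2.VideoCapture(0)
fig, ax = plt.subplots(1, 1)
try:
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        ax.clear()
        # OpenCV delivers BGR frames; matplotlib expects RGB
        ax.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        display.display(fig)
        display.clear_output(wait=True)
except KeyboardInterrupt:
    pass
finally:
    cap.release()
```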
This is the basic template for initiating your video feed, and showing it in a
matplotlib subfigure. We can stop by interrupting the kernel.
We'll mention a few more libraries to play with in the next section.
See also
Faking videos
A deep fake is a manipulated video produced by the application of deep learning.
Reports about potentially unethical uses have been in the media for a while. You can imagine how this could end up in the hands of a propaganda operation trying to destabilize a government. Please note that we advise against producing deep fakes for nefarious purposes.
There are ethical applications of the deep fake technology, and some of them are
a lot of fun. Have you ever wondered how Sylvester Stallone may have looked
in Terminator? Today you can!
In this recipe, we'll learn how to create a deep fake with Python. We'll download
public domain videos of two films, and we'll produce a deep fake by replacing
one face with another. Charade was a 1963 film directed by Stanley Donen in a
style reminiscent of a Hitchcock film. It pairs off Cary Grant in his mid-fifties
and Audrey Hepburn in her early 30s. We thought we'd make the pairing more
age-appropriate. After some searching, what we found was Maureen O'Hara in
the 1963 John Wayne vehicle McLintock! to replace Audrey Hepburn.
Getting ready
Faceit is a wrapper around the faceswap library, which facilitates many of the
tasks that we'll need to perform for deep fake. We've forked the faceit repository
at https://github.com/benman1/faceit.
What we have to do is download the faceit repository and install the requisite
library.
You can download (clone) the repository with git (add an exclamation mark if
you are typing this in an ipython notebook):
We found that a Docker container was well-suited for the installation of
dependencies (for this you need Docker installed). We can create a Docker
container like this:
This should take a while to build. Please note that the Docker image is based on
Nvidia's container, so you can use your GPU from within the container.
Please note that, although there is a lightweight model that we could use, we'd
highly recommend you run the deep fake on a machine with a GPU.
Inside the container, we can run Python 3.6. All of the following commands
assume we are inside the container and in the /project directory.
How to do it...
We need to define videos and faces as inputs to our deep fake process.
If we don't provide the images, faceit will extract all face images, irrespective of whom they show. We can then place two of these images into the persons directory, and delete the directories with faces under data/processed/.
2. We then need to define the videos that we want to use. We have the choice of
using the complete films or short clips. We didn't find good clips for the
McLintock! film, so we are using the whole film. As for Charade, we've
focused on the clip of a single scene. We have these clips on disk as
mclintock.mp4 and who_trust.mp4.
Please note that you should only download videos from sites that permit
or don't disallow downloading, even of public domain videos:
This defines the data used by our model as a couple of videos. Faceit
allows an optional third parameter that can be a link to a video, from
where it can be downloaded automatically. However, before you are
downloading videos from YouTube or other sites, please make sure this
is permitted in their terms of service and legal within your jurisdiction.
3. The creation of the deep fake is then initiated by a few more lines of code
(and a lot of tweaking and waiting):
The preprocess step consists of downloading the videos, extracting all the frames
as images, and finally extracting the faces. We are providing the faces already,
so you don't have to perform the preprocess step.
The following image shows Audrey Hepburn on the left, and Maureen O'Hara
playing Audrey Hepburn on the right:
The changes might seem subtle. If you want something clearer, we can use the
same model to replace Cary Grant with Maureen O'Hara:
In fact, we could produce a film, Being Maureen O'Hara, by disabling the face
filter in the conversion.
We could have used more advanced models, more training to improve the deep
fake, or we could have chosen an easier scene. However, the result doesn't look
bad at all sometimes. We've uploaded our fake video to YouTube, where you
can view it: https://youtu.be/vDLxg5qXz4k.
How it works...
The typical deep fake pipeline consists of a number of steps that we
conveniently glossed over in our recipe, because of the abstractions afforded in
faceit. These steps are the following, given person A and person B, where A is to
be replaced by B:
In our case, the face recognition library (face-recognition) has very good performance in terms of detection and recognition. However, it still suffers from both false positives and false negatives. This can result in a poor experience, especially in frames where there are several faces.
In the current version of the faceswap library, we would extract frames from our
target video in order to get landmarks for all the face alignments. We can then
use the GUI in order to manually inspect and clean up these alignments in order
to make sure they contain the right faces. These alignments will then be used for
the conversion: https://forum.faceswap.dev/viewtopic.php?t=27#align.
Each of these steps requires a lot of attention. At the heart of the whole operation
is the model. There can be different models, including a generative adversarial
autoencoder and others. The original model in faceswap is an autoencoder with a
twist. We've used autoencoders before in Chapter 7, Advanced Image
Applications. This one is relatively conventional, and we could have taken our
autoencoder implementation from there. However, for the sake of completeness,
we'll show its implementation, which is based on keras/tensorflow (shortened):
This code, in itself, is not terribly interesting. We have two functions, Decoder()
and Encoder(), which return decoder and encoder models, respectively. This is
an encoder-decoder architecture with convolutions. The PixelShuffle layer in
the upscale operation of the decoder rearranges data from depth into blocks of
spatial data through a permutation.
Now, the more interesting part of the autoencoder is in how the training is
performed as two models:
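The original listing isn't reproduced here; a hedged sketch of the idea (a single shared encoder, two separate decoders; the Encoder()/Decoder() builders mentioned above are assumed, and the input shape and optimizer settings are assumptions) could look like this:

```python
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

IMAGE_SHAPE = (64, 64, 3)                 # assumed face-patch size
x = Input(shape=IMAGE_SHAPE)

encoder = Encoder()                       # builders as mentioned in the text
decoder_A, decoder_B = Decoder(), Decoder()

# one shared encoder, two separate decoders: a common latent space for A and B
autoencoder_A = Model(x, decoder_A(encoder(x)))
autoencoder_B = Model(x, decoder_B(encoder(x)))
for autoencoder in (autoencoder_A, autoencoder_B):
    autoencoder.compile(optimizer=Adam(5e-5),
                        loss='mean_absolute_error')
```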
We have two autoencoders, one to be trained on A faces and one on B faces. Both
autoencoders are minimizing the reconstruction error (measured in mean
absolute error) of output against input. As mentioned, we have a single encoder
that forms part of the two models, and is therefore going to be trained both on
faces A and faces B. The decoder models are kept separate between the two faces.
This architecture ensures that we have a common latent representation between A
faces and B faces. In the conversion, we can take a face from A, represent it, and
then apply the decoder for B in order to get a B face corresponding to the latent
representation.
See also
We've put together some further references relating to playing around with
videos and deep fakes, as well as detecting deep fakes.
Deep fakes
We've collated a few links relevant to deep fakes and some more links that are
relevant to the process of creating deep fakes.
The face recognition library has been used in this recipe to select image regions
for training and application of the transformations. It is available on GitHub at
https://github.com/ageitgey/face_recognition.
As for more complex video manipulations with deep fakes, quite a few tools are
available, of which we'll highlight two:
The faceswap library has a GUI and even a few guides:
https://github.com/deepfakes/faceswap.
DeepFaceLab is a GUI application for creating deep fakes:
https://github.com/iperov/DeepFaceLab.
Many different models have been proposed and implemented, including the
following:
The paper DeepFakes and Beyond: A Survey of Face Manipulation and Fake
Detection (Ruben Tolosana and others, 2020) provides more links and more
resources to datasets and methods.
Deep Learning in Audio and Speech
In this chapter, we'll deal with sounds and speech. Sound data comes in the form
of waves, and therefore requires different preprocessing than other types of data.
We'll implement several applications with sound and speech in this chapter.
We'll first do a simple example of a classification task, where we try to
distinguish different words. This would be a typical application in a smart home
device to distinguish different commands. We'll then look at a text-to-speech
architecture. You could apply this to create your own audio books from text, or
for the voice output of your home-grown smart home device. We'll close with a
recipe for generating music. This is perhaps more of a niche application in the
commercial sense, but you could build your own music for fun or to entertain
users of your video game.
For the recipes in this chapter, please make sure you have a GPU available. On
Google Colab, make sure you activate a GPU runtime.
Getting ready
For this recipe, we'll need the librosa library as mentioned at the start of the
chapter. We'll also need to download the Speech Commands dataset, and for that
we'll need to install the wget library first:
Alternatively, we could use the !wget system command in Linux and macOS.
We'll create a new directory, download the archive with the dataset, and extract
the tarfile:
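A hedged sketch of this step; the URL points to version 0.01 of the Google Speech Commands dataset:

```python
import os
import tarfile
import wget

os.makedirs('data/train', exist_ok=True)
archive = wget.download(
    'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz',
    out='data')
with tarfile.open(archive) as tar:
    tar.extractall('data/train')
```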
This gives us a number of files and directories within the data/train directory:
Most of these refer to speech commands; for example, the bed directory contains
examples of the bed command.
How to do it...
In this recipe, we'll train a neural network to recognize voice commands. This
recipe is inspired by the TensorFlow tutorial on speech commands at
https://www.tensorflow.org/tutorials/audio/simple_audio.
We'll first perform data exploration, then we'll import and preprocess our dataset
for training, and then we will create a model, train it, and check its performance
in validation:
1. Let's start with some data exploration: we'll listen to a command, look at its
waveform, and then at its spectrum. The librosa library provides
functionality to load sound files into a vector:
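A minimal sketch (the file path is a hypothetical example from the bed directory):

```python
import librosa

# librosa returns the waveform as a vector together with the sampling rate
wav, sampling_rate = librosa.load('data/train/bed/0a7c2a8d_nohash_0.wav')
print(wav.shape, sampling_rate)
```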
We can also get a Jupyter widget for listening to sound files or to the
loaded vector:
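A minimal sketch of the widget, using the vector loaded above:

```python
from IPython.display import Audio

# either point the widget at a file, or pass the loaded vector with its rate
Audio(wav, rate=sampling_rate)
```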
Pressing play, we hear the sound. Note that this works even over a
remote connection, for example, if we use Google Colab.
2. Now, let's get to the data importing and preprocessing. We have to iterate
over files, and store them as a vector:
For simplicity, we are only taking three commands here: bed, bird, and
tree. This is enough to illustrate the problems and the application of a
deep neural network to sound classification, and is simple enough that it
won't take very long. This process can, however, still take a while. It took
about an hour on Google Colab.
3. Let's create a deep learning model and then train and test it. First we need to
create our model and normalization. Let's do the normalization first:
We should see something like 0.805 as the output for the model accuracy in the
validation set.
How it works...
Sound is not that different from other domains, except for the preprocessing. It's
important to have at least a basic understanding about how sound is stored in a
file. At their most basic level, sounds are stored as amplitude over time and
frequency. Sounds are sampled at discrete intervals (this is the sampling rate).
48 kHz would be a typical recording quality for a DVD, and refers to a sampling
frequency of 48,000 times per second. The bit depth (also known as the dynamic
range) is the resolution for the amplitude of the signal (for example, 16 bits
means a range of 0-65,535).
For machine learning, we can do feature extraction from the waveform, and use
1D convolutions on the raw waveforms, or 2D convolutions on the spectrogram
representation (for example, Mel spectrograms – Davis and Mermelstein,
Experiments in syllable-based recognition of continuous speech, 1980). We've
dealt with convolutions before, in Chapter 7, Advanced Image Applications.
Briefly, convolutions are feedforward filters that are applied to rectangular
patches over the layer input. The resulting maps are usually followed by
subsampling by pooling layers.
The convolutional layers can be stacked very deeply (for example, Dai and
others, 2016: https://arxiv.org/abs/1610.00087). We've made it easy for the
reader to experiment with stacked layers. The number of layers, nlayers, is one
of the parameters in create_model().
Apart from librosa, useful libraries for audio processing in Python include
pydub (https://github.com/jiaaro/pydub) and scipy. The pyAudioProcessing
library comes with feature extraction and classification functionality for audio:
https://github.com/jsingh811/pyAudioProcessing.
There are a few more libraries and repositories that are interesting to explore:
For this recipe, please make sure you have a GPU available. On Google Colab,
make sure you activate a GPU runtime. We'll also need the wget library, which
we can install from the notebook as follows:
We also need to clone the pytorch-dc-tts repository from GitHub and install
its requirements. Please run this from the notebook (or run it from the terminal
without the leading exclamation marks):
Please note that you need to have Git installed in order for this to work. If you
don't have Git installed, you can download the repository directly from within
your web browser.
How to do it...
We'll download the Torch model files, load them up in Torch, and then we'll
synthesize speech from sentences:
1. Downloading the model files: We'll download the dataset from dropbox:
2. Loading the model: Let's get the dependencies out of the way:
Now we can load the model:
How it works...
In this recipe, we've loaded the model published by Hideyuki Tachibana and
others, Efficiently Trainable Text-to-Speech System Based on Deep
Convolutional Networks with Guided Attention (2017;
https://arxiv.org/abs/1710.08969). We used the implementation at
https://github.com/tugstugi/pytorch-dc-tts.
The architecture consists of the following modules:
Text encoder
Audio encoder
Attention
Audio decoder
The interesting part of this is the guided attention mentioned in the title of the paper, which is responsible for the alignment of characters with time. Using a guided attention loss, they constrain this attention matrix to be nearly linear with time, as opposed to reading characters in random order:
This favors values on the diagonal of the matrix rather than off it. They argue
that this constraint helps to speed up the training time considerably.
WaveGAN
Donahue and others train a GAN in an unsupervised setting for the synthesis of
raw audio waveforms. They try two different strategies:
For the first strategy, they had to develop a spectrogram representation that they could convert back to audio.
For the WaveGAN, they flattened the 2D convolutions into 1D while keeping
the size (for example, a kernel of 5x5 became a 1D kernel of 25). Strides of 2x2
became 4. They removed the batch normalization layers. They trained using a
Wasserstein GAN-GP strategy (Ishaan Gulrajani and others, 2017; Improved
training of Wasserstein GANs; https://arxiv.org/abs/1704.00028).
There's more...
We can also use the WaveGAN model to synthesize speech.
This should show us two examples of generated sounds, each with a Jupyter
widget:
If these don't sound particularly natural, don't be afraid. After all, we've used a
random initialization of the latent space.
See also
Generating melodies
Artificial intelligence (AI) in music is a fascinating topic. Wouldn't it be cool if
your favorite group from the 70s brought out new songs, but maybe with a more modern sound? Sony did this with the Beatles, and you can hear a song on YouTube,
complete with automatically generated lyrics, called Daddy's car:
https://www.youtube.com/watch?v=LSHZ_b05W7o.
Getting ready
If you are on Colab, you need another tweak to allow Python to find your system
libraries:
This is a clever workaround for Python's foreign library import system, taken
from the original Magenta tutorial, at
https://colab.research.google.com/notebooks/magenta/hello_magenta/hello_mag
enta.ipynb.
How to do it...
We'll first put together the start of a melody, and then we will load the
MelodyRNN model from Magenta and let it continue the melody:
1. Let's put a melody together. We'll take Twinkle Twinkle Little Star. The
Magenta project works with a note sequence representation called
NoteSequence, which comes with many utilities, including conversion to and
from MIDI. We can add notes to a sequence like this:
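A hedged sketch of the opening of Twinkle Twinkle Little Star as a NoteSequence; depending on your Magenta version, the protobuf lives under note_seq.protobuf or magenta.music.protobuf:

```python
from note_seq.protobuf import music_pb2

twinkle = music_pb2.NoteSequence()
# pitch is the MIDI note number; 60 is middle C (C C G G A A G)
for i, pitch in enumerate([60, 60, 67, 67, 69, 69, 67]):
    twinkle.notes.add(pitch=pitch, start_time=0.5 * i,
                      end_time=0.5 * (i + 1), velocity=80)
twinkle.total_time = 3.5
twinkle.tempos.add(qpm=60)
```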
We can visualize the sequence using Bokeh, and then we can play the
note sequence:
This should only take a few seconds. The Magenta model is remarkably
small compared to some other models we've encountered in this book.
We can now feed in our previous melody, along with a few parameters in
order to continue the song:
Once again, we get the Bokeh library plot and a play widget:
We can create a MIDI file from our note sequence like this:
We can feed different melodies via MIDI files into the model, or we can try with
other parameters; we can increase or decrease the randomness (the temperature
parameter), or let the sequence continue for longer periods (the num_steps
parameter).
How it works...
MelodyRNN is an LSTM-based language model for musical notes. In order to
understand MelodyRNN, we first need to understand how Long Short-Term
Memory (LSTM) works. Published in 1997 by Sepp Hochreiter and Jürgen
Schmidhuber (Long short-term memory:
https://doi.org/10.1162%2Fneco.1997.9.8.1735), and updated numerous times
since, LSTM is the most well-known example of a Recurrent Neural Network
(RNN) and represents a state-of-the-art model for image recognition and
machine learning tasks with sequences such as speech recognition, natural
language processing, and time series. LSTMs were, or have been, behind
popular tools by Google, Amazon, Microsoft, and Facebook for voice
recognition and language translation.
The basic unit of an LSTM layer is an LSTM cell, which consists of several
regulators, which we can see in the following schematic:
This diagram is based on Alex Graves and others, Speech recognition with deep
recurrent neural networks, (2013), taken from the English language Wikipedia
article on LSTMs at https://en.wikipedia.org/wiki/Long_short-term_memory.
The regulators include the following:
An input gate
An output gate
A forget gate
We can explain the intuition behind these gates without getting lost in the
equations. An input gate regulates how strongly the input influences the cell, an
output gate dampens the outgoing cell activation, and the forget gate is a decay
on the cell activity.
See also
Please note that Magenta has different variations of the MelodyRNN model
available
(https://github.com/magenta/magenta/tree/master/magenta/models/melody_rnn).
Apart from MelodyRNN, Magenta provides further models, including a
variational autoencoder for music generation, and many browser-based tools for
exploring and generating music: https://github.com/magenta/magenta.
You can find the original implementation of Parag K. Mital's NIPS paper, Time
Domain Neural Audio Style Transfer (2017; https://arxiv.org/abs/1711.11160), at
https://github.com/pkmital/time-domain-neural-audio-style-transfer.
Natural Language Processing
Natural language processing (NLP) is about analyzing texts and designing
algorithms to process texts, making predictions from texts, or generating more
text. NLP covers anything related to language, often including speech similar to
what we saw in the Recognizing voice commands recipe in Chapter 9, Deep
Learning in Audio and Speech. You might also want to refer to the Battling
algorithmic bias recipe in Chapter 2, Advanced Topics in Supervised Machine
Learning, or the Representing for similarity search recipe in Chapter 3, Patterns,
Outliers, and Recommendations, for more traditional approaches. Most of this
chapter will deal with the deep learning models behind the breakthroughs in
recent years.
In this chapter, we'll cover the following recipes:
Classifying newsgroups
Chatting to users
Translating a text from English to German
Writing a popular novel
Technical requirements
As in most chapters so far, we'll try both PyTorch and TensorFlow-based
models. We'll apply different, more specialized libraries in each recipe.
Classifying newsgroups
In this recipe, we'll do a relatively simple supervised task: based on texts, we'll
train a model to determine what an article is about, from a selection of topics.
This is a relatively common task with NLP; we'll try to give an overview of
different ways to approach this.
You might also want to compare the Battling algorithmic bias recipe in Chapter
2, Advanced Topics in Supervised Machine Learning, on how to approach this
problem using a bag-of-words approach (CountVectorizer in scikit-learn). In
this recipe, we'll use approaches based on word embeddings and deep learning models built on top of them.
Getting ready
We'll be using a dataset from scikit-learn, but we still need to download the word
embeddings. We'll use Facebook's fastText word embeddings trained on
Wikipedia:
Please note that the download can take a while and should take around 6 GB of disk space. If you are
running on Colab, you might want to put the embedding file into a directory of your Google Drive, so you
don't have to download it again when you restart your notebook.
How to do it...
First, we'll download the dataset using scikit-learn functionality. We'll download the newsgroups dataset in two batches, for training and testing, respectively:
This conveniently gives us training and test datasets, which we can use in the
three approaches.
Let's begin with covering the first one, using a bag-of-words approach.
Bag-of-words
We'll build a pipeline of counting words and reweighing them according to their
frequency. The final classifier is a random forest. We train this on our training
dataset:
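A hedged sketch of the pipeline just described (the hyperparameters are illustrative, not the recipe's exact settings, and the fetch call repeats the omitted download step):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

twenty_train = fetch_20newsgroups(subset='train')

text_clf = Pipeline([
    ('vect', CountVectorizer()),        # count tokens
    ('tfidf', TfidfTransformer()),      # reweigh counts by TFIDF
    ('clf', RandomForestClassifier(n_estimators=100)),
])
text_clf.fit(twenty_train.data, twenty_train.target)
```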
CountVectorizer counts tokens in texts and TfidfTransformer reweighs the counts. We'll discuss the term frequency-inverse document frequency (TFIDF) reweighting in the How it works... section.
After the training, we can test the accuracy on the test dataset:
We get an accuracy of about 0.805. Let's see how our other two methods will do.
Using word embeddings is next.
Word embeddings
We'll apply this vectorization to our dataset and then train a random forest
classifier on top of these vectors:
Let's see whether our last method does any better than this. We'll build
customized word embeddings using Keras' embedding layer.
This creates the dictionary. Now we need to tokenize the text and pad sequences
to the right length:
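A hedged sketch of this step; the vocabulary size and maximum length are illustrative choices, the datasets are assumed to come from the earlier download, and the tokenizer fit repeats the dictionary-creation step just mentioned:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words, maxlen = 10000, 500
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(twenty_train.data)

x_train_seq = pad_sequences(tokenizer.texts_to_sequences(twenty_train.data),
                            maxlen=maxlen)
x_test_seq = pad_sequences(tokenizer.texts_to_sequences(twenty_test.data),
                           maxlen=maxlen)
```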
Now we are ready to build our neural network:
Our model contains half a million parameters. Approximately half of them sit in
the embedding, and the other half in the feedforward fully connected layer.
We fit our networks for a few epochs, and then we can test our accuracy on the
test data:
We get about 0.902 accuracy. We haven't tweaked the model architecture yet.
How it works...
We've already talked about the Skipgram and the Continuous Bag of Words
(CBOW) algorithms in the Making decisions based on knowledge recipe in
Chapter 5, Heuristic Search Techniques and Logical Inference (within the
Graph embedding with Walklets subsection).
Very briefly, word vectors are a simple machine learning model that can predict
the next word based on the context (the CBOW algorithm) or can predict the
context based on a single word (the Skipgram algorithm). Let's quickly look at
the CBOW neural network.
This illustration shows how, in the CBOW model, words are predicted based on
the surrounding context. Here, words are represented as bag-of-words vectors.
The hidden layer is composed of a weighted average of the context (linear
projection). The output word is a prediction based on the hidden layer. This is
adapted from an image on the French-language Wikipedia page on word
embeddings: https://fr.wikipedia.org/wiki/Word_embedding.
Intuitively, a king and a queen are similar societal positions, only one is taken up
by a man, the other by a woman. This is reflected in the embedding space
learned on billions of words. Starting with the vector of king, subtracting the
vector of man, and finally adding the vector of woman, the closest word that we
end up at is queen.
The embedding space can tell us a lot about how we use language, some of it a bit concerning, such as when the word vectors exhibit gender stereotypes (for example, analogies that pair professions with a particular gender).
TFIDF
Term frequency-inverse document frequency weights each token count by how informative the token is across the corpus: the term frequency (how often a word occurs in a document) is multiplied by the inverse document frequency, typically log(N / df), where N is the number of documents and df is the number of documents that contain the word. Frequent but uninformative words, such as the, are thereby downweighted.
In the next recipes of this chapter, we'll go beyond the encodings of single words
and study more complex language models.
There's more...
We'll briefly look at learning your own word embeddings using Gensim,
building more complex deep learning models, and using pre-trained word
embeddings in Keras:
Let's read in a text file in order to feed it as the training dataset for
fastText:
This can be useful for transfer learning, search applications, or for cases
when learning the embeddings would take too long. Using Gensim, this
is only a few lines of code (adapted from the Gensim documentation).
The training itself is straightforward and, since our text file is small,
relatively quick:
You can find the Crime and Punishment novel at Project Gutenberg, where there are many more classic
novels: http://www.gutenberg.org/ebooks/2554.
You can retrieve vectors from the trained model like this:
2. Building more complex deep learning models: for more difficult problems,
we can use stacked conv1d layers on top of the embedding, as follows:
For training and testing, you have to feed in the word indices by looking
them up in our new dictionary and pad them to the same length as we've
done before.
See also
GloVe: https://nlp.stanford.edu/projects/glove/
fastText: https://fasttext.cc/docs/en/crawl-vectors.html
Word2vec: https://code.google.com/archive/p/word2vec/
Gensim: https://radimrehurek.com/gensim/
fastText: https://fasttext.cc/
spaCy: https://spacy.io/
Kashgari is a library built on top of Keras for text labeling and text classification
and includes Word2vec and more advanced models such as BERT and GPT2
language embeddings: https://github.com/BrikerMan/Kashgari.
Chatting to users
In 1966, Joseph Weizenbaum published an article about his chatbot ELIZA,
called ELIZA—a computer program for the study of natural language
communication between man and machine. Created with a sense of humor to
show the limitations of technology, the chatbot employed simplistic rules and
vague, open-ended questions as a way of giving an impression of empathic
understanding in the conversation, and, in an ironic twist, was often seen as a milestone of artificial intelligence. The field has moved on, and today, AI
assistants are around us: you might have an Alexa, a Google Echo, or any of the
other commercial home assistants in the market.
In this recipe, we'll be building an AI assistant. The difficulty is that there is an infinite number of ways for people to express themselves, and it is simply impossible to anticipate everything your users might say. We'll therefore train a model to infer what users want, and respond accordingly.
Getting ready
For this recipe, we'll be using a framework developed by Fariz Rahman called
Eywa. We'll install it with pip from GitHub:
Eywa has the main capabilities of what's expected from a conversational agent,
and we can look at its code for some of the modeling that's behind its power.
We are also going to be using the OpenWeatherMap Web API through the pyOWM
library, so we'll install this library as well:
With this library, we can request weather data in response to user requests as
part of our chatbot functionality. If you want to use this in your own chatbot, you
should register a free user account and get your API key on
OpenWeatherMap.org for up to 1,000 requests a day.
How to do it...
Our agent will process sentences by the user, interpret them, and respond
accordingly. It will first predict the intent of user queries, and then extract
entities in order to know more precisely what the query is about, before returning
an answer:
1. Let's start with the intent classes – based on a few samples of phrases each,
we'll define intents such as greetings, taxi, weather, datetime, and music:
This is to check for a specific place for the weather prediction. We can
test the entity extraction for the weather as well:
We ask for the weather in London, and, in fact, our entity extraction
successfully comes back with the place name:
4. Let's create some interaction based on the classifier and entity extraction.
We'll write a response function that can greet, tell the date, and give a weather
forecast:
We are leaving out functionality for calling taxis or playing music:
The question_and_answer() function answers a user query.
This wraps up our recipe. We've implemented a simple chatbot that first
predicts intent and then extracts entities. Based on intent and entities, a user
query is answered based on rules.
You should be able to ask for the date and the weather in different places; however, it will tell you to upgrade your software if you ask for taxis or music.
You should be able to implement and extend this functionality by yourself if you
are interested.
How it works...
We've implemented a very simple, though effective, chatbot for basic tasks. It
should be clear how this can be extended and customized for more or other
tasks.
ELIZA
These are excerpts from Jez Higgins' ELIZA knock-off on GitHub: https://github.com/jezhiggins/eliza.py.
Sadly, perhaps, experiences with call centers might seem similar. They often employ scripts as well, such as the following:
<Greeting>
"Thank you for calling, my name is _. How can I help you today?"
...
"Do you have any other questions or concerns that I can help you with today?"
While for machines, in the beginning, it is easier to hardcode some rules, if you
want to handle more complexity, you'll be building models that interpret
intentions and references such as locations.
Eywa
Eywa's three main functionalities – intent classification, entity extraction, and pattern matching – are all very simple to use, though quite powerful. We've seen the first two in action in the How to do it... section. Let's see the pattern matching for food types based on semantic context:
We create a variable food with sample values: pizza, banana, yogurt, and
kebab. Using food terms in similar contexts will match our variables. The
expression should return this:
The usage looks very similar to regular expressions, however, while regular
expressions are based on words and their morphology, eywa.nlu.Pattern works
semantically, anchored in word embeddings.
A regular expression (short: regex) is a sequence of characters that define a search pattern. It was first formalized by Stephen Kleene and implemented by Ken Thompson and others in Unix tools such as QED, ed, grep, and sed in the 1960s. This syntax has entered the POSIX standard and is therefore sometimes referred
to as POSIX regular expressions. A different standard emerged in the late 1990s with the Perl
programming language, termed Perl Compatible Regular Expressions (PCRE), which has been adopted
in different programming languages, including Python.
First of all, the eywa library relies on sense2vec word embeddings from
explosion.ai. Sense2vec word embeddings were introduced by Andrew Trask
and others (sense2vec – A Fast and Accurate Method for Word Sense
Disambiguation In Neural Word Embeddings, 2015). This idea was taken up by
explosion.ai, who trained part-of-speech disambiguated word embeddings on
Reddit discussions. You can read up on these on the explosion.ai website:
https://explosion.ai/blog/sense2vec-reloaded.
The classifier goes through the stored conversational items and picks out the
match with the highest similarity score based on these embeddings. Please note
that eywa has another model implementation based on recurrent neural networks.
See also
Libraries and frameworks abound for creating chatbots with different ideas and
integrations:
ParlAI is a library for training and testing dialog models. It comes with more than 80 dialog datasets out of the box, as well as integration with Facebook Messenger and Mechanical Turk:
https://github.com/facebookresearch/ParlAI.
NVIDIA has its own toolkit for conversational AI applications, which comes with many modules providing additional functionality, such as automatic speech recognition and speech synthesis: https://github.com/NVIDIA/NeMo.
Google Research open sourced their code for an open-domain dialog system:
https://github.com/google-research/google-research/tree/master/meena.
Rasa incorporates feedback on every interaction to improve the chatbot:
https://rasa.com/.
ChatterBot, a spaCy-based library:
https://spacy.io/universe/project/Chatterbot.
Getting ready
This tells you that you are using an NVIDIA Tesla T4 with 0 MiB of its roughly 15 GB of memory used (1 MiB corresponds to approximately 1.049 MB).
We'll need a relatively new version of torchtext, a library with text datasets and
utilities for pytorch:
For the part in the There's more... section, you might need to install an additional
dependency:
We are using spaCy for tokenization. It comes preinstalled in Colab; in other environments, you might have to pip-install it. We do need to install spaCy's German core model, which provides the German tokenization we'll rely on in this recipe:
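For example, with a recent spaCy version (older versions used the shorthand model name de), the download is done from the terminal:

python -m spacy download de_core_news_sm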
We'll load up this functionality in the main part of the recipe.
How to do it...
In this recipe, we'll be implementing a transformer model from scratch, and we'll
be training it for a translation task. We've adapted this notebook from Ben
Trevett's excellent tutorials on implementing a transformer sequence-to-
sequence model with PyTorch and TorchText:
https://github.com/bentrevett/pytorch-seq2seq.
We'll first prepare the dataset, then implement the transformer architecture, then
we'll train, and finally test:
1. Preparing the dataset – let's import all the required modules upfront:
The dataset we'll be training on is the Multi30k dataset. This is a dataset
of about 30,000 parallel English, German, and French short sentences.
These functions tokenize German and English text from a string into a
list of strings.
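A condensed sketch of this preparation step, along the lines of Ben Trevett's tutorial that this recipe adapts (the spaCy model names and the legacy torchtext interface are assumptions that may differ for newer library versions):

import spacy
from torchtext.data import Field, BucketIterator
from torchtext.datasets import Multi30k

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True)

train_data, valid_data, test_data = Multi30k.splits(
    exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)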
The mask in the self-attention layer is to avoid the model including the next token in its prediction (which
would be cheating).
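A sketch of how such a target mask can be built with PyTorch, combining a padding mask with a lower-triangular subsequent mask:

import torch

def make_trg_mask(trg, trg_pad_idx):
    # hide padding tokens
    trg_pad_mask = (trg != trg_pad_idx).unsqueeze(1).unsqueeze(2)
    # lower-triangular mask: position i may only attend to positions <= i
    trg_len = trg.shape[1]
    trg_sub_mask = torch.tril(
        torch.ones((trg_len, trg_len), device=trg.device)).bool()
    return trg_pad_mask & trg_sub_mask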
3. Training the translation model, we can initialize the parameters using Xavier uniform initialization:
We need to set the learning rate much lower than the default:
In our loss function, CrossEntropyLoss, we have to make sure to ignore
padded tokens:
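Putting these three points together could look as follows; this is a sketch that assumes the model instance and the TRG field from the earlier steps, and the learning rate is an example value:

import torch
import torch.nn as nn

def initialize_weights(m):
    # Xavier (Glorot) uniform initialization for all weight matrices
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)

model.apply(initialize_weights)

LEARNING_RATE = 0.0005  # example value, well below Adam's default of 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)  # don't penalize padding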
4. Testing the model, we'll first have to write functions to encode a sentence for
the model and decode the model output back to get a sentence. Then we can
run some sentences and have a look at the translations. Finally, we can
calculate a metric of the translation performance across the test set.
We can compare this with the translation we get from our model:
Our translation actually looks better than the reference translation: a purse is not really a wallet (Geldbörse), but rather a small bag (Handtasche).
We can then calculate a metric, the BLEU score, of our model versus the
gold standard:
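A sketch of the BLEU calculation with torchtext's built-in metric, assuming the translate_sentence() decoding helper written earlier in this step as well as the fields, data, model, and device from the previous steps:

from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len=50):
    trgs, pred_trgs = [], []
    for example in data:
        src, trg = vars(example)['src'], vars(example)['trg']
        pred_trg, _ = translate_sentence(
            src, src_field, trg_field, model, device, max_len)
        pred_trgs.append(pred_trg[:-1])  # cut off the <eos> token
        trgs.append([trg])
    return bleu_score(pred_trgs, trgs)

print(calculate_bleu(test_data, SRC, TRG, model, device))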
We get a BLEU score of 33.57, which is not bad given that we train comparatively few parameters and the training finishes in a matter of minutes.
In translation, a useful metric is the Bilingual Evaluation Understudy (BLEU) score, where 1 (or 100, depending on the scaling) is the best possible value. It measures the overlap of parts of the candidate translation with parts of a reference translation (gold standard), where parts can be single words or sequences of words (n-grams).
This wraps up our translation model. We can see it's actually not that hard to
create a translation model. However, there's quite a lot of theory, part of which
we'll cover in the next section.
How it works...
Until not long ago, Long Short-Term Memory networks (LSTMs) were the prevalent choice of deep learning model for sequence data; however, since words are processed sequentially, training can take a long time to converge. We have seen
in previous recipes how recurrent neural networks can be used for sequence
processing (please compare it with the Generating melodies recipe in Chapter 9,
Deep Learning in Audio and Speech). In yet other recipes, for example, the
Recognizing voice commands recipe in Chapter 9, Deep Learning in Audio and
Speech, we discussed how convolutional models have been replacing these
recurrent networks with an advantage in speed and prediction performance. In
NLP, convolutional networks have been tried as well (for example, Jonas
Gehring and others, Convolutional Sequence to Sequence Learning, 2017) with
improvements in speed and prediction performance relative to recurrent models; however, the transformer architecture proved more powerful and faster still.
The transformer architecture was originally created for machine translation
(Ashish Vaswani and others, Attention is All you Need, 2017). Dispensing with
recurrence and convolutions, transformer networks are much faster to train and
predict since words are processed in parallel. Transformer architectures provide
universal language models that have pushed the envelope in a broad set of tasks
such as Neural Machine Translation (NMT), Question Answering (QA),
Named-Entity Recognition (NER), Textual Entailment (TE), abstractive text
summarization, and other tasks. Transformer models are often taken off the shelf
and fine-tuned for specific tasks in order to profit from general language
understanding acquired through a long and expensive training process.
Like other sequence-to-sequence models, the transformer consists of two parts:
An encoder – it encodes the input into a series of context vectors (also known as hidden states).
A decoder – it takes the context vectors and decodes them into a target representation.
The differences between the implementation in our recipe and the original transformer implementation (Ashish Vaswani and others, Attention is All you Need, 2017) are the following:
The input to the encoder passes through a stack of modules, each consisting of self-attention, feedforward fully connected layers, and normalization. The attention layers are linear combinations of scaled multiplicative (dot product) attention layers (Multi-Head Attention).
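At the core of each attention head sits scaled dot-product attention, which can be sketched in a few lines of PyTorch:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: [batch, heads, sequence length, head dimension]
    d_k = query.shape[-1]
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention = F.softmax(scores, dim=-1)
    return torch.matmul(attention, value), attention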
Some transformer architectures contain only one of the two parts. For example, the OpenAI GPT architecture (Alec Radford and others, Improving Language Understanding by Generative Pre-Training, 2018), which generates amazingly coherent texts, consists of stacked decoders, while Google's BERT architecture (Jacob Devlin and others, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019) consists of stacked encoders.
There's more...
Both Torch and TensorFlow have a repository for pretrained models. We can
download a translation model from the Torch hub and use it straight away. This
is what we'll quickly show. For the pytorch model, we need to have a few
dependencies installed first:
After this, we can download the model. It is quite big, which means it'll take up a
lot of disk space:
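A sketch following the usage documented on the fairseq Torch hub page (the checkpoint is an ensemble of four models, which is why the download is so large):

import torch

en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de',
    checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
    tokenizer='moses',
    bpe='fastbpe',
)
en2de.eval()
print(en2de.translate('Machine learning is great!'))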
This model (Nathan Ng and others, Facebook FAIR's WMT19 News Translation
Task Submission, 2019) is state-of-the-art for translation. It even outperforms
human translations in precision (BLEU score). fairseq comes with tutorials for
training translation models on your own datasets.
The Torch hub provides a lot of different translation models, but also generic
language models.
See also
You can find a guide about the transformer architecture complete with PyTorch
code (and an explanation on positional embeddings) on the Harvard NLP group
website, which can also run on Google Colab:
http://nlp.seas.harvard.edu/2018/04/03/attention.html.
Lilian Weng of OpenAI has written about language modeling and transformer
models, and provides a concise and clear overview:
The same could be said of some human essays and utterances, however. Nassim
Taleb, in his book Fooled by Randomness, argued a person should be called
unintelligent if their writing could not be distinguished from an artificially
generated one (a reverse Turing test). In a similar vein, Alan Sokal's 1996 hoax
article Transgressing the Boundaries: Towards a Transformative Hermeneutics
of Quantum Gravity, accepted by and published in a well-known social science
journal, was a deliberate attempt by the university professor of physics to expose
a lack of intellectual rigor and the misuse of scientific terminology without
understanding. A possible conclusion could be that imitating humans might not
be the way forward toward intellectual progress.
OpenAI's GPT-3, with 175 billion parameters, has pushed the field of language models forward considerably: it has learned facts about physics, can generate programming code from descriptions, and can compose entertaining and funny prose.
Millions of fans across the world have been waiting for more than 200 years to
know how the story of Pride and Prejudice continues with Elizabeth and Mr
Darcy. In this recipe, we'll be generating Pride and Prejudice 2 using a
transformer-based model.
Getting ready
At the time of writing, Jane Austen's romantic early-19th century novel Pride and Prejudice had by far the most downloads on Project Gutenberg over the last 30 days (more than 47,000). We'll download the book in plain text format:
We'll be working in Colab, where you'll have access to Nvidia T4 or Nvidia K80
GPUs. However, you can use your own computer as well, using either GPUs or
even CPUs.
If you are working in Colab, you'll need to upload your text file to your Google
Drive (https://drive.google.com), where you can access it from Colab.
We'll be using a wrapper library for OpenAI's GPT-2 that's called gpt-2-simple,
which is created and maintained by Max Woolf, a data scientist at BuzzFeed:
This library will make it easy to fine-tune the model to new texts and show us
text samples along the way.
We then have a choice of the size of the GPT-2 model. Four sizes of GPT-2 have
been released by OpenAI as pretrained models:
The large model cannot currently be fine-tuned in Colab, but can generate text
from the pretrained model. The extra large model is too large to load into
memory in Colab, and can therefore neither be fine-tuned nor generate text.
While bigger models will achieve better performance and have more knowledge,
they will take longer to train and to generate text.
How to do it...
We've downloaded the text of a popular novel, Pride and Prejudice, and we'll
first fine-tune the model, then we'll generate similar text to Pride and Prejudice:
1. Fine-tuning the model: We'll load a pre-trained model and fine-tune it for our
texts.
At this point, you'd need to authorize the Colab notebook to have access
to your Google Drive. We'll use the Pride and Prejudice text file that we
uploaded to our Google Drive before:
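A sketch of this fine-tuning step, based on the Colab workflow in the gpt-2-simple documentation; the text filename, the run name, and the step counts are placeholders to adapt to your own run:

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='124M')  # the small model; '355M' also fits in Colab
gpt2.mount_gdrive()                    # authorize access to your Google Drive
gpt2.copy_file_from_gdrive('pride_and_prejudice.txt')  # placeholder filename

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset='pride_and_prejudice.txt',
    model_name='124M',
    steps=1000,
    run_name='run1',
    sample_every=200,
    save_every=500,
)
gpt2.copy_checkpoint_to_gdrive(run_name='run1')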
2. Writing our new bestseller: We might need to get the model from Google
Drive and load it up into the GPU:
Please note that you might have to restart your notebook (Colab) again so
that the TensorFlow variables don't clash.
3. Now we can call a utility function in gpt-2-simple to generate the text into a
file. Finally, we can download the file:
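A sketch of steps 2 and 3, again following gpt-2-simple's documented workflow; the filenames and sampling parameters are placeholders:

import gpt_2_simple as gpt2
from google.colab import files

gpt2.copy_checkpoint_from_gdrive(run_name='run1')  # fetch the fine-tuned weights
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

gen_file = 'pride_and_prejudice_2.txt'  # placeholder output filename
gpt2.generate_to_file(
    sess,
    destination_path=gen_file,
    run_name='run1',
    length=1000,
    temperature=0.7,
    top_p=0.9,
    nsamples=5,
    batch_size=5,
)
files.download(gen_file)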
Pride and Prejudice – the saga continues. Reading the text, there are sometimes obvious flaws in continuity; however, some passages are captivating to read. We can always generate a few samples so that we have a choice of how our novel continues.
How it works...
In this recipe, we've used the GPT-2 model to generate text. This is called
neural story generation and is a subset of neural text generation. Simply put,
neural text generation is the process of building a statistical model of a text or of
a language and applying this model to generate more text.
One major choice we have to make in our text generation is how to sample, and
we have a few choices:
Greedy search
Beam search
Top-k sampling
Top-p (nucleus) sampling
In greedy search, we take the highest rated choice each time, ignoring other
choices. In contrast, rather than taking a high-scoring token, beam search tracks
the scores of several choices in parallel in order to take the highest-scored
sequence. Top-k sampling was introduced by Angela Fan and others
(Hierarchical Neural Story Generation, 2018). In top-k sampling, all but the k
most likely words are discarded. Conversely, in top-p (also called nucleus) sampling, the smallest set of highest-scoring tokens whose cumulative probability exceeds the threshold p is kept, while all other tokens are discarded. Top-k and top-p can be combined in order to avoid low-ranking words.
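As an illustration of the last two strategies, here is a sketch of top-k and top-p filtering over a vector of next-token logits (this is just the idea, not the gpt-2-simple internals):

import torch
import torch.nn.functional as F

def filter_logits(logits, top_k=0, top_p=0.0):
    """Mask out tokens outside the top-k and/or the top-p nucleus."""
    logits = logits.clone()
    if top_k > 0:
        # drop every token that is not among the k highest-scoring ones
        threshold = torch.topk(logits, top_k).values[-1]
        logits[logits < threshold] = float('-inf')
    if top_p > 0.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        to_remove = cumulative > top_p
        to_remove[1:] = to_remove[:-1].clone()  # keep the token crossing the threshold
        to_remove[0] = False                    # always keep the best token
        logits[sorted_idx[to_remove]] = float('-inf')
    return logits

# sample the next token from the filtered distribution
logits = torch.randn(50257)  # dummy vocabulary-sized logits
next_token = torch.multinomial(
    F.softmax(filter_logits(logits, top_k=40, top_p=0.9), dim=-1), 1)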
While the huggingface transformers library gives us all of these choices, with
gpt-2-simple, we have the choice of top-k sampling and top-p sampling.
See also
There are many fantastic libraries that make training a model or applying an off-the-shelf model much easier. First of all, perhaps, Hugging Face transformers: https://github.com/huggingface/transformers.
You can find a tutorial on text generation with recurrent neural networks in
the TensorFlow documentation:
https://www.tensorflow.org/tutorials/text/text_generation.
In this chapter, we'll deal with monitoring and model versioning, visualizations
as dashboards, and securing a model against malicious hacking attacks that could
leak user data.
Technical requirements
For Python libraries, we will work with models developed in TensorFlow and
PyTorch, and we'll apply different, more specialized libraries in each recipe.
Getting ready
We won't use the notebook in this recipe. Therefore, we've omitted the
exclamation marks in this code block. We'll be running everything from the
terminal.
Altair has a very pleasant declarative way to plot graphs, which we'll see in the
recipe. Streamlit is a framework to create data apps – interactive applications in
the browser with visualizations.
We'll be building a simple app for model building. This is meant to show how
easy it is to create a visual interactive application for the browser in order to
demonstrate findings to non-technical or technical audiences.
For a very quick, practical introduction to streamlit, let's look at how a few
lines of code in a Python script can be served.
Streamlit hello-world
We'll write our streamlit applications as Python scripts, not as notebooks, and
we'll execute the scripts with streamlit to be deployed.
We'll create a new Python file, let's say streamlit_test.py, in our favorite
editor, for example, vim, and we'll write these lines:
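A sketch of what streamlit_test.py could contain:

import streamlit as st

# a drop-down menu titled Hello with three options
chosen_option = st.selectbox('Hello', ['A', 'B', 'C'])
st.write(chosen_option)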
This would show a select box or drop-down menu with the title Hello and a
choice between options A, B, and C. This choice will be stored in the
chosen_option variable, which we can output in the browser.
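We then serve the script from the terminal with the standard streamlit invocation:

streamlit run streamlit_test.py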
This should open our browser in a new tab or window showing our drop-down
menu with the three choices. We can change the option, and the new value will
be displayed.
This should be enough for an introduction. We'll come to the actual recipe now.
The main idea of our data app is that we incorporate decisions such as modeling
choices into our application, and we can observe the consequences, both
summarized in numbers and visually in plots.
We'll start by implementing the core functionality, such as modeling and dataset
loading, and then we'll create the interface to it, first in the side panel, and then
the main page. We'll write all of the code in this recipe to a single Python script
that we can call visualizing_model_results.py:
We need to load datasets into memory. This can include a download step,
and for bigger datasets, downloading could potentially take a long time.
Therefore, we are going to cache this step to disk, so instead of
downloading every time there's a button-click, we'll retrieve it from the
cache on disk:
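A sketch of such a cached loader, assuming we stick to scikit-learn's built-in loaders for the three datasets used in this recipe:

import streamlit as st
from sklearn import datasets

@st.cache(persist=True)  # persist=True caches to disk rather than only in memory
def load_data(name):
    loaders = {
        'iris': datasets.load_iris,
        'wine': datasets.load_wine,
        'cover type': datasets.fetch_covtype,
    }
    data = loaders[name]()
    return data.data, data.target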
Here, the dataset loading might take some time. However, caching means
that we only have to load each dataset exactly once, because
subsequently the dataset will be retrieved from cache, and therefore
loading will be much faster. This caching functionality, which can be
applied to long-running functions, is central to making streamlit respond
more quickly.
In the side panel, we'll be presenting the choices of datasets, model type,
and hyperparameters. Let's start by choosing the dataset:
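For example (a sketch; the label and option names are ours):

import streamlit as st

dataset_name = st.sidebar.selectbox(
    'Dataset', ('iris', 'wine', 'cover type'))
X, y = load_data(dataset_name)  # the cached loader sketched above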
This will load the datasets after we've made the choice between iris,
wine, and cover type.
This shows you the menu options in the browser. We'll show the results
of these choices in the main part of the browser page.
Finally, we'll show a facet plot of variables plotted against each other in
scatter plots. This is the part where we use the altair library:
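A sketch of such a facet plot, assuming a test_df DataFrame that holds the test-set features plus a boolean correct column flagging misclassified points, and a features list of column names:

import altair as alt
import streamlit as st

chart = alt.Chart(test_df).mark_circle(size=20).encode(
    x=alt.X(alt.repeat('column'), type='quantitative'),
    y=alt.Y(alt.repeat('row'), type='quantitative'),
    color='correct:N',  # misclassified points stand out by color
).repeat(
    row=features[:3],
    column=features[:3],
)
st.altair_chart(chart)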
Incorrectly classified examples are highlighted in these plots. Again, we've
made this part optional, activated by marking a checkbox.
The upper part of the main page for the Covertype dataset looks like this:
You can see the classification report and the confusion matrix. Below these
(not part of the screenshot) would be the data exploration and the data plots.
This concludes our demo app. Our app is relatively simple, but hopefully this
recipe can serve as a guide for building these apps for clear communication.
How it works...
This book is about hands-on learning, and we'd recommend this for streamlit as
well. Working with streamlit, you have a quick feedback loop where you
implement changes and see the results, and you continue until you are happy
with what you see.
Streamlit provides a local server that you can access remotely over the browser
if you want. So you can run your streamlit application server on Azure, Google
Cloud, AWS, or your company cloud, and see your results in your local browser.
Streamlit's API has an integration for many plotting and graphing libraries.
These include Matplotlib, Seaborn, Plotly, and Bokeh; interactive plotting libraries such as Altair and Vega Lite; deck.gl for maps and 3D charts; and graphviz graphs.
Other integrations include Keras models, SymPy expressions, pandas
DataFrames, images, audio, and others.
Streamlit also comes with several types of widgets, such as sliders, buttons, and
drop-down menus. Streamlit also includes an extensible component system,
where each component consists of a browser frontend in HTML and JavaScript
and a Python backend, able to send and receive information bi-directionally.
Existing components interface with further libraries, including HiPlot, Echarts,
Spacy, and D3, to name but a few: https://www.streamlit.io/components.
You can play around with different inputs and outputs, you can start from
scratch, or you can improve on the code in this recipe. We could extend it to
show different results, build dashboards, connect to a database for live updates,
or build user feedback forms for subject matter experts to relay their judgment, for example, through annotations or approvals.
See also
Aside from streamlit, there are other libraries and frameworks that can help to
create interactive dashboards, presentations, and reports, such as Bokeh, Jupyter
Voilà, Panel, and Plotly Dash.
If you are looking for dashboarding and live charting with database integration,
tools such as Apache Superset come in handy: https://superset.apache.org/.
In this recipe, we'll build a small inference server from scratch, and we'll focus
on the technical challenges around bringing AI into production. We'll showcase
how to develop a proof of concept (POC) into a software solution that is fit for production: robust, able to scale to demand, quick to respond, and as fast to update as needed.
Getting ready
We'll have to switch between the terminal and the Jupyter environment in this
recipe. We'll create and log the model from the Jupyter environment. We'll
control the mlflow server from the terminal. We will note which one is
appropriate for each code block.
We'll use mlflow in this recipe. Let's install it from the terminal:
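A one-liner does it (assuming pip points at the environment you're using):

pip install mlflow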
We'll assume you have conda installed. If not, please refer to the Setting up a
Jupyter environment recipe in Chapter 1, Getting Started with Artificial
Intelligence in Python, for detailed instructions.
We can start our local mlflow server with a SQLite database for backend storage from the terminal like this:
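A sketch of the command, consistent with the settings discussed in the How it works... section of this recipe:

mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root file://$PWD/mlruns \
    --host 0.0.0.0 \
    --port 5000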
We can then access this server from our browser, where we'll be able to compare and check different experiments and see the metrics of our models.
In the There's more... section, we'll do a quick demo of setting up a custom API
using the FastAPI library. We'll quickly install this library as well:
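For example, together with uvicorn, which we can use to serve a FastAPI app (installing uvicorn alongside is an assumption of this sketch):

pip install fastapi uvicorn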
How to do it...
We'll build a simple model from a comma-separated values (CSV) file. We'll try different modeling options and compare them. Then we'll deploy this model:
We'll download a dataset as a CSV file and prepare for training. The
dataset chosen in this recipe is the Wine dataset, describing the quality of
wine samples. We'll download and read the wine-quality CSV file from
the UCI ML archive:
We split the data into training and test sets. The predicted column is
column quality:
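A sketch of this step with pandas and scikit-learn (whether you pick the red or the white wine file from the UCI archive does not matter for the recipe):

import pandas as pd
from sklearn.model_selection import train_test_split

csv_url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
           'wine-quality/winequality-red.csv')
data = pd.read_csv(csv_url, sep=';')  # note the semicolon separator

train, test = train_test_split(data, test_size=0.25, random_state=42)
train_x = train.drop(['quality'], axis=1)
test_x = test.drop(['quality'], axis=1)
train_y = train[['quality']]
test_y = test[['quality']]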
2. Training with different hyperparameters:
Before running our training, we need to register the mlflow library with
the server:
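A sketch of this registration (the experiment name is a placeholder):

import mlflow

mlflow.set_tracking_uri('http://0.0.0.0:5000')  # the server started from the terminal
mlflow.set_experiment('wine-quality')           # placeholder experiment name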
We set our server URI. We can also give our experiment a name.
Each time we run the training set with different options, MLflow can log
the results, including metrics, hyperparameters, the pickled model, and a
definition as MLModel that captures library versions and creation time.
In our training function, we train on our training data, extracting metrics
of our model over the test data. We need to choose the appropriate
hyperparameters and metrics for comparison:
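A sketch of such a training function, close to the MLflow tutorial that this recipe adapts; the choice of ElasticNet, and of alpha and l1_ratio as the two hyperparameters, is an assumption, and train_x, train_y, test_x, and test_y come from the earlier split:

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_model(alpha, l1_ratio):
    with mlflow.start_run():
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        model.fit(train_x, train_y)
        preds = model.predict(test_x)

        rmse = np.sqrt(mean_squared_error(test_y, preds))
        mae = mean_absolute_error(test_y, preds)
        r2 = r2_score(test_y, preds)
        print(f'alpha={alpha}, l1_ratio={l1_ratio}: '
              f'rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}')

        mlflow.log_param('alpha', alpha)
        mlflow.log_param('l1_ratio', l1_ratio)
        mlflow.log_metric('rmse', rmse)
        mlflow.log_metric('mae', mae)
        mlflow.log_metric('r2', r2)
        mlflow.sklearn.log_model(model, 'model')

train_model(alpha=0.5, l1_ratio=0.5)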
We fit the model, extract our metrics, print them to the screen, log them
to the mlflow server, and store the model artifact on the server as well.
After we've run this a number of times with different parameters, we
can go to our server, compare model runs, and choose a model for
deployment.
3. Deploying a model as a local server. We can compare our models in the
browser. We should be able to find our wine experiments under the
experiments tab on our server.
We can then compare different model runs in the overview table, or get
an overview plot for different hyperparameters, such as this:
This contour plot shows us the two hyperparameters we've changed against
the Mean Absolute Error (MAE).
We can then choose a model for deployment. We can see the run ID for our
best model. Deployment of a model to a server can be done from the
command line, for example, like this:
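For example (a sketch; replace the placeholder with the run ID of your chosen model):

mlflow models serve -m "runs:/<RUN_ID>/model" -p 1234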
We can pass data as JSON, for example, using curl, again from the terminal.
This could look as follows:
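For example, with the JSON in the pandas-split orientation accepted by the MLflow 1.x scoring server (the feature values here are made up):

curl -X POST http://127.0.0.1:1234/invocations \
    -H 'Content-Type: application/json; format=pandas-split' \
    -d '{"columns": ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"], "data": [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]}'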
With this, we've finished our demo of model deployment with mlflow.
How it works...
A microservice is a single service that is independently deployable, maintainable, and testable. Structuring
an application as a collection of loosely coupled microservices is called a microservice architecture.
Another route would be to package your model and glue code for deployment
within the existing enterprise backend of your company. This integration has
several alternatives:
MLflow has command-line, Python, R, Java, and REST API interfaces for uploading models to a model repository, logging model results (experiments), downloading models again to use them locally, controlling the server, and much more. It offers its own server, but also allows deployment to Azure ML, Amazon SageMaker, Apache Spark UDF, and RedisAI.
If you want to be able to access your mlflow server remotely, as is usually the case when using the model server as an independent service (microservice), you need to set the host to 0.0.0.0, as we've done in the recipe. By default, the local server will start up at http://127.0.0.1:5000.
If we want to access models, we need to switch from the default backend storage
(this is where metrics will be stored) to a database backend, and we have to
define our artifact storage using a protocol in the URI, such as
file://$PWD/mlruns for the local mlruns/ directory. We've enabled a SQLite
database for the backend, which is the easiest way (but probably not the best for
production). We could have chosen MySQL, Postgres, or another database
backend as well.
This is only part of the challenge, however, because models become stale or
might be unsuitable, facts we can only establish if we are equipped to monitor
model and server performance in deployment. Therefore, a note on monitoring is
in order.
Monitoring
For methods to detect outliers, please refer to the Discovering anomalies recipe
in Chapter 3, Patterns, Outliers, and Recommendations.
See also
While some tools support only one or a few modeling frameworks, others,
particularly BentoML and MLflow, support deploying models trained under all
major ML training frameworks such as FastAI, scikit-learn, PyTorch, Keras,
XGBoost, LightGBM, H2o, FastText, Spacy, and ONNX. Both of these further
provide maximum flexibility for anything created in Python, and they both have
a tracking functionality for monitoring.
Our recipe was adapted from the mlflow tutorial example. MLflow has many
more examples for different modeling framework integrations on GitHub:
https://github.com/mlflow/mlflow/.
Flask: https://palletsprojects.com/p/flask/
FastAPI: https://fastapi.tiangolo.com/
Using these, you can create endpoints that would take data such as images or
text and return a prediction.
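As a sketch of how such an endpoint could look with FastAPI (the input schema and the loaded model are assumptions; here the model is loaded with mlflow.sklearn.load_model() and the run ID is a placeholder):

from typing import List

import mlflow.sklearn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.sklearn.load_model('runs:/<RUN_ID>/model')  # placeholder run ID

class WineSample(BaseModel):
    features: List[float]  # the eleven wine features, in column order

@app.post('/predict')
def predict(sample: WineSample):
    prediction = model.predict([sample.features])
    return {'quality': float(prediction[0])}

# run with: uvicorn inference_server:app --host 0.0.0.0 --port 8000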
Getting ready
Later during the recipe, we'll use the analysis script to calculate the privacy
bounds of our model.
How to do it...
We'll have to define data loaders for teacher models and the student model. The
teacher and student architectures are the same in our case. We'll train the
teachers, and then we'll train the student from the aggregated teacher responses. We'll close with a privacy analysis, executing the script from the privacy repository.
1. Let's start by loading the data. We'll download the data using torch utility
functions:
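A sketch of the download step with torchvision; the data loaders for the individual teachers are then built on top of these datasets:

from torchvision import datasets, transforms

transform = transforms.ToTensor()  # converts images to torch.FloatTensor in [0, 1]

train_data = datasets.MNIST(root='data', train=True, download=True,
                            transform=transform)
test_data = datasets.MNIST(root='data', train=False, download=True,
                           transform=transform)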
This will load the MNIST dataset, and may take a moment. The
transform converts data to torch.FloatTensor. train_data and
test_data define the loaders for training and test data, respectively.
We define a training set for the student of 9,000 training samples and
1,000 test samples. Both sets are taken from the teachers' test dataset as
unlabeled training points – they will be labeled using the teacher
predictions:
2. Defining the models: We are going to define a single model for all the
teachers:
This is a convolutional neural network for image processing. Please refer
to Chapter 7, Advanced Image Applications, for more image processing
models.
Let's create another utility function for prediction from these models
given a dataloader:
This student data loader will be fed the aggregated teacher label:
This runs the student training.
Some parts of this code have been omitted from the training loop for the
sake of brevity. The validation looks like this:
The final training update reads as follows:
We see that it's a good model: 95.2 percent accuracy on the test dataset.
They provide a script to do this analysis based on the vote counts and the standard deviation of the noise that was used. We cloned this repository earlier, so we can change into a directory within it and execute the analysis script:
We need to save the aggregated teacher counts as a NumPy file. This can
then be loaded by the analysis script:
How it works...
We've created a set of teacher models from a dataset, and then we bootstrapped
from these teachers a student model that gives privacy guarantees. In this
section, we'll discuss some background about the problem of privacy in ML,
differential privacy, and how PATE works.
Leaking data about customers can bring great reputational damage to a company, not to mention fines from regulators for violating data protection and privacy laws such as the GDPR. Therefore, considering privacy in the creation of datasets and in ML is as important as ever. As a case in point, the data of 500,000 users from the well-known Netflix prize dataset for recommender development was de-anonymized by cross-referencing it with publicly available IMDb reviews.
While a combination of a few columns can give too much away about specific individuals (for example, an address or postcode together with an age would be a giveaway for anyone trying to trace data), ML models created on top of such datasets can be insecure as well. They can potentially leak sensitive information when hit by attacks such as membership inference attacks and model inversion attacks.
Membership inference attacks consist, roughly speaking, of recognizing differences in the target model's predictions
on inputs that it was trained on compared to inputs that it wasn't trained on. You can find out more about
membership attacks from the paper Membership Inference Attacks against Machine Learning Models (Reza
Shokri and others, 2016). They showed that off-the-shelf models provided as a service by Google and others
can be vulnerable to these attacks.
In inversion attacks, given API or black box access to a model and some demographic information, the
samples used in the model training can be reconstructed. In a particularly impressive example, faces used
for training facial recognition models were reconstructed. Of even greater concern, Matthew Fredrikson and others showed that models in personalized medicine can expose sensitive genomic information about individuals (2014).
Differential privacy
The concept of DP, first formulated by Cynthia Dwork and others in 2006
(Calibrating Noise to Sensitivity in Private Data Analysis), is the gold standard
for privacy in ML. It centers around the influence of individual data points on
the decisions of an algorithm. Roughly speaking, this implies, in turn, that any
output from the model wouldn't give away whether an individual’s information
was included. In DP, data is perturbed with noise drawn from a certain distribution. This can lead not only to safety against privacy attacks, but also to less overfitting.
The key is then to set an upper bound to require nearly identical behavior of the
mapping (or mechanism) on neighboring databases:
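In the usual formulation (a sketch of the standard definition; the notation in the original figure may differ), a mechanism M is (epsilon, delta)-differentially private if, for all neighboring databases D and D' and every set of outcomes S:

\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta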
In this formulation, the epsilon parameter is the multiplicative guarantee, and the
delta parameter the additive guarantee of probabilistically almost exact
outcomes. This means that the privacy cost an individual incurs as a result of their data being used is minimal. Delta privacy can be seen as the special case where epsilon is 0, and epsilon privacy as the case where delta is 0.
These guarantees are achieved by masking small changes in the input data. For
example, a simple routine for this masking was described by Stanley L. Warner
in 1965 (Randomized response: A survey technique for eliminating evasive
answer bias). Respondents in surveys answer sensitive questions such as Have you had an abortion? either truthfully or according to coin flips:
1. Flip a coin.
2. If tails, respond truthfully.
3. If heads, flip a second coin and respond yes if heads, or no if tails.
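A small simulation of this scheme (our own sketch; the aggregate estimate at the end follows from P(yes) = 0.5 * p + 0.25 for a true yes-rate p):

import random

def randomized_response(truth: bool) -> bool:
    """Warner-style randomized response for a yes/no question."""
    if random.random() < 0.5:       # first coin: tails -> answer truthfully
        return truth
    return random.random() < 0.5    # first coin: heads -> second coin decides

# estimate the true yes-rate from the noisy answers in aggregate
true_answers = [random.random() < 0.3 for _ in range(100000)]  # 30% true yes
noisy = [randomized_response(t) for t in true_answers]
p_yes = sum(noisy) / len(noisy)
print(2 * p_yes - 0.5)  # approximately 0.3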
Intuitively, accuracy decreases as the variance of the noise increases, so the variance has to be chosen small enough to provide good performance, but large enough to protect privacy.
The epsilon value depends on the aggregation, particularly the noise level, but
also on the context of the dataset and its dimensions. Please see How Much is Enough? Choosing ε for Differential Privacy (2011), by Jaewoo Lee and Chris Clifton, for a discussion.
See also
There are frameworks for both TensorFlow and PyTorch for encrypted ML: