Voulgaris, Bulut - AI For Data Science (AVG) (2018)
1. Introduction
1. About AI
2. AI facilitates data science
3. About the book
1. Optimization
2. Fuzzy inference systems
3. Artificial creativity
4. Additional AI methodologies
5. Glimpse into the future
6. About the methods
7. Summary
4. Chapter 3: Building a DL Network Using MXNet
1. Core components
2. MXNet in action
3. MXNet tips
4. Summary
1. TensorFlow architecture
2. Core components
3. TensorFlow in action
4. Visualization in TensorFlow: TensorBoard
5. High level APIs in TensorFlow: Estimators
6. Summary
1. Core components
2. Keras in action
3. Model Summary and Visualization
4. Converting Keras models to TensorFlow Estimators
5. Summary
1. PSO algorithm
2. Main PSO variants
3. PSO versus other optimization methods
4. PSO implementation in Julia
5. PSO in action
6. PSO tips
7. Summary
1. Big data
2. Specializations in data science
3. Publicly available datasets
4. Summary
1. Key terms
2. Reward hypothesis
3. Types of tasks
4. Reinforcement learning frameworks
1. Components
2. Extensions of conventional autoencoder models
3. Use cases and applications
1. Components
2. Training process
3. Pain points of a GAN model
22. Index
Landmarks
1. Cover
2 Lindsley Road
Basking Ridge, NJ 07920 USA
https://www.TechnicsPub.com
Cover design by Lorena Molinari
Edited by Sadie Hoberman and Lauren McCafferty
All rights reserved. No part of this book may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including photocopying, recording
or by any information storage and retrieval system, without written permission
from the publisher, except for the inclusion of brief quotations in a review.
The author and publisher have taken care in the preparation of this book, but make
no expressed or implied warranty of any kind and assume no responsibility for
errors or omissions. No liability is assumed for incidental or consequential damages
in connection with or arising out of the use of the information or programs
contained herein.
All trade and product names are trademarks, registered trademarks, or service
marks of their respective companies and are the property of their respective holders
and should be treated as such.
First Edition
First Printing 2018
Copyright © 2018 Yunus Emrah Bulut and Zacharias Voulgaris, PhD
ISBN, print ed. 9781634624091
ISBN, PDF ed. 9781634624121
Library of Congress Control Number: 2018951809
Introduction
There is no doubt that Artificial Intelligence (commonly abbreviated AI) is making
waves these days, perhaps more than the world anticipated as recently as the mid-
2010s. Back then, AI was an esoteric topic that was too math-heavy to attract the
average computer scientist, but now, it seems to be a household term. While it was
once considered sci-fi lingo, it’s now common to see and hear the term “AI”
featured in ads about consumer products like smart phones.
This is to be expected, though; once an idea or technology reaches critical mass, it
naturally becomes more acceptable to a wider audience, even if just on the
application level. This level refers to what AI can do for us, by facilitating certain
processes, or automating others. However, all this often gives rise to a series of
misunderstandings. As AI has become better known, various ominous predictions
about its potential dangers have also spread; these predictions are fueled by fear
and fantasy, rather than fact.
Just like every other new technology, AI demands to be discussed with a sense of
responsibility and ethical stature. An AI practitioner, especially one geared more
towards the practical aspects of the field, understands the technology and its
limitations, as well as the possible issues it has, which is why he talks about it
without hyperbole and with projections of measured scope – that is, he projects
realistic applications of AI, without talking about scenarios that resemble sci-fi
films. After all, the main issues stemming from the misuse of a technology like this
have more to do with the people using it, rather than the technology itself. If an AI
system is programmed well, its risks are mitigated, and its outcomes are predictably
positive.
About AI
But what exactly is AI? For starters, it’s nothing like what sci-fi books and films
make it out to be. Modern AI technology helps to facilitate various processes in a
more automatic and autonomous way, with little to no supervision from a human
user. AI initiatives are realistic and purely functional. Although we can dream
about what AI may evolve into someday, as AI practitioners, we focus on what we
know and what we are certain about, rather than what could exist in a few decades.
AI comprises a set of algorithms that make use of information – mainly in the form
of data – to make decisions and carry out tasks, much like a human would. Of
course, the emulation of human intelligence is not an easy task; as such, the AIs of
today are rudimentary and specialized. Despite their shortcomings, though, these
modern systems can be particularly good at the tasks they undertake, even better
than humans. For example, an AI system (a standalone program implementing one
or more AI algorithms) created to identify words from speech can be more accurate
at that task than humans.
It’s important to note that all the AI systems we have today possess what is termed
narrow artificial intelligence. This means that current AIs can do a limited set of tasks
(or even just a single task) quite well, but offer at best mediocre performance at any
other task. For instance, an AI might be great at figuring out your age based on a
headshot, but that same AI almost certainly couldn’t tell a classical music piece
from a pop song.
Some AIs are designed to be used in robots, such as those designed for rescue
missions, able to navigate various terrains. Other AIs are specialized in crunching
data and facilitating various data analytics tasks. There are even AIs that emulate
creative processes, like the AIs that generate artistic works, using the patterns they
deduce from catalogs of existing work. Chatbots and other such AIs are focused
solely on interacting with humans. The possibility of a more generalist AI (called
Artificial General Intelligence, or AGI) exists, but it may take a while before it can
manifest, or before we are ready to integrate it into our world.
Since all this may sound a bit abstract, let’s clarify it a bit. If a system can make
some decisions by capturing and analyzing signals related to the problem, that’s an
AI (sometimes termed “an AI system”). You’ve probably used an AI, even if you
didn’t know it. Online radios like Spotify and Pandora use AI to recommend songs,
and virtual assistants (like Siri) use AI to help you troubleshoot. Factors that help us
decide whether a system is AI include the system’s sophistication, its versatility,
and how able it is to perform complex tasks.
Professor Alan Turing was the first to talk about this topic in a scientific manner.
Upon studying this subject from both a theoretical and a practical perspective
(most famously through his work on the machines used to crack the Enigma code
in World War II1), he envisioned machines that could think and reason much
like humans.
One of Professor Turing’s most famous thought experiments is now named after
him. The Turing test is a simple yet powerful heuristic for determining if a computer
is advanced enough to manifest intelligence. In this test, a computer and a human
are both concealed from view. Another human, known as the examiner, then asks
each of them a series of questions, without knowing which is which. If the
examiner cannot determine from the answers to these questions whether he is
speaking with a human or a computer, then the computer is said to have passed the
test. This simple test has remained a standard
for AI, still adding value to related research in the field in various ways.2
The connections among the nodes are weighted, indicating the contribution of each
node to the nodes it is connected to in the next layer. The weights
are initially randomized, when the network object is created, but are refined as the
ANN is trained.
Moreover, each node contains a mathematical function that creates a transformation
of the received signal, before it is passed to the next layer. This is referred to as the
transfer function (also known as the activation function). The sigmoid function is the
most well-known of these, but others include softmax, tanh, and ReLU. We’ll delve
more into these in a moment.
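As a rough illustration (our own sketch in plain NumPy, not code from any DL framework), these common transfer functions look like this:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes into (-1, 1), zero-centered
    return np.tanh(z)

def relu(z):
    # passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

def softmax(z):
    # turns a vector of scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
```

Each of these maps a neuron's weighted input to its output signal; which one works best depends on the layer and the problem at hand.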
Furthermore, each layer has a bias node, which is a constant that appears unchanged
on each layer. Just like all the other nodes, the bias node has a weight attached to its
output. However, it has no transfer function. Its weighted value is simply added to
the other nodes it is connected to, much like a constant c is added to a regression
model in Statistics. The presence of such a term balances out any bias the other
terms inevitably bring to the model, ensuring that the overall bias in the model is
minimal. As the topic of bias is a very complex one, we recommend you check out
some external resources4 if you are not familiar with it.
Once the transformed inputs (features) and the biases arrive at the end of the DL
network, they are compared with the target variable. The differences that inevitably
occur are relayed back to the various nodes of the network, and the weights are
changed accordingly. Then the whole process is repeated until the error margin of
the outputs is within a certain predefined level, or until the maximum number of
iterations is reached. Iterations of this process are often referred to as training
epochs, and the whole process is intimately connected to the training algorithm
used. In fact, the number of epochs used for training a DL network is often set as a
parameter and it plays an important role in the ANN’s performance.
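The loop just described can be sketched schematically. The following toy example is our own illustration (not code from any DL framework): it trains a single linear neuron by gradient descent, stopping when the error falls below a predefined threshold or the maximum number of epochs is reached:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 data points, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                           # target variable

w = rng.normal(size=3)                   # weights start out randomized
lr, max_epochs, tol = 0.1, 500, 1e-6     # training parameters, incl. max epochs

for epoch in range(max_epochs):
    pred = X @ w                         # outputs of the network
    error = pred - y                     # compare outputs with the target
    mse = np.mean(error ** 2)
    if mse < tol:                        # error within the predefined level
        break
    grad = 2 * X.T @ error / len(y)      # relay the differences back
    w -= lr * grad                       # change the weights accordingly
```

Here each pass through the loop plays the role of one training epoch; real DL frameworks wrap exactly this cycle in far more elaborate machinery.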
All of the data entering a neuron (via connections with neurons of the previous
layer, as well as the bias node) is summed, and then the transfer function is applied
to the sum, so that the data flow from that node is y = f(Σᵢ wᵢxᵢ + b), where wᵢ is the
weight of the connection from node i of the previous layer, xᵢ is that node's output,
and b is the bias of that layer. Also, f() is the mathematical expression of the
transfer function.
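Under our reading of this formula, a single neuron's computation amounts to just a few lines (a minimal sketch; the weights, inputs, and choice of tanh below are arbitrary examples):

```python
import numpy as np

def neuron_output(x, w, b, f=np.tanh):
    # weighted sum of the inputs plus the bias term, passed through f()
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer's nodes
w = np.array([0.1, 0.4, -0.2])   # connection weights
b = 0.05                         # weighted contribution of the bias node
y = neuron_output(x, w, b)       # the neuron's output signal
```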
This relatively simple process is at the core of every ANN. The process is equivalent
to that which takes place in a perceptron system—a rudimentary AI model that
emulates the function of a single neuron. Although a perceptron system is never
used in practice, it is the most basic element of an ANN, and the first system created
using this paradigm.
The function of a single neuron is basically a single, predefined transformation of
the data at hand. This can be viewed as a kind of meta-feature of the framework, as
it takes a certain input x and after applying a (usually non-linear) function f() to it, x
is transformed into something else, which is the neuron’s output y.
While in the majority of cases one single meta-feature would be terrible at
predicting the target variable, several of them across several layers can work
together quite effectively – no matter how complex the mapping of the original
features to the target variable. The downside is that such a system can easily overfit,
which is why the training of an ANN doesn’t end until the error is minimal (smaller
than a predefined threshold).
This most rudimentary description of a DL network works for networks of the
multi-layer perceptron type. Of course, there are several variants beyond this type.
CNNs, for example, contain specialized layers with huge numbers of neurons,
while RNNs have connections that go back to previous layers. Additionally, some
training algorithms involve pruning nodes of the network to ensure that no
overfitting takes place.
Once the DL network is trained, it can be used to make predictions about any data
similar to the data it was trained on. Furthermore, its generalization capability is
quite good, particularly if the data it is trained on is diverse. What’s more, most DL
networks are quite robust when it comes to noisy data, which sometimes helps
them achieve even better generalization.
When it comes to classification problems, the performance of a DL system is
improved by the class boundaries it creates. Although many conventional ML
systems create straightforward boundary landscapes (e.g. rectangles or simple
curves), a DL system creates a more sophisticated line around each class
(reminiscent of the borders of certain counties in the US). This is because the DL
system is trying to capture every bit of signal it is given in order to make fewer
mistakes when classifying, boosting its raw performance. Of course, this highly
complex mapping of the classes makes interpretation of the results a very
challenging, if not unfeasible, task. More on that later in this chapter.
ETL processes
A DL framework can be helpful in fetching data from various sources, such as
databases and files. This is a rather time-consuming process if done manually, so
using a framework is very advantageous. The framework will also do some
formatting on the data, so that you can start using it in your model without too
much data engineering. However, doing some data processing of your own is
always useful, particularly if you have some domain knowledge.
Interpretability
Interpretability is the capability of a model to be understood in terms of its
functionality and its results. Although interpretability is often a given with
conventional data science systems, it is a pain point of every DL system. This is
because every DL model is a “black box,” offering little to no explanation for why it
yields the results it does. Unlike the framework itself, whose various modules and
their functionality is clear, the models developed by these frameworks are
convoluted graphs. There is no comprehensive explanation as to how the inputs
you feed them turn into the outputs they yield.
Although obtaining an accurate result through such a method may be enticing, it is
quite hard to defend, especially when the results are controversial or carry a
demographic bias. The reason for a demographic bias has to do with the data, by
the way, so no number of bias nodes in the DL networks can fix that, since a DL
network’s predictions can only be as good as the data used to train it. Also, the fact
that we have no idea how the predictions correspond to the inputs allows biased
predictions to slip through unnoticed.
However, this lack of interpretability may be resolved in the future. This may
require a new approach, but if there is one thing the progress of AI systems has
demonstrated over the years, it is that innovations are still possible and that
new architectures of models are still being discovered. Perhaps one of the newer DL
systems will have interpretability as one of its key characteristics.
Model maintenance
Maintenance is essential to every data science model. This entails updating or even
upgrading a model in production, as new data becomes available. Alternatively, the
assumptions of the problem may change; when this happens, model maintenance is
also needed. In a DL setting, model maintenance usually involves retraining the DL
network. If the retrained model doesn’t perform well enough, more significant
changes may be considered such as changing the architecture or the training
parameters. Whatever the case, this whole process is largely straightforward and
not too time-consuming.
How often model maintenance is required depends on the dataset and the problem
in general. Whatever the case, it is good to keep the previous model available too
when doing major changes, in case the new model has unforeseen issues. Also, the
whole model maintenance process can be automated to some extent, at least the
offline part, when the model is retrained as new data is integrated with the original
dataset.
Summary
Deep Learning is a particularly important aspect of AI, and has found a lot
of applications in data science.
Deep Learning employs a certain kind of AI system called an Artificial
Neural Network (or ANN). An ANN is a graph-based system involving a
series of (usually non-linear) operations, whereby the original features are
transformed into a few meta-features capable of predicting the target
variable more accurately than the original features.
The main frameworks in DL are MXNet, TensorFlow, and Keras, though
Pytorch and Theano also play roles in the whole DL ecosystem. Also, Knet
is an interesting alternative for those using Julia primarily.
There are various programming languages used in DL, including Python,
Julia, Scala, JavaScript, R, and C / C++. Python is the most popular.
A DL framework offers diverse functionality, including ETL processes,
building data models, deploying and evaluating models, and other
functions like creating visuals.
A DL system can be used in various data science methodologies, including
Classification, Regression, Reinforcement Learning, Dimensionality
Reduction, Clustering, and Sentiment Analysis.
Classification and regression are supervised learning methodologies,
while dimensionality reduction and clustering are unsupervised.
Reinforcement learning is a paradigm of its own, driven by reward
signals rather than labeled examples.
Applications of DL include making high-accuracy predictions for complex
problems; summarizing data into a more compact form; analyzing images,
sound, or video; natural language processing and sentiment analysis; text
prediction; linking images to captions; chatbots; and text summarization.
A DL framework needs to be assessed on various metrics (not just
popularity). Such factors include the programming languages it supports,
its raw performance, how well it handles ETL processes, the strength of its
documentation and user communities, and the need for future
maintenance.
It is not currently very easy to interpret DL results and trace them back to
specific features (i.e. DL results currently have low interpretability).
Giving more weight to raw performance or interpretability can help you
decide whether a DL system or conventional data science system is ideal
for your particular problem. Other factors, like the amount of
computational resources at our disposal, are also essential for making this
decision.
Optimization
Optimization is the process of finding the maximum or minimum of a given
function (also known as a fitness function), by calculating the best values for its
variables (also known as a “solution”). Despite the simplicity of this definition,
optimization is not an easy process; it often involves constraints, as well as
complex relationships among the various variables. Even though some functions
can be optimized analytically, most functions we encounter in data science are not
as simple, requiring a more advanced technique.
Optimization systems (or optimizers, as they are often referred to) aim to optimize in
a systematic way, oftentimes using a heuristics-based approach. Such an approach
enables the AI system to use a macro level concept as part of its low-level
calculations, accelerating the whole process and making it more light-weight. After
all, most of these systems are designed with scalability in mind, so the heuristic
approach is most practical.
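To make the idea concrete, here is a deliberately simple heuristic optimizer of our own devising (plain random search), for illustration only; real optimizers such as PSO are far more sophisticated, but the interface — a fitness function in, a best solution out — is the same:

```python
import random

def random_search(fitness, bounds, n_iter=10000, seed=42):
    # heuristic optimizer: repeatedly sample candidate solutions
    # within the given bounds and keep the best one found so far
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_iter):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        fx = fitness(x)
        if fx < best_f:          # minimization: smaller fitness is better
            best_x, best_f = x, fx
    return best_x, best_f

# minimize f(x, y) = (x - 1)^2 + (y + 2)^2; the true minimum is at (1, -2)
sol, val = random_search(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2,
                         bounds=[(-5, 5), (-5, 5)])
```

No derivatives are computed anywhere; the heuristic simply trades a little precision for the ability to handle any fitness function it is given.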
Importance of optimization
Optimization is especially important in many data science problems—particularly
those involving a lot of variables that need to be fine-tuned, or cases where the
conventional tools don’t seem to work. In order to tackle more complex problems,
beyond classical methodologies, optimization is essential. Moreover, optimization is
useful for various data engineering tasks such as feature selection, in cases where
maintaining a high degree of interpretability is desired. We’ll investigate the main
applications of optimizers in data science later in this chapter.
Artificial creativity
Artificial creativity (AC) is a relatively new methodology of AI, where new
information is created based on relevant data it has been trained on. Its applications
span across various domains, including most of the arts, as well as industrial
design, and even data science.
This kind of AI methodology makes use of a specialized DL network that is trained
to develop new data that retains some characteristics of the data it was trained on.
When you feed such a specialized AI system some image data, and it’s been trained
on the artwork of a particular painter, it will produce new “artwork” that makes
use of the images it is fed, but using the painting patterns of the artist it is trained to
emulate. The results may not win any art prizes, but they are certainly interesting
and original!
It is particularly fascinating when an AC system creates poetry, based on the verse
of certain poets. The results can be indistinguishable from human-written verse.
Take for example the following piece of verse by the AI poet Deep Gimble I:6
Madness in her face and i
the world that I had seen
and when my soul shall be to see the night to be the same and
I am all the world and the day that is the same and a day I had been
a young little woman I am in a dream that you were in
a moment and my own heart in her face of a great world
and she said the little day is a man of a little
a little one of a day of my heart that has been in a dream.
In data science, AC can aid in the creation of new data, which is quite useful in
certain cases. This new data may not be particularly helpful as an expansion of the
training set for that ANN, but it can be especially useful in other ways. For example,
if the original data is sensitive like medical data, and contains too much personally
identifiable information (PII), you can generate new data using AC that, although very
similar to the original data, cannot be mapped back to a real individual.
In addition, data created from an AC system can be useful for different data models
—perhaps as a new test set or even part of their training set. This way it can offer
the potential for better generalization for these models, as there is more data
available for them to train or test on. This can be particularly useful in domains
where labeled data is hard to come by or is expensive to generate otherwise.
Additional AI methodologies
Beyond all the AI methodologies we’ve discussed so far, there exist several others
worth noting. These systems also have a role to play in data science, while their
similarities to DL systems make them easier to comprehend. Also, as the AI field is
constantly expanding, it’s good to be aware of all the new methodologies that pop
up.
The Extreme Learning Machine (or ELM) is an example of an alternative AI
methodology that hasn’t yet received the attention it deserves. Although they are
architecturally like DL networks, ELMs are quite distinct in the way they are
trained. In fact, their training is so unconventional that some people have
considered the whole approach borderline unscientific (the professor who came up
with ELMs has received a lot of criticism from other academics).
Instead of optimizing all the weights across the network, ELMs focus on just the
connections of the two last layers—namely the last set of meta-features and their
outputs. The rest of the weights keep their initial random values throughout.
Because only the weights of those last layers are optimized, the optimization is
extremely fast and very precise.
As a result, ELMs are the fastest network-based methodology out there, and their
performance is quite decent too. What’s more, they are quite unlikely to overfit,
which is another advantage. Despite its counter-intuitive approach, an ELM system
does essentially the same thing as a conventional DL system; instead of optimizing
all the meta-features it creates, though, it focuses on optimizing the way they work
together to form a predictive analytics model. We’ll talk more about ELMs in
Chapter 11.
Another new alternative AI methodology is Capsule Networks (CapsNets).
Although the CapsNet should be regarded as a member of the deep learning
methods family, its architecture and its optimization training method are quite
novel. CapsNets try to capture the relative relationships between the objects within
a relevant context. A CNN model that achieves high performance in image
recognition tasks may not necessarily be able to identify the same object from
different angles. CapsNets, though, capture those kinds of contextual relationships
quite well. Their performance on some tasks has already surpassed the leading
models by about 45%, which is quite astonishing. Considering their promising
future, we dedicate a section in Chapter 11 to discussing CapsNets.
Self-organizing Maps (SOMs) are a special type of AI system. Although they are also
ANNs of sorts, they are unique in function. SOMs offer a way to map the feature
space into a two-dimensional grid, so that it can be better visualized afterwards.
Since it doesn’t make use of a target variable, a SOM is an unsupervised learning
methodology; as such, it is ideal for data exploration.
SOMs have been successfully applied in various domains, such as meteorology,
oceanography, oil and gas exploration, and project prioritization. One key
difference SOMs have from other ANNs is that their learning is based on
competition instead of error correction. Also, their architecture is quite different, as
their various nodes are only connected to the input layer with no lateral
connections. This unique design was first introduced by Professor Kohonen, which
is why SOMs are also referred to as “Kohonen Maps.”
The Generative Adversarial Network (GAN) is a very interesting type of AI
methodology, geared towards optimizing a DL network in a rather creative way. A
GAN comprises two distinct ANNs. One is for learning, and the other is for
“breaking” the first one – finding cases where the predictions of the first ANN are
off. These systems are comparable to the “white hat” hackers of cybersecurity.
In essence, the second ANN creates increasingly more demanding challenges for
the first ANN, thereby constantly improving its generalization (even with a limited
amount of data). GANs are used for simulations as well as data science problems.
Their main field of application is astronomy, where a somewhat limited quantity of
images and videos of the cosmos is available to use in training. The idea of GANs
has been around for over a decade, but has only recently managed to gain
popularity; this is largely due to the amount of computational resources demanded
by such a system (just like any other DL-related AI system). A more detailed
description of GANs is available in Appendix D, while in Chapter 9 we’ll also
revisit this topic.
Artificial Emotional Intelligence (AEI) is another kind of AI that’s novel on both the
methodological as well as the application levels. The goal of AEI is to facilitate an
understanding of the emotional context of data (which is usually text-based) and to
assess it just like a human. Applications of AEI are currently limited to
comprehension; in the future, though, more interactive systems could provide a
smoother interface between humans and machines. There is an intersection between
AEI and ANNs, but some aspects of AEI make use of other kinds of ML systems on
the back end.
Summary
http://bit.ly/2Js4CKd.
CHAPTER 3
Core components
Gluon interface
Gluon is a simple interface for all your DL work using MXNet. You install it on your
machine just like any Python library:
pip install mxnet --pre --user
The main selling point of Gluon is that it is straightforward. It offers an abstraction
of the whole network building process, which can be intimidating for people new to
the craft. Also, Gluon is very fast, not adding any significant overhead to the
training of your DL system. Moreover, Gluon can handle dynamic graphs, offering
some malleability in the structure of the ANNs created. Finally, Gluon has an
overall flexible structure, making the development process for any ANN less rigid.
Naturally, for Gluon to work, you must have MXNet installed on your machine
(although you don’t need to if you are using the Docker container provided with
this book). This is achieved using the familiar pip command:
pip install mxnet --pre --user
Because of its utility and excellent integration with MXNet, we’ll be using Gluon
throughout this chapter, as we explore this DL framework. However, to get a better
understanding of MXNet, we’ll first briefly consider how you can use some of its
other functions (which will come in handy for one of the case studies we examine
later).
NDArrays
The NDArray is a particularly useful data structure that’s used throughout an
MXNet project. NDArrays are essentially NumPy arrays, but with the added
capability of asynchronous CPU processing. They are also compatible with
distributed cloud architectures, and can even utilize automatic differentiation,
which is particularly useful when training a deep learning system, but NDArrays
can be effectively used in other ML applications too. NDArrays are part of the
MXNet package, which we will examine shortly. You can import the NDArrays
module as follows:
from mxnet import nd
To create a new NDArray consisting of 4 rows and 5 columns, for example, you can
type the following:
nd.empty((4, 5))
The output will differ every time you run it, since the framework will allocate
whatever value it finds in the parts of the memory that it allocates to that array. If
you want the NDArray to have just zeros instead, type:
nd.zeros((4, 5))
To find the number of rows and columns of a variable having an NDArray assigned
to it, you can use the .shape attribute, just like in NumPy:
x = nd.empty((2, 7))
x.shape
Finally, if you want to find the total number of elements in an NDArray, you use
the .size attribute:
x.size
The operations in an NDArray are just like the ones in NumPy, so we won’t
elaborate on them here. Contents are also accessed in the same way, through
indexing and slicing.
Should you want to turn an NDArray into a more familiar data structure from the
NumPy package, you can use the asnumpy() function:
y = x.asnumpy()
Of course, the data assigned to a particular processing unit is not set in stone. It is
easy to copy data to a different location, linked to a different processing unit, using
the copyto() function:
y = x.copyto(mx.gpu(1))  # copy the data of NDArray x to the 2nd GPU
You can find the context of a variable through the .context attribute:
print(x.context)
It is often more convenient to define the context of both the data and the models,
using a separate variable for each. For example, say that your DL project uses data
that you want to be processed by the CPU, and a model that you prefer to be
handled by the first GPU. In this case, you’d type something like:
DataCtx = mx.cpu()
ModelCtx = mx.gpu(0)
If you want to perform some additional processes that make the MXNet experience
even better, it is highly recommended that you first install the following packages
on your computer:
graphviz (ver. 0.8.1 or later)
requests (ver. 2.18.4 or later)
numpy (ver. 1.13.3 or later)
You can learn more about the MXNet package through the corresponding GitHub
repository.8
MXNet in action
Now let’s take a look at what we can do with MXNet, using Python, on a Docker
image with all the necessary software already installed. We’ll begin with a brief
description of the datasets we’ll use, and then proceed to a couple specific DL
applications using that data (namely classification and regression). Upon mastering
these, you can explore some more advanced DL systems of this framework on your
own.
Datasets description
In this section we’ll introduce two synthetic datasets that we prepared to
demonstrate classification and regression methods. The first dataset is for
classification, and the other for regression. The reason we use synthetic datasets in
these exercises is to maximize our understanding of the data, so that we can
evaluate the results of the DL systems independently of data quality.
The first dataset comprises 4 variables, 3 features, and 1 labels variable. With
250,000 data points, it is adequately large for a DL network to work with. Its small
dimensionality makes it ideal for visualization (see Figure 2). It is also made to have
a great deal of non-linearity, making it a good challenge for any data model (though
not too hard for a DL system). Furthermore, classes 2 and 3 of this dataset are close
enough to be confusing, but still distinct. This makes them a good option for a
clustering application, as we’ll see later.
The second dataset is somewhat larger, comprising 21 variables—20 of which are
the features used to predict the last, which is the target variable. With 250,000 data
points, again, it is ideal for a DL system. Note that only 10 of the 20 features are
relevant to the target variable (which is a combination of these 10). A bit of noise is
added to the data to make the whole problem a bit more challenging. The
remaining 10 features are just random data that must be filtered out by the DL
model. Relevant or not, this dataset has enough features altogether to render a
dimensionality reduction application worthwhile. Naturally, due to its
dimensionality, we cannot plot this dataset.
Now, we can import the data like we’d normally do in a conventional DS project,
but this time store it in NDArrays instead of Pandas or NumPy arrays:
with open("../data/data1.csv") as f:
    data_raw = f.read()
lines = data_raw.splitlines() # split the data into separate lines
ndp = len(lines) # number of data points
nf = 3 # number of features
X = nd.zeros((ndp, nf), ctx=DataCtx)
Y = nd.zeros((ndp, 1), ctx=DataCtx)
for i, line in enumerate(lines):
    tokens = line.split(",") # comma-separated values: 3 features, then the label
    X[i, :] = nd.array([float(t) for t in tokens[:nf]], ctx=DataCtx)
    Y[i] = int(tokens[-1])
Now we can split the data into a training set and a testing set, so that we can use it
both to build and to validate our classification model:
import numpy as np # we'll be needing this package as well
r = 0.8 # proportion of the data used for training
N = ndp # total number of data points
n = int(np.round(N * r)) # number of training data points
data = nd.concat(X, Y, dim=1) # combine features and labels
train = data[:n, :] # training set partition
test = data[n:, :] # testing set partition
BatchSize = 64 # number of data points per batch
data_train = gluon.data.DataLoader(gluon.data.ArrayDataset(train[:, :3], train[:, 3]), batch_size=BatchSize, shuffle=True)
data_test = gluon.data.DataLoader(gluon.data.ArrayDataset(test[:, :3], test[:, 3]), batch_size=BatchSize, shuffle=True)
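The DataLoader objects above serve the data in shuffled minibatches during training. As a rough illustration of that batching behavior, here is a hypothetical pure-Python/NumPy sketch (our own code, not actual mxnet internals):

```python
# Hypothetical sketch of what a shuffling DataLoader does under the hood.
import numpy as np

def iterate_minibatches(features, labels, batch_size, shuffle=True):
    """Yield (features, labels) minibatches, optionally in random order."""
    indices = np.arange(len(features))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(features), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield features[batch_idx], labels[batch_idx]

X = np.random.rand(100, 3)              # 100 points, 3 features (like dataset 1)
y = np.random.randint(0, 3, size=100)   # 3 classes
batches = list(iterate_minibatches(X, y, batch_size=32))
print(len(batches))  # 4 -- three full batches of 32, plus a last batch of 4
```

Note that the last batch is smaller than the rest unless the dataset size divides evenly by the batch size.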
We’ll then need to repeat the same process to load the second dataset—this time
using data2.csv as the source file. Also, to avoid confusion with the dataloader
objects of dataset 1, you can name the new dataloaders data_train2 and data_test2,
respectively.
Classification
Now let’s explore how we can use this data to build an MLP system that can discern
the different classes within the data we have prepared. For starters, let’s see how to
do this using the mxnet package on its own; then we’ll examine how the same thing
can be achieved using Gluon.
First, let’s define some constants that we’ll use later to build, train, and test the MLP
network:
nhn = 256 # number of hidden nodes for each layer
WeightScale = 0.01 # scale multiplier for weights
ModelCtx = mx.cpu() # assign context of the model itself
no = 3 # number of outputs (classes)
ne = 10 # number of epochs (for training)
lr = 0.001 # learning rate (for training)
sc = 0.01 # smoothing constant (for training)
ns = test.shape[0] # number of samples (for testing)
Next, let’s initialize the network’s parameters (weights and biases) for the first layer:
W1 = nd.random_normal(shape=(nf, nhn), scale=WeightScale,
ctx=ModelCtx)
b1 = nd.random_normal(shape=nhn, scale=WeightScale,
ctx=ModelCtx)
Then let's do the same for the second hidden layer and the output layer, and aggregate all the parameters into a single data structure called params:
W2 = nd.random_normal(shape=(nhn, nhn), scale=WeightScale, ctx=ModelCtx)
b2 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)
W3 = nd.random_normal(shape=(nhn, no), scale=WeightScale, ctx=ModelCtx)
b3 = nd.random_normal(shape=no, scale=WeightScale, ctx=ModelCtx)
params = [W1, b1, W2, b2, W3, b3]
Finally, let’s allocate some space for a gradient for each one of these parameters:
for param in params:
    param.attach_grad()
Remember that without any non-linear functions in the MLP’s neurons, the whole
system would be too rudimentary to be useful. We’ll make use of the ReLU and the
Softmax functions as activation functions for our system:
def relu(X):
    return nd.maximum(X, nd.zeros_like(X))
def softmax(y_linear):
    exp = nd.exp(y_linear - nd.max(y_linear))
    partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
    return exp / partition
Note that the Softmax function will be used in the output neurons, while the ReLU
function will be used in all the remaining neurons of the network.
For the cost function of the network (or, in other words, the fitness function of the
optimization method under the hood), we’ll use the cross-entropy function:
def cross_entropy(yhat, y):
    return -nd.nansum(y * nd.log(yhat), axis=0, exclude=True)
To make the whole system a bit more efficient, we can combine the softmax and the
cross-entropy functions into one, as follows:
def softmax_cross_entropy(yhat_linear, y):
    return -nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
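To see why the combined form is equivalent to applying softmax and then cross-entropy, and why subtracting the maximum helps numerically, here is a hypothetical NumPy sketch; the function names mirror the ones above, but the implementation is our own, not mxnet's:

```python
# NumPy sketch (not mxnet): the combined log-softmax cross-entropy equals
# the two-step version softmax -> cross-entropy.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(yhat, y):
    return -np.sum(y * np.log(yhat))

def softmax_cross_entropy(z, y):
    # log-softmax computed directly: z - max(z) - log(sum(exp(z - max(z))))
    log_sm = z - np.max(z) - np.log(np.sum(np.exp(z - np.max(z))))
    return -np.sum(y * log_sm)

z = np.array([2.0, 1.0, 0.1])   # raw (linear) outputs of the output layer
y = np.array([1.0, 0.0, 0.0])   # one-hot encoded label
print(np.isclose(cross_entropy(softmax(z), y), softmax_cross_entropy(z, y)))  # True
```

The combined form avoids exponentiating and then taking a logarithm, which can overflow or lose precision for large raw outputs.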
After all this, we can now define the function of the whole neural network, based on
the above architecture:
def net(X):
    h1_linear = nd.dot(X, W1) + b1
    h1 = relu(h1_linear)
    h2_linear = nd.dot(h1, W2) + b2
    h2 = relu(h2_linear)
    yhat_linear = nd.dot(h2, W3) + b3
    return yhat_linear
The optimization method for training the system must also be defined. In this case
we’ll utilize a form of Gradient Descent:
def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad
    return params
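The update rule above is simply param := param - lr * grad. As a sanity check, here is a hypothetical pure-Python sketch of the same rule minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3):

```python
# Pure-Python sketch of the SGD update rule: param := param - lr * grad.
def sgd_step(param, grad, lr):
    return param - lr * grad

x = 0.0
for _ in range(200):
    x = sgd_step(x, 2 * (x - 3), lr=0.1)  # gradient of (x - 3)^2 is 2*(x - 3)
print(round(x, 4))  # 3.0 -- the update rule converges to the minimum
```

In the network above, the same step is applied to every weight and bias, with the gradients supplied by autograd.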
For the purposes of this example, we’ll use a simple evaluation metric for the model:
accuracy rate. Of course, this needs to be defined first:
def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()
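The metric boils down to comparing the argmax of the network's outputs with the true labels and averaging the matches. A hypothetical NumPy sketch of that calculation:

```python
# NumPy sketch of the accuracy computation: argmax over class scores,
# compared against integer labels, averaged over the batch.
import numpy as np

outputs = np.array([[0.1, 0.7, 0.2],    # predicted scores per class
                    [0.8, 0.1, 0.1],
                    [0.2, 0.3, 0.5],
                    [0.9, 0.05, 0.05]])
labels = np.array([1, 0, 2, 1])         # true class indices
predictions = np.argmax(outputs, axis=1)
accuracy = np.mean(predictions == labels)
print(accuracy)  # 0.75 -- three of the four predictions are correct
```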
Finally, we can use the system to make some predictions, using the following code:
def model_predict(net, data):
    output = net(data)
    return nd.argmax(output, axis=1)
SampleData = mx.gluon.data.DataLoader(mx.gluon.data.ArrayDataset(test[:, :3], test[:, 3]), ns, shuffle=True)
for i, (data, label) in enumerate(SampleData):
    data = data.as_in_context(ModelCtx)
    pred = model_predict(net, data)
    print("model predictions are:", pred)
    print("true labels :", label)
    break
If you run the above code (preferably in the Docker environment provided), you
will see that this simple MLP system does a good job at predicting the classes of
some unknown data points—even if the class boundaries are highly non-linear.
Experiment with this system more and see how you can improve its performance
even further, using the MXNet framework.
Now we'll see how we can significantly simplify all this by employing the Gluon interface. First, let's define a Python class to cover some common cases of multi-layer perceptrons, by subclassing gluon.Block; the resulting object can be leveraged to gradually build a neural network consisting of multiple layers (an MLP):
class MLP(gluon.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense0 = gluon.nn.Dense(64) # architecture of 1st layer (hidden)
            self.dense1 = gluon.nn.Dense(64) # architecture of 2nd layer (hidden)
            self.dense2 = gluon.nn.Dense(3) # architecture of 3rd layer (output)
    def forward(self, x): # pass the data (x) forward, towards the output layer
        x = nd.relu(self.dense0(x)) # outputs of first hidden layer
        x = nd.relu(self.dense1(x)) # outputs of second hidden layer
        x = self.dense2(x) # outputs of final layer (output)
        return x
Of course, this is just an example of how you can define an MLP using Gluon, not a
one-size-fits-all kind of solution. You may want to define the MLP class differently,
since the architecture you use will have an impact on the system’s performance.
(This is particularly true for complex problems where additional hidden layers
would be useful.) However, if you find what follows too challenging, and you don’t
have the time to assimilate the theory behind DL systems covered in Chapter 1, you
can use an MLP object like the one above for your project.
Since DL systems are rarely as compact as the MLP above, and since we often need
to add more layers (which would be cumbersome in the above approach), it is
common to use a different class called Sequential. After we define the number of
neurons in each hidden layer, and specify the activation function for these neurons,
we can build an MLP like a ladder, with each step representing one layer in the
MLP:
nhn = 64 # number of hidden neurons (in each layer)
af = "relu" # activation function to be used in each neuron
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(no))
This takes care of the architecture for us. To make the above network functional,
we’ll first need to initialize it:
sigma = 0.1 # sigma for the distribution of the weights of the ANN connections
ModelCtx = mx.cpu()
lr = 0.01 # learning rate
oa = "sgd" # optimization algorithm
net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), oa, {"learning_rate": lr})
ne = 10 # number of epochs for training
Next, we must define how we assess the network’s progress, through an evaluation
metric function. For the purposes of simplicity, we’ll use the standard accuracy rate
metric:
def AccuracyEvaluation(iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]
Finally, it's time to train and test the MLP, using the aforementioned settings:
for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()
    train_accuracy = AccuracyEvaluation(data_train, net)
    test_accuracy = AccuracyEvaluation(data_test, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss / n, train_accuracy, test_accuracy))
Running the above code should yield similar results to those from conventional
mxnet commands.
To make things easier, we’ll rely on the Gluon interface in the example that follows.
Nevertheless, we still recommend that you experiment with the standard mxnet
functions afterwards, should you wish to develop your own architectures (or better
understand the theory behind DL).
Regression
Creating a regression MLP system is similar to creating a classification one, with some differences. If anything, the regression case is simpler, since regressors are typically lighter architecturally than classifiers. For this example, we'll use the second dataset.
First, let’s start by importing the necessary classes from the mxnet package and
setting the context for the model:
import mxnet as mx
from mxnet import nd, autograd, gluon
ModelCtx = mx.cpu()
To load data to the model, we’ll use the dataloaders created previously (data_train2
and data_test2). Let’s now define some basic se ings and build the DL network
gradually:
nf = 20 # we have 20 features in this dataset
sigma = 1.0 # sigma for the distribution of the weights of the ANN connections
net = gluon.nn.Dense(1, in_units=nf) # the "1" is the number of output neurons, which is 1 in regression
Let’s now initialize the network with some random values for the weights and
biases:
net.collect_params().initialize(mx.init.Normal(sigma=sigma),
ctx=ModelCtx)
Just like any other DL system, we need to define the loss function. Through this function, the system understands the cost of each deviation from the target variable's values. Cost functions can also account for the complexity of a model (since overly complex models can cost us in terms of overfitting):
square_loss = gluon.loss.L2Loss()
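For intuition, here is a hypothetical NumPy sketch of what such an L2 loss computes; note that Gluon's L2Loss includes a factor of 1/2 in its definition:

```python
# NumPy sketch (not gluon) of an L2 loss with the 1/2 factor:
# mean over the batch of 0.5 * (prediction - target)^2.
import numpy as np

def l2_loss(yhat, y):
    return np.mean(0.5 * (yhat - y) ** 2)

y = np.array([1.0, 2.0, 3.0])      # targets
yhat = np.array([1.5, 2.0, 2.0])   # predictions
print(l2_loss(yhat, y))  # (0.125 + 0.0 + 0.5) / 3, roughly 0.2083
```

The 1/2 factor is a convention that makes the gradient of the loss simply (yhat - y); it does not change where the minimum lies.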
Now it's time to train the network using the data at hand. After we define some essential parameters, including a trainer (just like in the classification case), we can create a loop for the network to train:
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.01})
ne = 10 # number of epochs for training
loss_sequence = [] # cumulative loss for the various epochs
nb = ns / BatchSize # number of batches
for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train2): # inner loop
        data = data.as_in_context(ModelCtx)
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(BatchSize)
        cumulative_loss += nd.mean(loss).asscalar()
    print("Epoch %s, loss: %s" % (e, cumulative_loss / ns))
    loss_sequence.append(cumulative_loss)
If you wish to view the parameters of the model, you can do so by collecting them
into a dictionary structure:
params = net.collect_params()
for param in params.values():
    print(param.name, param.data())
Printing out the parameters is not usually that useful, as there tend to be too many of them, especially when we add new layers to the system. Adding a layer is something we'd accomplish as follows:
net.add(gluon.nn.Dense(nhn))
where nhn is the number of neurons in that additional hidden layer. Note that the
network requires an output layer with a single neuron, so be sure to insert any
additional layers between the input and output layers.
We'll then save the data, but let's put some of it into a dictionary first:
MyDict = {"X": X, "Y": Y} # avoid naming this "dict", so as not to shadow the Python built-in
Now we'll set the name of the file and save it:
filename = "test.dat"
nd.save(filename, MyDict)
We can verify that everything has been saved properly by loading that checkpoint
as follows:
Z = nd.load(filename)
print(Z)
When using gluon, there is a shortcut for saving all the parameters of the DL
network we have developed. It involves the save_params() function:
filename = "MyNet.params"
net.save_params(filename)
To restore the DL network, however, you’ll need to recreate the original network’s
architecture, and then load the original network’s parameters from the
corresponding file:
net2 = gluon.nn.Sequential()
with net2.name_scope():
    net2.add(gluon.nn.Dense(nhn, activation="relu"))
    net2.add(gluon.nn.Dense(nhn, activation="relu"))
    net2.add(gluon.nn.Dense(no))
net2.load_params(filename, ctx=ModelCtx)
It's best to save your work at different parts of the pipeline, and to give the checkpoint files descriptive names. It is also important to keep in mind that there is no "untraining" option, and that optimal performance is often reached before the training phase completes. Because of this, we may want to create a checkpoint after each training epoch, so that we can revert to whichever epoch turns out to have achieved the optimal performance.
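The bookkeeping behind this strategy can be sketched in plain Python; the save_params call in the comment marks where the actual checkpointing would go, and the accuracy values are made up for illustration:

```python
# Pure-Python sketch of per-epoch checkpoint bookkeeping: record each
# epoch's validation score, then recover which checkpoint to keep.
history = []  # (epoch, validation_accuracy) pairs, one per saved checkpoint

for epoch, val_acc in enumerate([0.71, 0.78, 0.83, 0.81, 0.79]):
    # in a real run: net.save_params("epoch_%d.params" % epoch)
    history.append((epoch, val_acc))

best_epoch, best_acc = max(history, key=lambda pair: pair[1])
print(best_epoch, best_acc)  # 2 0.83 -- revert to the epoch-2 checkpoint
```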
Moreover, for the computer to make sense of these files when you load them in
your programming environment, you’ll need to have the nd class of mxnet in
memory, in whatever programming language you are using.
MXNet tips
The MXNet framework is a very robust and versatile platform for a variety of DL
systems. Although we demonstrated its functionality in Python, it is equally
powerful when used with other programming languages.
In addition, the Gluon interface is a useful add-on. If you are new to DL
applications, we recommend you use Gluon as your go-to tool when employing the
MXNet framework. This doesn’t mean that the framework itself is limited to Gluon,
though, since the mxnet package is versatile and robust in a variety of
programming platforms.
Moreover, in this chapter we covered just the basics of MXNet and Gluon. Going
through all the details of these robust systems would take a whole book! Learn
more about the details of the Gluon interface in the Straight Dope tutorial, which is
part of the MXNet documentation.9
Finally, the examples in this chapter are executed in a Docker container; as such, you may experience some lag. When a DL system is developed on a computer cluster, it naturally runs significantly faster.
Summary
https://bit.ly/2uweNb0.
https://github.com/apache/incubator-mxnet.
http://gluon.mxnet.io/index.html.
CHAPTER 4
command would install TensorFlow for Python 3. However, if both Python 2 and Python 3 are installed on your computer, then the command above might install TensorFlow for Python 2. In that case, you can use the following command to install TensorFlow for Python 3:
pip3 install --upgrade tensorflow
The TensorFlow framework is now installed for you to explore. In your code, import TensorFlow to use it:
import tensorflow
If you wish, you can alias it as tf. We will do this throughout the chapter, as it is the convention in the community:
import tensorflow as tf
TensorFlow architecture
The basic architecture of TensorFlow is shown in Figure 3. TensorFlow is designed
as a distributed system by nature, so it is quite easy to run TensorFlow models in
distributed se ings. The TensorFlow Distributed Execution Engine is responsible
for handling this capability of TensorFlow. As we mentioned before, TensorFlow
models can be run on top of CPUs and GPUs. However, other computation units
are also available to use. Recently, Google announced Tensor Processing Units (TPUs), which are designed to swiftly run TensorFlow models. You can even run TensorFlow on Android devices directly.
Although Python is the most commonly used language with TensorFlow, you can
use TensorFlow with C++, Java, Julia, Go, R, and more. TensorFlow includes two
relatively high-level abstraction modules called layers and datasets. The Layers
module provides methods that simplify the creation of fully connected layers,
convolutional layers, pooling layers, and more. It also provides methods for adding activation functions or applying dropout regularization. The Datasets
module includes capabilities to manage your datasets.
Higher-level APIs (like Keras or Estimators) are easier to use, and they provide the same functionality as these lower-level modules. Lastly, we should mention that
TensorFlow includes some pre-trained models out of the box.
Core components
To understand the core architecture of the TensorFlow framework, we introduce
some basic concepts. First, let’s begin with a fundamental design principle of
TensorFlow: TensorFlow is designed to work with “static graphs”. The
computational flow of your model will be converted to a graphical representation in
the framework before execution. The static graph in TensorFlow is the
computational graph and not the data. This means that before you run the code,
you must define the computational flow of your data. After that, all of the data that
is fed to the system will flow through this computational graph, even if the data
changes from time to time.
Let’s start with the basic concepts of the framework. The first concept you have to
understand is the “tensor” which is also included in the name of the framework.
Tensors are the units that hold the data. You can think of tensors as NumPy n-
dimensional arrays. The rank of the tensor defines the dimension, and the shape
defines the lengths of each dimension in tuple form. So
[ [1.0, 2.0, 3.0], [4.0, 5.0, 6.0] ]
is a rank-2 tensor whose shape is (2, 3). Now, consider the following code:
x = tf.constant(1.0)
y = tf.constant(2.0)
z = x + y
In the code above, we define two tensors x and y by the tf.constant operation. This operation takes 1.0 and 2.0 as inputs and just produces their tensor equivalents, nothing more. Then, using x and y, we create another tensor called z. Now, what do you expect from the code below?
print(z)
You are incorrect if you expect to see 3.0. Instead, you just see:
Tensor("add:0", shape=(), dtype=float32)
Defining graphs is different from executing statements. For now, z is just a tensor object with no value associated to it. We need to run the graph somehow, so that we can get 3.0 out of the tensor z. This is where another TensorFlow concept comes in: the session.
Sessions in TensorFlow are the objects that hold the state of the runtime where our
graph will be executed. We need to instantiate a session and then run the operations
we have already defined:
sess = tf.Session()
The code above instantiates the session object. Now, using that object, we can run
our operations:
print(sess.run(z))
and we get 3.0 from the print statement! When we run an operation (namely, a node in the graph), TensorFlow executes it by calculating the tensors that the operation takes as input. This involves a backward traversal of the nodes and tensors, until it reaches a natural starting point, just like our tf.constant operations above.
As you have already noticed, tf.constant simply provides constants as an operation;
it may not be suitable to work with external data. For these kinds of situations,
TensorFlow provides another object called the placeholder. You can think of
placeholders as arguments to a function. It is something that you’ll provide later on
in your code! For example:
k = tf.placeholder(tf.float32)
l = tf.placeholder(tf.float32)
m = k + l
This time we define k and l as placeholders; we will assign values to them when we run them in the session. Using the session above:
print(sess.run(m, feed_dict={k: 1.0, l: 2.0}))
will print 3.0. Here we used the feed_dict argument, a dictionary that passes values to the placeholders. Effectively, we pass 1.0 and 2.0 to the k and l placeholders, respectively, at runtime. You can also use the feed_dict parameter of the session's run method to override the values of tf.constant tensors.
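One way to internalize placeholders is the following plain-Python analogy (our own illustration, not TensorFlow code): the graph definition plays the role of a function definition, and sess.run with feed_dict plays the role of the call:

```python
# Plain-Python analogy (not TensorFlow): defining a graph with placeholders
# resembles defining a function, and running it with feed_dict resembles
# calling that function with concrete arguments.
def graph_m(k, l):        # k and l play the role of the placeholders
    return k + l          # m = k + l

# analogous to: sess.run(m, feed_dict={k: 1.0, l: 2.0})
print(graph_m(1.0, 2.0))  # 3.0
```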
We have seen that constants and placeholders are useful TensorFlow constructs to
store values. Another useful construct is the TensorFlow variable. One can think of a
variable as something that lies between constants and placeholders. Like
placeholders, variables do not have an assigned value. However, much like
constants, they can have default values. Here is an example of a TensorFlow
variable:
v = tf.Variable([0], tf.float32)
In the above line, we define a TensorFlow variable called v and set its default value
as 0. When we want to assign some value different than the default one, we can use
the tf.assign method:
w = tf.assign(v, [-1.])
It is crucial to know that TensorFlow variables are not initialized when defined.
Instead, we need to initialize them in the session like this:
init = tf.global_variables_initializer()
sess.run(init)
The code above initializes all the variables! As a rule of thumb, you should use
tf.constant to define constants, tf.placeholder to hold the data fed to your model,
and tf.Variable to represent the parameters for your model.
Now that we have learned the basic concepts of TensorFlow and demonstrated how
to use them, you are all set to use TensorFlow to build your own models.
TensorFlow in action
We’ll begin our TensorFlow exercises by implementing a DL classification model,
utilizing the elements of TensorFlow we covered in the last section.
The datasets we use to demonstrate TensorFlow are the same synthetic datasets we used in the previous chapter. We use them for classification and regression purposes in this chapter as well. Remember that those datasets, as well as the code we go over in this section, are already provided in the Docker image distributed with this book. You can run that Docker image to access the datasets and the source code of this chapter.
Classification
Before we begin to implement our classifier, we need to import some libraries to use
them. Here are the libraries we need to import:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
First, we should load the dataset and do a bit of preprocessing to format the data
we’ll use in our model. As usual, we load the data as a list:
# import the data
with open("../data/data1.csv") as f:
data_raw = f.read()
# split the data into separate lines
lines = data_raw.splitlines()
Then, we separate the labels and the three features into lists, called “labels” and
“features”:
labels = []
features = []
for line in lines:
    tokens = line.split(",")
    labels.append(int(tokens[-1]))
    x1, x2, x3 = float(tokens[0]), float(tokens[1]), float(tokens[2])
    features.append([x1, x2, x3])
Next, we make dummy variables of the three label categories, using Pandas’
get_dummies function:
labels = pd.get_dummies(pd.Series(labels))
After this, the labels variable holds one indicator (dummy) column per class.
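For intuition, here is a hypothetical NumPy sketch of the one-hot (dummy) encoding that get_dummies performs for integer labels 1 through 3:

```python
# NumPy sketch of one-hot (dummy) encoding for class labels 1..3.
import numpy as np

labels = np.array([1, 3, 2, 1])
one_hot = np.eye(3)[labels - 1]  # row i selects the indicator row for class labels[i]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```

Each row contains exactly one 1, in the column that corresponds to that row's class.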
The next step is to split our data into train and test sets. For this purpose, we use scikit-learn's train_test_split function, which we imported before:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
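Under the hood, such a split amounts to shuffling indices and cutting them at the desired proportion. A hypothetical NumPy sketch of an 80/20 split:

```python
# NumPy sketch of an 80/20 train/test split like the one train_test_split
# performs, with shuffling controlled by a seed.
import numpy as np

rng = np.random.RandomState(42)
n = 100
indices = rng.permutation(n)       # shuffled indices 0..99
cut = int(n * 0.8)                 # 80% for training
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # 80 20
```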
We’re now ready to build up our model using TensorFlow. First, we define the
hyperparameters of the model that are related with the optimization process:
# Parameters
learning_rate = 0.1
epoch = 10
Next, we define the hyperparameters that are related with the structure of the
model:
# Network Parameters
n_hidden_1 = 16 # 1st layer number of neurons
n_hidden_2 = 16 # 2nd layer number of neurons
num_input = 3 # number of input features
num_classes = 3 # total number of classes
# tf Graph input
X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])
Next, we store the model parameters in two dictionaries:
# weights & biases
weights = {
'h1': tf.Variable(tf.random_normal([num_input, n_hidden_1])),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_hidden_2, num_classes]))
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1])),
'b2': tf.Variable(tf.random_normal([n_hidden_2])),
'out': tf.Variable(tf.random_normal([num_classes]))
}
We can now define our graph in TensorFlow. To that end, we provide a function:
# Create model
def neural_net(x):
    # Hidden fully connected layer with 16 neurons
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['h1']), biases['b1']))
    # Hidden fully connected layer with 16 neurons
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['h2']), biases['b2']))
    # Output fully connected layer with a neuron for each class
    out_layer = tf.add(tf.matmul(layer_2, weights['out']), biases['out'])
    # For visualization in TensorBoard
    tf.summary.histogram('output_layer', out_layer)
    return out_layer
This function takes the input data as an argument. Using this data, it first constructs
a hidden layer. In this layer, each input data point is multiplied by the weights of
the first layer, and added to the bias terms. Using the output of this layer, the
function constructs another hidden layer. Similarly, this second layer multiplies the
output of the first layer with the weights of its own and adds the result to the bias
term. Then the output of the second layer is fed into the last layer which is the
output layer of the neural network. The output layer does the same thing as the
previous layers. As a result, the function we define just returns the output of the last
layer.
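The shapes involved in this forward pass can be checked with a hypothetical NumPy sketch of the same 3-16-16-3 architecture (our own toy code, not the TensorFlow graph):

```python
# NumPy sketch of the forward-pass shapes in a 3-16-16-3 MLP:
# each layer is a matmul plus bias, with ReLU between the hidden layers.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(5, 3)                     # batch of 5 points, 3 features
W1, b1 = rng.randn(3, 16), rng.randn(16)
W2, b2 = rng.randn(16, 16), rng.randn(16)
W3, b3 = rng.randn(16, 3), rng.randn(3)

relu = lambda z: np.maximum(z, 0)
layer_1 = relu(x @ W1 + b1)
layer_2 = relu(layer_1 @ W2 + b2)
out = layer_2 @ W3 + b3
print(out.shape)  # (5, 3): one score per class for each point in the batch
```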
After this, we can define our loss function, optimization algorithm, and the metric
we will use to evaluate our model:
# Construct model
logits = neural_net(X)
# Define the loss and the optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
# Evaluate the model
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
# For TensorBoard
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter("events")
# Initialize the variables (assign their default value)
init = tf.global_variables_initializer()
As our loss function, we use the cross-entropy loss with softmax. Apart from this, TensorFlow offers other pre-built loss functions, such as sigmoid_cross_entropy_with_logits, sparse_softmax_cross_entropy_with_logits, and weighted_cross_entropy_with_logits.
Adam is one of the most commonly used optimization algorithms in the machine
learning community. Some other optimizers available in TensorFlow are:
GradientDescentOptimizer, AdadeltaOptimizer, AdagradOptimizer,
MomentumOptimizer, FtrlOptimizer, and RMSPropOptimizer.
Accuracy is our evaluation metric, as usual.
Now it’s time to train our model!
with tf.Session() as sess:
print(“Optimization Finished!”)
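The body of the session block typically iterates over epochs and batches, running one optimization step per batch and accumulating the loss. A framework-agnostic pure-Python sketch of that pattern, with a hypothetical train_step stub standing in for a call like sess.run([train_op, loss_op], feed_dict=...):

```python
# Framework-agnostic sketch of a training loop: epochs over batches,
# one optimization step per batch, cumulative loss tracked per epoch.
def train(batches, epochs, train_step):
    epoch_loss = 0.0
    for epoch in range(epochs):
        epoch_loss = 0.0
        for batch_x, batch_y in batches:
            # train_step stands in for one optimizer step returning the loss
            epoch_loss += train_step(batch_x, batch_y)
        print("Epoch %d, loss: %.4f" % (epoch, epoch_loss))
    print("Optimization Finished!")
    return epoch_loss

# toy stand-in: a "loss" that simply shrinks with every step
losses = iter([3.0, 2.0, 1.5, 1.0])
train([(None, None), (None, None)], epochs=2, train_step=lambda bx, by: next(losses))
```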
Regression
Although today’s deep learning applications are quite successful in challenging
classification tasks, TensorFlow also enables us to build regression models in almost
the same manner. In this section, we’ll show you how to predict a continuous
outcome variable using regression.
It is critical to choose a different loss function than the one we used in the classification model, one more suitable to a regression task. We'll choose the L2 metric, as
it is one of the most popular metrics in regression analysis. In terms of evaluation,
we’ll use R-squared to assess the performance of our model in the test set.
We import the same libraries that we imported for the classification task:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
The dataset we use is the same synthetic set provided, with 20 features and 1
outcome variable. Below, we load the dataset and do some pre-processing to format
the data we’ll use in our model:
# import the data
with open("../data/data2.csv") as f:
data_raw = f.read()
# split the data into separate lines
lines = data_raw.splitlines()
We can now set the hyperparameters of the model regarding the optimization
process, and define the structure of our model:
# Parameters
learning_rate = 0.1
epoch = 500
# Network Parameters
n_hidden_1 = 64 # 1st layer number of neurons
n_hidden_2 = 64 # 2nd layer number of neurons
num_input = 20 # number of input features
num_classes = 1 # a single continuous output
# tf Graph input
X = tf.placeholder("float", [None, num_input])
Y = tf.placeholder("float", [None, num_classes])
This time, our outcome is single-value in nature, and we have 20 features. We set
the relevant parameters accordingly, above. Next, we store the model parameters in
two dictionaries as we did in the classification case:
# weights & biases
weights = {
'h1': tf.Variable(tf.random_normal([num_input, n_hidden_1])),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
'out': tf.Variable(tf.random_normal([n_hidden_2, num_classes]))
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1])),
'b2': tf.Variable(tf.random_normal([n_hidden_2])),
'out': tf.Variable(tf.random_normal([num_classes]))
}
It's time to define the structure of our model. The graph is almost the same as the graph of the classification model we used in the previous part:
# Create model
def neural_net(x):
    # Hidden fully connected layer with 64 neurons
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    # Hidden fully connected layer with 64 neurons
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    # Output fully connected layer
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
The difference between the classification model and the regression model is that the latter uses the L2 loss as its loss function. This is because the outcome of the regression model is continuous; as such, we must use a loss function capable of handling continuous loss values. We also use the Adam optimization algorithm in this regression model.
# Construct model
output = neural_net(X)
Another difference between our classification and regression models is the metric
we use to evaluate our model. For regression models, we prefer to use the R-
squared metric; it is one of the most common metrics used to assess the
performance of regression models:
# Evaluate model using R-squared
total_error = tf.reduce_sum(tf.square(tf.subtract(Y, tf.reduce_mean(Y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(Y, output)))
R_squared = tf.subtract(1.0, tf.div(unexplained_error, total_error))
print(“Optimization Finished!”)
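To make the formula concrete, here is the same R-squared computation in plain NumPy; this is a quick sanity check of the definition, not part of the book's TensorFlow graph:

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - (unexplained error / total error)
    total_error = np.sum((y_true - np.mean(y_true)) ** 2)
    unexplained_error = np.sum((y_true - y_pred) ** 2)
    return 1.0 - unexplained_error / total_error

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                     # perfect predictions -> 1.0
print(r_squared(y, np.full(4, y.mean())))  # mean predictor -> 0.0
```

A model that always predicts the mean scores 0, and a perfect model scores 1, which matches the interpretation of R-squared as the fraction of variance explained.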
tf.summary.merge_all: Summary nodes do not alter the graph of the model, but we
need them to run our summary operations. The tf.summary.merge_all function
merges all of our summary operations so that we do not need to run each operation
one by one.
tf.summary.FileWriter: This function is used to store the summary (which was
generated using the tf.summary.merge_all function) to the disk. Here is an example
of how to do that:
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter("events")
Once we define our file writers, we also need to initialize them inside the session:
train_writer.add_graph(sess.graph)
After we integrate summary functions to our code, we should write the summaries
to the files and visualize them. When we run our model, we also receive the
summaries:
summary, loss, acc = sess.run([merged, loss_op, accuracy],
                              feed_dict={X: batch_x, Y: batch_y})
Next, we can visualize the summaries in the browser. To do that, we need to run the
following command:
tensorboard --logdir=path/to/log-directory
where the log-directory refers to the directory where we saved our summary files.
When you open localhost:6006 in your browser, you should see the dashboard with
the summaries of your model, similar to the one in Figure 4.
High level APIs in TensorFlow: Estimators
So far, we've discussed the low-level structures of TensorFlow. We saw that we must build our own graph and keep track of the session. However, TensorFlow also provides a high-level API in which this tedious work is handled automatically. This high-level API is called "Estimators".
The Estimators API also provides pre-made estimators that you can use quickly and customize if needed. Here are some of the advantages of this API over the low-level APIs of TensorFlow:
You can implement the same model with fewer lines of code.
Building the graph, opening and closing the session, and initializing the variables are all handled automatically.
The same code runs on CPU, GPU, or TPU.
Parallel computing is supported. If multiple servers are available, the code you write with this API runs on them without any modification of the code you run on your local machine.
Summaries of the models are automatically saved for TensorBoard.
When you write your code using this API, you basically follow four steps:
1. Write one or more input functions to supply data to the model.
2. Define the model's feature columns.
3. Instantiate a pre-made (or custom) estimator.
4. Call the estimator's training, evaluation, and prediction methods.
Now we will demonstrate each of these steps using our synthetic data for
classification. First, we read the data from our .csv file, as usual:
# import the data
with open("../data/data1.csv") as f:
    data_raw = f.read()
lines = data_raw.splitlines()  # split the data into separate lines
labels = []
x1 = []
x2 = []
x3 = []
for line in lines:
    tokens = line.split(',')
    labels.append(int(tokens[-1]) - 1)
    x1.append(float(tokens[0]))
    x2.append(float(tokens[1]))
    x3.append(float(tokens[2]))
features = np.array([x1, x2, x3]).T  # transpose to shape (250000, 3)
labels = np.array(pd.Series(labels))
Second, we write a function that converts our features to a dictionary, and returns
the features and labels for the model:
def inputs(features, labels):
    features = {'x1': features[:, 0],
                'x2': features[:, 1],
                'x3': features[:, 2]}
    return features, labels
Third, we write a function that transforms our data into a DataSet object:
def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)
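The classifier below also expects a list of feature columns (my_feature_columns), whose definition does not appear in the snippets above. For our three numeric features, it would presumably be built along these lines (our reconstruction):

```python
# Hypothetical reconstruction: one numeric feature column per input feature.
my_feature_columns = [tf.feature_column.numeric_column(key=k)
                      for k in ['x1', 'x2', 'x3']]
```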
Before we run our model, we should select a pre-defined estimator that is suitable
for our needs. Since our task is classification, we use two fully-connected layers, as
we did previously. For this, the estimator’s API provides a classifier called
DNNClassifier:
# Build a DNN with 2 hidden layers and 256 nodes in each hidden layer.
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 256 nodes each.
    hidden_units=[256, 256],
    # The model must choose between 3 classes.
    n_classes=3,
    optimizer=tf.train.AdamOptimizer(
        learning_rate=0.1
    ))
As before, we defined two dense layers of size 256, we set the learning rate to 0.1,
and we set the number of classes to 3.
Now, we are ready to train and evaluate our model. Training is as simple as:
classifier.train(input_fn=lambda: train_input_fn(inputs(X_train, y_train)[0], inputs(X_train, y_train)[1], 64), steps=500)
We provided the function that we wrote above, which returns the DataSet object for
the model as an argument to the train() function. We also set training steps to 500,
as usual. When you run the code above, you should see something like:
INFO:tensorflow:loss = 43.874107, step = 401 (0.232 sec)
INFO:tensorflow:Saving checkpoints for 500 into
/tmp/tmp8xv6svzr/model.ckpt.
INFO:tensorflow:Loss for final step: 34.409817.
<tensorflow.python.estimator.canned.dnn.DNNClassifier at
0x7ff14f59b2b0>
After this, we can evaluate the performance of our model in our test set:
# Evaluate the model.
eval_result = classifier.evaluate(
input_fn=lambda:train_input_fn(inputs(X_test,y_test)[0],
inputs(X_test,y_test)[1], 64), steps=1)
print('Test set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Summary
You can also use Keras on top of Theano or CNTK, but using it on top of TensorFlow is by far the
most common usage in the industry.
CHAPTER 5
After you run the command above, the Keras deep learning framework should be
installed in your system. After importing it, you can use Keras in your Python code
as follows:
import keras
Keras abstracts away the low-level data structures of TensorFlow, replacing them with intuitive, easily integrated, and extensible structures. When designing this framework, the developers followed the guiding principles stated in the Keras documentation: user friendliness, modularity, easy extensibility, and working with Python.
Core components
The basic component in Keras is called the model. You can think of the Keras model
as an abstraction of a deep learning model. When we start to implement a DL model
in Keras, we usually begin by creating a so-called model object. Of the many types of
models in Keras, Sequential is the simplest and the most commonly used.
Another basic structure in Keras is called the layer. Layer objects in Keras represent
the actual layers in a DL model. We can add layer objects to our model object by just
defining the type of the layer, the number of units, and the input/output sizes. The
most commonly used layer type is the Dense layer.
And that is all! You might think we have omitted some other critical parts of the Keras framework; but as you'll see below, you can start building your models with just what you've learned so far!
Keras in action
Now it’s time to see Keras in action. Remember that the datasets and the codes
examined in this section are available to you via the Docker image provided with
the book.
The datasets used to demonstrate Keras are the same synthetic datasets used in the
chapter on TensorFlow. We’ll again use them for classification and regression
purposes.
Classification
Before we begin to implement our classifier, we need to import some libraries in
order to use them. Here are the libraries we need to import:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from sklearn.model_selection import train_test_split
First, we should load the dataset, and do a bit of pre-processing to format the data
we’ll use in our model. As usual, we load the data as a list:
# import the data
with open("../data/data1.csv") as f:
    data_raw = f.read()
lines = data_raw.splitlines()  # split the data into separate lines
Then, we separate the labels and the three features into lists, respectively called
labels and features:
labels = []
features = []
for line in lines:
    tokens = line.split(',')
    labels.append(int(tokens[-1]))
    x1, x2, x3 = float(tokens[0]), float(tokens[1]), float(tokens[2])
    features.append([x1, x2, x3])
Next, we make dummy variables of the three label categories using Pandas’
get_dummies function:
labels = pd.get_dummies(pd.Series(labels))
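As a small illustration of what get_dummies produces (using made-up labels, not the book's dataset), each distinct category becomes its own indicator column:

```python
import pandas as pd

# Each of the three label categories becomes its own indicator column.
dummies = pd.get_dummies(pd.Series([1, 2, 3, 1]))
print(list(dummies.columns))  # [1, 2, 3]
print(dummies.shape)          # (4, 3)
```

Row i has a 1 (or True, depending on the pandas version) in the column matching its category and 0 elsewhere, which is exactly the one-hot format the softmax output layer expects.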
The next step is to split our data into train and test sets. For this purpose, we use the
scikit-learn’s train_test_split function that we imported before:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
We’re now ready to build up our model using Keras. We first define our model and
then add three layers; the first two are the dense layers and the third is the output
layer:
model = Sequential()
model.add(Dense(units=16, activation='relu', input_dim=3))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=3, activation='softmax'))
As you can see, building a graph in Keras is quite an easy task. In the code above,
we first define a model object (which is sequential, in this case). Then we add three
fully-connected layers (called dense layers).
After we define our model and layers, we must choose our optimizer and compile our model. For the optimizer, we use Adam, setting its learning rate to 0.1:
sgd = optimizers.Adam(lr=0.1)
Then we compile our model. In doing so, we define our loss function to be
categorical cross-entropy, which is one of the pre-defined loss functions in Keras.
For the metric to evaluate the performance of our model, we use accuracy, as usual.
All these definitions can be implemented in a single line in Keras as seen here:
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
Now, it's time to train our model in a single line! We train our model by calling the fit function of the model object. As parameters, we provide our features and labels as NumPy arrays, along with the batch size and the number of epochs. We set the batch size to 10,000 and the epochs to 5:
model.fit(np.array(X_train), np.array(y_train),
batch_size=10000, epochs = 5)
During the training you should see something like this in the console:
Epoch 1/5
200000/200000 [==============================] - 0s 2us/step
- loss: 0.3671 - acc: 0.8255
Epoch 2/5
200000/200000 [==============================] - 0s 2us/step
- loss: 0.0878 - acc: 0.9650
Epoch 3/5
200000/200000 [==============================] - 0s 2us/step
- loss: 0.0511 - acc: 0.9790
Epoch 4/5
200000/200000 [==============================] - 0s 2us/step
- loss: 0.0409 - acc: 0.9839
Epoch 5/5
200000/200000 [==============================] - 0s 2us/step
- loss: 0.0368 - acc: 0.9854
So our model's final loss value is approximately 0.04, and its accuracy on the training set is about 0.99!
Regression
In Keras, building regression models is as simple as building classification models.
We first define our models and the layers. One thing to be aware of is that the
output layer of a regression model must produce only a single value.
We also must choose a different loss function. As we did in the TensorFlow chapter,
we use the L2 metric, as it is one of the most popular metrics in regression analysis.
Finally, we evaluate the performance of our model using R-squared.
Import the following libraries:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
import keras.backend as K
from sklearn.model_selection import train_test_split
We’ll again utilize the synthetic dataset from the previous chapter. Recall that it
includes 20 features and 1 outcome variable. Below, we load the dataset and pre-
process the data into the format we’ll use in our model:
# import the data
with open("../data/data2.csv") as f:
    data_raw = f.read()
lines = data_raw.splitlines()  # split the data into separate lines
This time, our outcome is a single value and we have 20 features. So, we set the
relevant parameters accordingly.
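The model definition is analogous to the classifier's, except that the input dimension is 20 and the output layer has a single unit with no softmax. A sketch using the Sequential and Dense imports above (the hidden-layer sizes here are illustrative, not necessarily the book's exact choice):

```python
model = Sequential()
model.add(Dense(units=16, activation='relu', input_dim=20))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=1))  # single continuous output, no activation
```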
It's time to compile our model. First, though, we must define a function that calculates the R-squared metric. Unfortunately, as of this writing, Keras does not provide a built-in R-squared metric in its package. As such, we implement our own:
def r2(y_true, y_pred):
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
After that we choose Adam as our optimizer and set the learning rate to 0.1:
sgd = optimizers.Adam(lr=0.1)
Now we can compile our model. We use the mean-squared error as our loss
function, and we feed our r2() function to the model as a metric:
model.compile(optimizer=sgd,
              loss='mean_squared_error',
              metrics=[r2])
Depending on the structure of the model, the output should look something like
this:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 2)                 4
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 3
=================================================================
Total params: 7
Trainable params: 7
Non-trainable params: 0
Of course, visualizations are not only more aesthetically pleasing, but also can help
you easily explain and share your findings with stakeholders and team members.
Graphically visualizing the model in Keras is straightforward. A module named
keras.utils.vis_utils includes all the utilities for visualizing the graph using a library
called graphviz. Specifically, the plot_model() function is the basic tool for
visualizing the model. The code below demonstrates how to create and save the
graph visualization for a model:
from keras.utils import plot_model
plot_model(model, to_file = “my_model.png”)
Depending on the structure of the model, the png file should contain a graph like
the one in Figure 5:
The plot_model() function accepts two optional arguments:
show_shapes: if True, the graph shows the output shapes. The default setting is False.
show_layer_names: if True, the graph shows the names of the layers. The default setting is True.
Once we convert our Keras model into TensorFlow Estimator, we can use this
estimator in TensorFlow code (as we demonstrated in the previous chapter).
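The conversion itself is essentially a one-liner. In the TensorFlow versions contemporary with this book, a compiled Keras model can be wrapped like this (model here stands for any compiled Keras model, such as the ones built earlier in the chapter):

```python
import tensorflow as tf

# Wrap a compiled Keras model as a TensorFlow Estimator.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
```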
Before closing the chapter, we encourage our users to read more about the Keras
framework. If you are using DL models for research purposes, Keras is probably the
most convenient tool for you. Keras will save a lot of time in implementing the
many models you’ll try. If you’re a data science practitioner, Keras is one of the best
choices for you both in prototyping and production. Hence, enhancing your
understanding and expertise in Keras is beneficial regardless of your particular
problem.
Summary
TensorFlow now integrates Keras in its core, so there is no need to install Keras separately if you have already installed TensorFlow. However, we treat Keras as a separate framework here, as you might prefer to use it on top of Theano or CNTK. In this respect, we prefer to install Keras separately.
CHAPTER 6
PSO algorithm
The logic behind PSO is to have a set of potential solutions (akin to a swarm of particles) that continuously evolve, becoming better and better, based on some fitness function the system tries to maximize or minimize. The particles "move" with varying speeds throughout several dimensions (also called variables), influenced by the best-performing particle, so that they collectively reach an optimal solution in an efficient manner.
In addition, each particle “remembers” its best performance historically, and it takes
that into account when changing its position. Naturally, the best performing
particle may be a different one over the duration of the search (you can imagine the
group of solutions moving towards the best possible solution like a swarm of
insects, so which insect is closest to that solution is bound to be different every time
you look at the swarm). Still, there is generally an improvement in the best solution over time, even if the rate of this improvement gradually diminishes. This is because the closer the swarm gets to the best solution, the more likely it is to deviate from it (albeit slightly) while "zeroing in" on that best solution.
All these traits make the PSO algorithm ideal for optimizing the parameters of a
complex system. PSO is relatively new as an algorithm; its creators, Dr. Eberhart
and Dr. Kennedy, invented it in 1995. The pseudocode of PSO is as follows:
For each particle in the swarm
    Initialize particle by setting random values to its initial state
End
Do
    For each particle in the swarm
        Calculate fitness value
        If the fitness value is better than the best fitness value in its history (pBest):
            pBest <-- fitness value of particle
    End
    gBest <-- particle with the best fitness value of all the particles in the swarm
    For each particle
        Calculate particle velocity according to equation A
        Update particle position according to equation B
    End
Repeat Until maximum iterations is reached OR minimum error criteria is attained
The key equations for the updating of a particle’s speed and position, respectively,
are the following:
eq. A: v[] += c1 * rand() * (pbest[] - present[]) + c2 * rand() * (gbest[] - present[])
eq. B: present[] += v[]
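Translated into NumPy (our own sketch, not the book's Julia implementation), one update step for a single particle looks like this:

```python
import numpy as np

def pso_step(present, v, pbest, gbest, c1=2.0, c2=2.0, vmax=4.0, rng=None):
    """One PSO update for a single particle (eq. A, then eq. B)."""
    rng = rng or np.random.default_rng()
    # eq. A: accelerate toward the personal best and the global best
    v = (v + c1 * rng.random(v.shape) * (pbest - present)
           + c2 * rng.random(v.shape) * (gbest - present))
    v = np.clip(v, -vmax, vmax)  # cap each component at Vmax
    # eq. B: move the particle by its velocity
    return present + v, v
```

Note that if a particle already sits at both its personal best and the global best, eq. A leaves its velocity unchanged, and eq. B simply carries the particle forward.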
Note that c1 and c2 are parameters of the PSO algorithm, each having the default
value of 2.0 (though any values between 0.0 and 4.0 can be used), while the value of
each is independent of the value of the other. Also, the velocity of a particle for any
given dimension is limited to Vmax (another parameter set by the user) to avoid the
particles swarming out of control (something that would destabilize the whole
process). The exact value of this parameter depends on the problem.
Other parameters include the number of particles (usually at least 20, with more
complex problems usually requiring more particles), the range of the values of the
particles (which is dependent on the problem), and the stopping conditions—
namely the total number of iterations and the minimum error threshold. These
stopping conditions are also dependent on the problem.
To make PSO faster, we can include an additional parameter that affects the
progress the algorithm makes as it searches through the solution space. If a certain
number of iterations take place without any significant progress in the objective
function, then the algorithm can terminate; in these cases, the swarm has usually gotten stuck in a local optimum.
Main PSO variants
Just like most well-established algorithms in AI, PSO has its share of variants, most of which are better suited for certain sets of problems. The most important of these variants are:
PSO with inertia (by Shi and Eberhart): a variation of PSO that uses an
“inertia weight” (usually around 0.9), which gradually decreases, allowing
for a narrowing of the search over time. This enables PSO to switch from
exploratory to exploitative mode, yielding more accurate solutions.
PSO with Neighborhood Operator (by Suganthan): a popular variant of PSO that considers other particles in the same neighborhood. The idea is that this method greatly reduces the chances of getting trapped in a local optimum, making the whole system more robust.
Discrete PSO (by Kennedy and Eberhart): a variant of PSO that enables
the solution of discrete optimization problems.
Constriction PSO (by Clerc and Kennedy): a version of PSO that doesn't make use of the Vmax parameter. It manages to keep velocities in check by introducing a couple of additional parameters, one of which is the constriction coefficient χ (suggested value: 0.7289). These parameters ensure that the velocity of the algorithm remains manageable, making PSO converge smoothly.
Fully informed PSO (by Mendes et al.): a case of Constriction PSO, where
the two parameters are the same; this is generalized to any number of
particles. This enables each particle to be influenced by each other particle,
making the whole process more stable.
Bare-bones PSO (by Kennedy): a lighter version of the original PSO
algorithm, with the whole velocity aspect dropped altogether.
Firefly (by Xin-She Yang): a somewhat different approach to the whole
“movement” part of the algorithm. We’ll examine this in more detail in the
following section.
Note that since PSO is a topic of active research, new variants are continually being developed. The variants mentioned here are just the few that have stood the test of time, and they show that there is a lot of promise in this optimization framework.
Firefly optimizer
The Firefly optimizer is one of the most interesting variants of PSO—in fact, it is
unique enough to be considered a different optimizer by many people. However,
when examined in depth, it becomes clear that the Firefly optimizer is a more
creative take on the key components of PSO.
Despite sharing the same principles as PSO, a couple of distinct differences give Firefly its niche. For starters, each particle is attracted by all of the other particles, not just the best-performing one. This is reminiscent of the Fully-Informed PSO variant.
In addition, there is no velocity in this algorithm, since the concept of inertia is replaced by "fogginess." In other words, light is not dispersed perfectly as if in a vacuum (such a case would make the algorithm very unstable and the particles' movement chaotic). This is expressed by the light-absorption coefficient γ, which ensures that the attraction fades exponentially, while the intensity of the light follows the Newtonian model of gravity. The exponential diminishing of the attractiveness of other particles ensures that the fireflies don't get too confused, and that it is generally the closest well-performing firefly that has the most impact.
Other than all that, the Firefly algorithm follows the same strategy as the PSO
algorithm. You can read about the details of the Firefly algorithm in its
documentation.12
The key advantages of Firefly over PSO are that it is faster and more accurate,
across a variety of objective functions. On the downside, it has a bunch of
parameters that need to be set, and tweaking the algorithm is quite a challenge.
Fortunately, the Firefly algorithm is good enough to be useful off the shelf with the
default parameter values. The code that accompanies this book includes an
implementation of Firefly that we’ll examine in Chapter 10, when we discuss
optimization ensembles.
Despite its length, the core of the algorithm is simple, quite fast, and relatively light
on computational resources. Note that most of the parameters are optional, since
their default values are predefined. Simply feed it the fitness function and the number of variables, and decide whether you want it to be minimized or not. If you don't specify the latter, the PSO method defaults to minimizing the fitness function.
Note that here we use the "vanilla" version of PSO, with minimal add-ons. As a result, its performance is not great. We'll investigate an improved Julia script of PSO in Chapter 10, along with its parallelized version.
PSO in action
The first practical application of PSO proposed by its creators was training ANNs. However, PSO's flexible nature has made it useful in various other domains, such as combinatorial optimization, signal processing, telecommunications, control systems, data mining, design, power systems, and more. That said, as more specialized algorithms for training ANNs became available, PSO ceased being a relevant option for optimizing the weights of an ANN.
Although most versions of PSO involve a single-objective approach (having a single
fitness function), with some changes, PSO can be used in multiple-objective and
dynamic problems (with varying configurations). The possibility of having
constraints in the solution space has also been explored (the constraints in this la er
case are inherently different from the constriction PSO variant).
So, even though PSO was originally a data science-oriented algorithm, its
applicability has made it a useful tool for all sorts of problems. This clearly shows
how AI is an independent field that dovetails well with almost any data-related
scenario.
Nevertheless, some organizational problems require the use of an optimizer, rather
than a machine learning system. Examples of such issues include creating an
optimal schedule, finding the best way to stock a warehouse, or working out the
most efficient route for a delivery driver. These problems are so common in so
many industries that familiarity with a robust optimizer like PSO can be a good
distinguishing factor, professionally. Besides, having a variety of skills can help you develop a more holistic view of a challenging situation, empowering you to find a better strategy for tackling it.
Note that just like any other AI optimizer, PSO is not guaranteed to find the best solution to a problem, nor does it offer mathematical precision. However, it is very efficient. As
such, PSO adds a lot of value in cases where an approximate solution is sufficient,
especially if the time it takes to find this solution is also important. Furthermore,
when the problems involve functions that cannot be easily analyzed mathematically
(e.g. functions that aren’t “smooth” enough to calculate a derivative function), a
method like PSO is the most viable option.
Though more compact, this may not be as useful for complex functions involving a
lot of variables.
Whatever the case, we expect to get a solution that’s close to (0, 0, 0, 0, 0), since this
is the solution that corresponds to the absolute minimum of this function (which is
in this case 1.0 since 0^2 + 0^2 + |0| + sqrt(|0*0|) + 1 = 1).
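The fitness function F referred to here is defined in the book's accompanying code. Based on the expression in the text, a plausible Python equivalent for five variables is the following (our reconstruction, not the original Julia):

```python
import math

def F(x):
    # x is a list of 5 values; the minimum is 1.0, attained at (0, 0, 0, 0, 0)
    return x[0]**2 + x[1]**2 + abs(x[2]) + math.sqrt(abs(x[3] * x[4])) + 1
```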
Next, we need to run the PSO algorithm, using the above function as an input. We’ll
work with the default values for the input parameters, for the majority of them:
pso(F, 5)
For one of the runs of PSO, the solution [-0.0403686, 0.0717666, -0.0102388,
0.0966982, -0.129386] was yielded, corresponding to a fitness score of approximately
1.243. Although this solution is not particularly impressive, it is decent, considering
the complexity of the problem and the fact that we used the most basic version of
the optimizer.
We can try a smaller swarm – say, of 20 particles – for comparison:
pso(F, 5, true, 20)
The result in this case was [0.164684, -0.241848, 0.0640438, -0.0186612, -0.882855],
having a fitness score of about 1.388. Additional runs may yield better scores. This shows that PSO systems can yield acceptable results, even without lots of particles.
We can measure how long this whole process takes using the @time meta-
command, as follows:
@time pso(F, 5)
In this case, for a solution of comparable fitness, we learn that the whole process
took about 0.008 seconds—not bad at all. As a bonus, we get some information
about how many computational resources the process consumes. That is, 7.179 MB
of RAM through its 87.6K allocations. Note that for this report to be accurate, the
command must run more than once. This is true of all Julia functions benchmarked
using this meta-command.
The time the whole problem took was about 0.009 seconds, while it took about
15.006 MB of memory, and around 183.1K allocations. Clearly, this is a somewhat
tougher problem, involving a larger swarm, so it takes a bit more time and memory
(though the time overhead is quite small). If we were to solve either one of these
problems with a deterministic optimizer, though, it would probably take the same
computer longer.
PSO tips
Despite its simplicity, avoiding suboptimal results with PSO requires some attention to detail. For instance, if you use a low value for Vmax, the algorithm will take a long time to converge (not to mention the increased risk of it getting stuck at a local optimum, yielding a mediocre solution). On the other hand, a very large
value would make the whole process very unstable (and unable to converge on any
optimum).
Furthermore, a very large number of particles makes the whole system fairly slow, while too few particles make it difficult to find the optimum solution. The empirical
default value of 10 times the number of variables seems to work well for all the
benchmarks tried, but it’s just a rule of thumb; make sure you experiment with this
parameter when you fine-tune your PSO model.
In addition, in some cases, PSO is used with a variable Vmax parameter, to ensure
that it converges more smoothly. For example, you can reduce it by a factor k every so many iterations, so that as the swarm approaches the optimum value of the function, its particles will be closer together, yielding better precision. Once you get the hang of PSO, you can experiment with such parameters to improve its performance.
What’s more, it’s a good idea to make sure that the swarm covers a meaningful area
when deployed, to ensure that it won’t get stuck in a local optimum. In other
words, if you are optimizing a set of three parameters that all take values between 0 and 1, it's best to spread the swarm to cover as much volume as possible, instead of
having them all close to (0, 0, 0). This is because if the optimal solution is close to (0,
1, 1), for example, it could take the swarm a long time to approach it.
How much area exactly a swarm covers when deployed is something you may want
to experiment with, since it largely depends on the problem at hand. Also consider
the distribution of the particles across the various dimensions of the problem space.
The distribution used in this implementation is Gaussian, as shown through the
randn() function used to initialize the particles.
The algorithm’s performance can be greatly improved if you parallelize it. The best
way to do so involves defining a number of workers, each one undertaking an
instance of the algorithm, and then comparing their findings, taking the smaller or
larger of their solutions, depending on the type of optimization problem you are
solving. Make sure you use the @everywhere meta-command in front of all the
functions, however, or the parallelization will not work. We’ll further examine the
parallelized version of PSO in Chapter 10.
Finally, PSO is still a work in progress, so don't be afraid to experiment a bit, changing it to suit the problem you need to solve. We also recommend you try to implement the Firefly algorithm. We'll be using the latter a bit in Chapter 10, where we'll explore the possibilities of optimization ensembles.
Summary
https://arxiv.org/pdf/1308.3898.pdf.
https://bit.ly/2NyIiBv.
CHAPTER 7
The crossover process ensures that the genome changes over time, through traits
that already manifest in the parents (e.g. eye color). Which of these traits survive in
the long run depends on another aspect of the evolutionary process called fitness.
Not all chromosomes get to cross over, since there exists a selection process to ensure that the best-performing chromosomes are most likely to have descendants in the next generation. Much like in a species, only the better-equipped individuals (e.g. faster, more adaptable, with better immune systems, etc.) manage to survive and procreate, ensuring that their genes don't die out.
Fitness is the measure of how well these chromosomes perform as potential
solutions to the problem we are solving. Just like with PSO, we are trying to
maximize or minimize a fitness function that evaluates a solution. As the number of
chromosomes must remain constant through the whole evolutionary process
(otherwise we’d risk a population explosion, draining our computational resources),
only the best chromosomes make it to the next generation, based on their fitness.
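A minimal sketch of fitness-based selection in Python (a generic roulette-wheel scheme, where selection probability is proportional to fitness; this is illustrative, not the book's exact implementation):

```python
import random

def select(population, fitnesses, n, rng=None):
    """Pick n chromosomes, with probability proportional to fitness
    (assumes a maximization problem with positive fitness values)."""
    rng = rng or random.Random()
    return rng.choices(population, weights=fitnesses, k=n)

pop = ['A', 'B', 'C']
fit = [1.0, 1.0, 8.0]  # 'C' is much fitter than the others
chosen = select(pop, fit, 100, random.Random(0))
```

Fitter chromosomes dominate the next generation, but weaker ones still have a nonzero chance, which helps preserve genetic diversity.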
Elitism is an auxiliary aspect of the GAs framework that is often used to ensure that
the best solution is constantly present. It’s like a fail-safe, guarding against the
possibility that the new genome is worse than that of the previous generation, due
to some bad crossovers and/or bad mutations. Elitism makes sure that the best
performing chromosome or chromosomes remain in the next generation regardless.
Although elitism was not part of the original GA framework, it is strongly
recommended you make use of it, as it has been shown to generally improve the
performance of the optimizer. However, if you overdo it with elitism by getting too
many well-performing chromosomes into the next generation at the expense of other,
less well-performing chromosomes, you may end up with an overly
homogeneous population. This would result in an optimization process that
converges prematurely with the yielded solution more likely to be sub-optimal.
Note that the elitism option is controlled by a parameter that indicates how many
best-performing chromosomes to keep (see elitism() function later on).
The search space of problems tackled with GAs ideally involves a huge number of
potential solutions to the problem—usually larger than what could be solved
analytically. A modest example: if a GA tried to solve a problem where each
chromosome has 60 genes represented as bits, it would have 2^60, or over a billion
billion, potential solutions.
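A quick back-of-the-envelope check confirms the scale (a Python snippet, for illustration only; the chapter’s own code is in Julia):

```python
# Search space size for binary chromosomes of length n is 2**n.
for n in (10, 20, 60):
    print(n, 2 ** n)

# For n = 60 this is about 1.15 quintillion (over a billion billion)
# candidate solutions -- far too many to enumerate exhaustively.
print(2 ** 60)
```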
In general, problems that lend themselves to GAs fall under the umbrella of “NP-
hard” problems. These are problems whose solving cannot be reduced to a fast
process, as the time they take grows exponentially with their size. This means that
even a modest increase in the dimensionality of the problem can multiply its
runtime many times over; doubling the size of an exponential-time problem
squares its runtime, or worse.
A typical NP-hard problem with many applications in logistics is the Traveling
Salesman Problem (TSP). This involves finding the optimal way to traverse a graph so
that at the end of your trip you are back where you started. Despite its simple
description, this is an exceptionally difficult problem as the number of nodes in that
graph gets larger.
As the scope of these problems makes finding the best solution quite unrealistic, we
opt for a “good enough” solution—one that yields a quite large (or small) value for
the fitness function we are trying to maximize or minimize.
4. Loop Process: Repeat steps 2-3 until a stopping condition has been met.
5. Output results: Output the best-performing chromosome and its fitness
score.
The selection process involves one of two main methods to stochastically determine
which chromosomes get to be parents (candidates of the crossover process) and
which don’t. These are roulette wheel selection and rank selection.
The first approach involves creating a “wheel” based on the fitnesses of all the
chromosomes, by basically normalizing them so that they add up to 1. This
normalization takes place based on a scaling function like exp(x) or sqrt(x),
depending on the problem at hand. Afterward, we obtain a random number in the
[0, 1) interval and pick the chromosome corresponding to the wheel section that
includes that random number. We then repeat that process one more time to find
the other parent.
The rank selection approach uses the ranking of the fitness scores instead of the
scores themselves. So, the worst-performing chromosome will have a value of 1, the
second worst a value of 2, and the best one a value of n, where n is the total number
of chromosomes. In all other aspects, it’s the same as the roulette wheel
approach. The rank selection approach ensures that all chromosomes have a decent
chance of getting selected, especially in cases where a small number of
chromosomes dominate the population in terms of performance (because they are
significantly better than the rest).
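To make the two schemes concrete, here is a minimal sketch in Python (for illustration only; the chapter’s actual implementation, in Julia, appears later, and the function names here are ours):

```python
import random

def roulette_select(fitnesses, scale=lambda f: f):
    # Build the "wheel": scaled fitnesses normalized so they sum to 1,
    # then spin it with a random number in [0, 1).
    scaled = [scale(f) for f in fitnesses]
    total = sum(scaled)
    r = random.random()
    acc = 0.0
    for i, s in enumerate(scaled):
        acc += s / total
        if r < acc:
            return i
    return len(fitnesses) - 1  # guard against rounding error

def rank_select(fitnesses):
    # Replace fitness scores with their ranks (worst = 1, best = n),
    # then run the same wheel on the ranks instead of the raw scores.
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return roulette_select(ranks)

# Pick two parents from a toy population of four chromosomes:
parents = [roulette_select([0.1, 0.5, 2.0, 0.4]) for _ in range(2)]
```

Note how the chromosome with fitness 2.0 dominates the raw wheel, while under rank selection its advantage shrinks to 4 out of 10.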
With so many parameters in the GA framework, it can be overwhelming to figure
out how to use it for your optimization problems. What follows are some rules of
thumb for selecting values for these parameters. Naturally, fiddling with these
parameters is a great way to learn, but these guidelines will help you at least get
started.
As far as crossover is concerned, you can use a probability between 0.8 and 0.95.
This means that around 90% of the time, there will be a crossover taking place for a
given chromosome.
Regarding mutation probability, a value around 0.005 to 0.01 generally works. Over
time, mutation on its own can produce a decent solution without any crossover at
all. Setting this too high will result in a highly unstable genome that will change
uncontrollably and never converge.
Population size is a bit trickier to set, since a larger population would still work, but
take longer for the algorithm to run. That’s why having a number of chromosomes
equal to the number of genes in a chromosome is generally a good place to start.
When it comes to selection type, the roulette wheel method is generally fine.
However, if you find that a small set of chromosomes monopolizes the solution
process (resulting in largely sub-optimal results for the whole system), then rank
selection may be a better option.
The code is written to be easily customized whenever needed, as per the functional
programming paradigm that Julia follows.
function sample_ff(x::Array{<:Real, 1}, c::Array{<:Real, 1} = ones(Int64, length(x))) # function to maximize
    z = abs.(x) .* c
    return 1 / (1 + sum(z))
end
ShouldActivate(p::Float64) = rand() < p # activation trigger for an event of probability p
evaluation(ff::Function, x::Array{<:Real, 1}) = ff(x)
function scaling(y::Array{Float64, 1}) # scaling fitness function values
    y_ = exp.(y)
    # y_ = sqrt.(y) # another alternative for scaling
    return y_ / sum(y_)
end
function selection(X::Array{<:Real, 2}, y::Array{Float64, 1}, nv::Int64, ips::Int64)
    y_ = scaling(y)
    c = 0
    ind = 1
    xs = Array{eltype(X)}(undef, 2, nv)
    ys = Array{Float64}(undef, 2)
    while true
        if ShouldActivate(y_[ind])
            c += 1
            xs[c, :] = X[ind, :]
            ys[c] = y[ind]
            if c == 2; return xs, ys; end
        end
        ind += 1
        if ind > ips; ind = 1; end
    end
end
function mutation(x::Array{<:Real, 1}, p::Float64, nv::Int64) # change population a bit
    new = false
    for i = 1:nv
        if ShouldActivate(p)
            new = true
            if eltype(x) <: Bool
                x[i] = !x[i]
            else
                x[i] = 1 - x[i]
            end
        end
    end
    return x, new
end
function crossover(xs::Array{<:Real, 2}, ys::Array{Float64, 1}, p::Float64, nv::Int64)
    d = rand(1:2) # dominant gene
    z = xs[d, :]
    w = ys[d]
    new = false
    if ShouldActivate(p)
        new = true
        r = 3 - d # recessive gene
        q = rand(1:(nv-1))
        z[1:q] = xs[r, 1:q]
        w = [NaN]
    end
    return z, w, new
end
function elitism(X::Array{<:Integer, 2}, y::Array{Float64, 1}, n::Int64)
    YX = hcat(y, X)
    YX_sorted = sortslices(YX, dims = 1, rev = true) # sort rows by fitness (first column), best first
    X_elite = round.(Int64, YX_sorted[1:n, 2:end])
    y_elite = YX_sorted[1:n, 1]
    return X_elite, y_elite
end
function GeneratePopulation(ips::Int64, nv::Int64, s::Int64 = -1) # function for creating original population
    if s == -1
        X = rand(Bool, ips, nv)
        for i = 1:ips
            if sum(X[i,:]) == 0
                X[i, rand(1:nv)] = true
            end
        end
    else
        x = falses(nv)
        for i = 1:s; x[i] = true; end
        X = Array{Bool}(undef, ips, nv)
        for i = 1:ips
            X[i,:] = x[randperm(nv)]
        end
    end
    return X
end
function runga(ff::Function, nv::Int64, ips::Int64 = nv, s::Int64 = div(nv, 2),
    ng::Int64 = 1000, pm::Float64 = 0.01, pc::Float64 = 0.9,
    pr::Float64 = 0.1) # wrapper function
    X = GeneratePopulation(ips, nv, s)
    y = Array{Float64}(undef, ips)
    for i = 1:ips
        y[i] = evaluation(ff, X[i,:]) # fitness scores of population
    end
    n = round(Int64, ips*pr) # elite size
    X_ = Array{Bool}(undef, ips, nv) # offspring population
    y_ = Array{Float64}(undef, ips) # fitness scores of offspring population
    for i = 1:ng
        X_[1:n, :], y_[1:n] = elitism(X, y, n)
        for j = (n+1):ips
            xs, ys = selection(X, y, nv, ips)
            z, w, new1 = crossover(xs, ys, pc, nv)
            z, new2 = mutation(z, pm, nv)
            if new1 || new2
                y_new = evaluation(ff, z)
            else
                y_new = w
            end
            X_[j,:] = copy(z)
            y_[j] = y_new[1]
        end
        X = copy(X_)
        y = copy(y_)
    end
    ind = findmax(y)[2]
    return X[ind, :], y[ind]
end
GAs in action
Let’s now examine a couple of GAs at work. Before typing any code, make sure you
have loaded the standard-library packages Random and Statistics, as their
functions will be used in these examples. You can do this as follows:
using Random, Statistics
We’ll examine two problems using the GAs optimization framework. The first
problem helps you get the hang of the method using an example you may have
encountered before. The second example is more related to the data science craft, so
we’ll use our second synthetic dataset for that.
For starters, let’s examine an old puzzle problem involving some geometry. We
chose this “matchstick problem” since it is easy to comprehend, while at the same
time challenging enough to have non-obvious solutions. The goal is to form two
squares, limiting the total number of matchsticks used to a maximum of 8.
In the first example, we’re looking at a maximization problem. So, given a set of 8
matches, what’s the best arrangement so that we have two squares, in a 2x2 grid
(Figure 7)?
We must first define the right fitness function. This may seem a bit of a daunting
task, so let’s break it down into smaller sub-tasks, each with its own function.
(That’s one of the strengths of a functional programming language, anyway.) This
process may seem a bit overwhelming at first, but at the very least, it’s good
practice for your analytical skills (which are quite transferable to other aspects of AI
and data science).
Let’s start by pinpointing all the possible squares, in relation to the indexes of
matchsticks, based on the grid illustrated above:
global S = [[1, 3, 4, 6], [2, 4, 5, 7], [6, 8, 9, 11], [7, 9, 10, 12], [1, 2, 3, 5, 8, 10, 11, 12]];
Then we can create a function that shows whether the sum of true values in a
chromosome is valid or not:
IsSumRight(x::Array{Bool, 1}, n::Int64) = (sum(x) == n)
In this case, n would be 8. However, it’s good practice to make these functions
parametric, so that you can experiment with variants of the problem afterwards if
needed.
Next, we’ll need to create a function that calculates how many squares are formed
with the matches from a given chromosome. This would look something like this:
function NumberOfSquares(x::Array{Bool, 1})
    z = 0
    for s in S
        q = true
        for i in s
            q = q && x[i]
        end
        if q; z += 1; end
    end
    return z
end
Also, we’ll need a function to help us decipher a chromosome, turning its 1s and 0s
into indexes we can interpret using the grid:
function MatchIndexes(X::Array{Bool, 1})
    q = collect(1:12) .* X
    return q[q .> 0]
end
Now it’s time for the actual fitness function, which makes use of all the above
auxiliary functions in a way that’s meaningful to the problem at hand:
function ff1(x::Array{Bool, 1})
    z = 2.0 # fitness score
    if IsSumRight(x, 8)
        z += 1
        Q = MatchIndexes(x)
        OK = true
        for q in Q
            ok = false
            for s in S
                if q in s
                    ok_ = true
                    for s_ in s
                        if !(s_ in Q)
                            ok_ = false
                            break
                        end
                    end
                    if ok_; ok = true; end
                end
            end
            if !ok
                OK = false
            end
        end
        if !OK; z -= 1.5; end # loose ends in the squares formation
        s = NumberOfSquares(x)
        if s == 2 # have needed number of squares
            z += 1.0
        elseif s == 1 # have just one square
            z += 0.75
        end
    end
    return z
end
This function may seem a bit complex because it also scans for loose matchsticks,
which it penalizes since they are not part of the desirable solution (i.e. their use
would make the problem easier).
Naturally, we’ll need to maximize this function. We assign different values to
the fitness score of the chromosome for the various states because even a
suboptimal solution needs to be rewarded somehow. Otherwise, it would be much
more difficult for the optimizer to know if it’s going in the right direction,
since it would only be able to discern between a great solution and a terrible one.
With the intermediate rewards, however, for solutions that are OK (towards the
right direction) but still not acceptable, the whole solution landscape becomes
smoother and easier for the algorithm to navigate. Also, it’s important to avoid
using extreme values such as Inf or NaN as part of the fitness function, as this
would confuse the whole system during the selection process.
Running the GA for this problem should be straightforward at this point,
though for a problem like this, it is best to use more generations than the default
(here, 25,000):
x, f = runga(ff1, 12, 12, 8, 25000)
println(MatchIndexes(x))
println(f)
There are two possible solutions to this problem (which are also symmetrical), and
the system correctly identifies one of them:
[2, 4, 5, 6, 7, 8, 9, 11]. This corresponds to the configuration shown in Figure 8:
This solution has a fitness score of 4.0, which is the highest possible. Note that to
obtain this solution, you may need to run the optimizer a couple of times at least,
since it won’t always converge on it. Alternatively, it may converge on a symmetric
arrangement, also having a fitness score of 4.0.
For the second example of the GA framework, we’ll look at a simple data science
problem involving data engineering. Given a set of 40 features, we’ll figure out the
20 features that are least correlated to each other. In this case, we’ll use the second
dataset (consisting of 20 features), along with a set of new features derived from the
original 20 using the square root operator. As you might expect, these 20 new
features are strongly correlated with the original ones, so 20 of the features of the
final feature set should be removed through this whole process.
Let’s get started by loading the dataset:
data = readcsvfile("data2.csv");
Note that you could also load the data using the read() function from the CSV
library by typing CSV.read("data2.csv"). However, the custom-made function
readcsvfile() used here is somewhat faster as it is customized to the problem at
hand. We won’t show the code for this method for brevity, but you can find it in the
corresponding Jupyter notebook.
Next, we’ll need to isolate the features of the dataset, and create the new features
that the GA system should be able to identify and eliminate:
features = data[:, 1:20]
Z = Array{Float64}(undef, 250000, 40)
Z[:, 1:20] = features
for i = 1:20
    Z[:, 20 + i] = sqrt.(features[:, i])
end
Clearly, the new features are going to be correlated to the original ones, which we
can verify as follows:
cor(Z[:,1], Z[:,21])
However, we’ll need easy access to all correlations among the various features. As
such, it makes sense to compute all these correlations beforehand and store them in
a variable that all functions can easily access:
global CM = cor(Z);
for i = 1:40
    CM[i,i] = 0.0
end
Note that we set the correlation of each variable with itself to 0, to avoid allowing
that value to dominate all the others.
Now, it’s time to define the fitness function for this problem, which should look
something like this:
function ff2(X::Array{Bool, 1})
    N = length(X)
    f = collect(1:N) .* X
    f = f[f .> 0]
    n = length(f)
    if n != 20
        return -Inf
    end
    y = Array{Float64}(undef, n)
    for i = 1:n
        y[i] = maximum(abs.(CM[f[i], f])) # strongest correlation of feature f[i] with the rest
    end
    return -maximum(y)
end
In order to find the optimal selection of features, we must ensure that only one
version of each feature exists in the final feature set (e.g. by getting rid of all the
newer features, or all the older ones, or some combination of both, so that for each
one of the original features only one version of it remains in the reduced feature
set). As such, we make the final result of the fitness function negative, which turns
the minimization of the maximum correlation into a maximization problem that
runga() can handle. Upon doing this, we can now run the GA optimizer as follows:
x, f = runga(ff2, 40)
Genetic programming
Moreover, Genetic Programming (GP) is a powerful GA variant that evolves whole
functions rather than fixed-length strings of values. As it is quite distinct
from the standard GA, and since its usefulness in data science applications is
noteworthy, we’ll take a closer look at GP.
Finally, any add-on to the standard GA, such as the elitism process, can be viewed as a
variant of sorts. After all, the broad framework lends itself to tweaking, so if you
delve into it a bit, you are bound to come up with your own variant of the
optimization method.
Genetic Programming is an interesting variant of GAs, which has received a lot of
attention in the past few years. It involves “building” a synthetic function out of other,
simpler functions that take the form of genes. However, there is also some data
used in those functions, so the inputs of a GP model include a couple of data
streams: one for the inputs and one for the outputs of the function that needs to be
approximated.
Note that this problem can be solved with a regression model, though the model’s
success will be quite limited by linear combinations. Of course, you can have a non-
linear regression model, but you’d have to create the non-linear features yourself
(which takes considerable effort). You could also use an ANN, but you wouldn’t be
able to see how the inputs are mapped to the outputs (just like the “black box” issue
we discussed in previous chapters). If you require a mapping through a function
that you can view in its entirety, GP is the best way to go.
However, the function that GP yields is not necessarily a good model since it is
bound to overfit if you use the error as a fitness function (which you’d minimize
afterwards). That’s why it would make more sense to use some heuristic, such as
the correlation coefficient or a similarity metric in general, as the fitness function
(which you’d maximize in this case). Such problems have been solved successfully
for feature fusion scenarios, as part of a predictive analytics model.15
The functions that GP uses can be anything from a simple polynomial to some
complex trigonometric function. They must be in their most basic form, since
more complex functions will inevitably emerge through the evolution process of the
GP system. For example, if you have the functions tan(x), x^2, and x^3, GP will at one
point have chromosomes consisting of tan(x^2), tan(x^3), (tan(x))^2, x^6, etc. That’s why
special care must be taken when selecting those functions so that the data used with
them makes sense. For instance, there is no point in using sqrt(x) if you have
negative data points, though sqrt(abs(x)) could be used instead.
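A minimal sketch of this composition idea, in Python (the primitive set and function names are hypothetical, chosen to mirror the examples above):

```python
import math
import random

# A hypothetical primitive set; sqrt is guarded with abs() so it stays
# defined for negative inputs, as suggested in the text.
PRIMITIVES = {
    "tan": math.tan,
    "square": lambda x: x ** 2,
    "cube": lambda x: x ** 3,
    "safe_sqrt": lambda x: math.sqrt(abs(x)),
}

def random_composition(depth, rng=random):
    # Chain `depth` randomly chosen primitives into one function, mimicking
    # how GP evolves expressions like tan(x**2) or (tan(x))**2 out of the
    # basic genes tan(x) and x**2.
    names = [rng.choice(list(PRIMITIVES)) for _ in range(depth)]
    def composed(x):
        for name in names:
            x = PRIMITIVES[name](x)
        return x
    composed.recipe = " o ".join(reversed(names))  # human-readable form
    return composed

g = random_composition(2)
value = g(0.5)
```

A real GP system would of course evolve such compositions via selection, crossover, and mutation, scoring each one against the input/output data streams.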
GP is a truly ingenious method that often adds more value than the conventional
GA variants, opening new possibilities. Apart from feature fusion, it has been used
for building LISP programs. Don’t let the fact that many AI experts don’t know
about GP deter you from trying it out.
GA framework tips
Although useful, GAs are not always straightforward to apply; you must be aware of
several considerations. For starters, as GAs have a lot of parameters, some fine-
tuning is usually necessary. Sometimes configuring a GA properly is a big part of its
effectiveness, especially in complex problems.
Also, even though GAs are capable of handling continuous optimization problems,
they are generally not that effective in those situations. That’s why if you want to
tackle such problems you have to either use some other optimization framework, or
some specialized variant of GAs, such as GP.
Moreover, the standard GA is not all that robust, compared to its variants. So, if you
are to use the GA framework for a complex problem, it is best to explore the
algorithm’s various add-ons, so that you get a more effective and more efficient
optimization system.
Furthermore, just like PSO, GAs are great if your fitness function cannot be
differentiated, or the process of differentiating is very expensive computationally.
However, if you have easy access to the derivatives of the fitness function, other
optimization methods may be more effective for finding an accurate solution.
Summary
https://bit.ly/2qw8gvZ.
Unfortunately, NDA restrictions prohibit the discussion of details about such problems.
CHAPTER 8
During the iterations, it’s important to keep in mind the following points:
When selecting a new update for the solution, you first need to randomly
pick a data point in the search space (representing a potential solution)
from the predefined neighborhood area.
The new solution may not necessarily be better, as there is a chance that a
sub-optimal update may take place, based on the temperature.
The cooling is typically geometric; it involves a cooling parameter β which
usually takes values between 0.8 and 0.99. The higher β is, the slower the
cooling. At iteration k, the temperature is T_k = β·T_(k-1). If this cooling
strategy is employed, make sure you have a non-zero positive value for
the minimum temperature set, to ensure that the search will eventually
“cool off.”
The acceptance probability is p = exp(- ΔC / T), where ΔC is the difference
in fitness function values between the current and previous solutions, and
T is the temperature at that time.
If cooling is sufficiently slow, the global optimum will be reached
eventually. However, this is bound to take some time.
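The acceptance rule and the cooling schedule are easy to illustrate numerically; here is a small Python sketch (the values chosen for ΔC and the temperature are arbitrary, for illustration only):

```python
import math

def acceptance_probability(delta_c, temperature):
    # p = exp(-dC / T): a worse move (dC > 0) is accepted with a
    # probability that shrinks as the system cools down.
    return math.exp(-delta_c / temperature)

# The same uphill move of dC = 1.0 under geometric cooling T_k = beta * T_(k-1):
T, beta = 10.0, 0.9
for _ in range(3):
    print(round(T, 2), round(acceptance_probability(1.0, T), 3))
    T *= beta
```

As T shrinks, the probability of accepting a worse solution decays toward zero, which is exactly how SA transitions from exploration to exploitation.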
Note that the sample_ff() function is not essential, but it’s useful to have a fitness
function to test the algorithm with, to make sure that it works properly. Also, the
neighbor function is not entirely necessary as an independent function. However,
since that’s a weak point of the SA algorithm (and hasn’t been explained properly in
any of the sources of the algorithm we’ve found), it would be best to have it as an
auxiliary function so that we can explore it more. After all, in an optimization
problem, not all variables are of the same scale. It’s helpful to have different radius
options for each variable’s neighborhood, which is possible through the
SelectNeighbor() function and its r parameter.
With all that out of the way, we can now look at the SA algorithm itself, which can
be coded as follows:
function SA(ff::Function, nv::Int64, maximize::Bool = true,
    x::Array{<:Real, 1} = rand(nv), T::Float64 = 1e4,
    beta::Float64 = 0.9, Tmin::Float64 = 1e-9,
    r::Array{Float64, 1} = 0.5*ones(nv))
    x = map(Float64, x)
    E = ff(x)
    E_best = E
    x_best = copy(x)
    while T > Tmin
        x_n = SelectNeighbor(x, nv, r)
        E_n = ff(x_n)
        DE = E_n - E
        if (maximize && DE >= 0) || (!maximize && DE <= 0)
            x = copy(x_n)
            E = E_n
        elseif ChancePick(abs(DE), T)
            x = copy(x_n) # accept the worse neighbor
            E = E_n
        end
        if (maximize && (E > E_best)) || (!maximize && (E < E_best))
            E_best = E
            x_best = x
        end
        T *= beta
    end
    return x_best, E_best
end
This function has a global minimum of 2.0, for the solutions [π, π, 0, R] and [π, π, R,
0], where R is any real number. However, due to the nature of the search space,
these two groups of solutions are not so obvious—not to an optimization algorithm,
anyway.
For this problem we’ll start with the potential solution x = [2, 4, 1, -1], which has a
value of 4.716. That’s pretty bad, but we don’t want to make it too easy for the
algorithm! We don’t need to provide SA with a starting position, but doing so
gives us a better understanding of how the optimizer performs. Otherwise, if we
give it a random solution, it may make things too easy, misleading us about the
algorithm’s performance. Running the SA function on this data is quite
straightforward:
SA(ff1, 4, false, x)
This yields a solution and corresponding fitness function value of [3.13373, 3.10918,
0.0130858, -0.683512], 2.00953, which is not bad at all. The last variable (x4) is
irrelevant in this scenario (since it is covered by variable x3, in the fitness function),
while all the other ones that contribute to the solution are fairly close to the actual
minimum the algorithm was after.
The second example tackles an equally challenging function, involving a couple of
trigonometric expressions as well as a simple polynomial. This function, which we’ll
call ff2, takes the following form in Julia:
ff2(x::Array{<:Real, 1}) = -x[1]^2 + 2 / (1 +
abs(tan(x[2]))) - cos(x[3])
Naturally, it has a global maximum of 3.0, which corresponds to the solution [0, 0,
π]. Since we want to push the algorithm to its limits, we’ll start with a potential
solution of x = [1, -1, 2], which corresponds to a pretty terrible fitness score of about
0.198. Let’s now make use of the SA function to see how this solution can be
improved. For this problem it takes the form:
SA(ff2, 3, true, x)
Summary
http://bit.ly/2Oac0fP.
https://bit.ly/2I9fOei.
https://bit.ly/2GdhJBW.
http://bit.ly/2MiIfZM.
CHAPTER 9
CNN components
CNNs have evolved considerably since their introduction in the 1980s. However,
most of them use some variation of the LeNet architecture, which was introduced by
Yann LeCun and flourished in the 1990s. Back then, CNNs were used primarily for
character recognition tasks; this niche changed as they became more versatile in
other areas like object detection and segmentation.
Figure 10 shows a simple CNN architecture, geared towards image classification. It
is composed of several layers, each specialized in some way. These layers eventually
develop a series of meta-features, which are then used to classify the original data
into one of the classes that are represented as separate neurons in the output layer.
In the output layer, usually a function like softmax is used to calculate scores for
each class. Those scores can be interpreted as probabilities. For example, if the score
for the first class is 0.20, we can say that the probability of the observation belonging
to the first class is 20%.
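As a minimal sketch of how raw output-layer scores become probabilities, here is the softmax computation in Python (the same function the Keras example later in this chapter uses in its output layer):

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability, exponentiate,
    # and normalize so the outputs sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw output-layer scores for three classes turn into probabilities:
probs = softmax([2.0, 1.0, 0.2])
```

The outputs are all in (0, 1) and sum to 1, which is why they can be read as class probabilities.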
1. Convolution
2. Non-linearity (usually through the ReLU function, though tanh and
sigmoid are also viable options)
3. Pooling (a special kind of sub-sampling)
4. Classification through the fully connected layer(s)
We’ll go over each of these operations below. Before that, note that all the data of
the original image (or audio clip, etc.) takes the form of a series of integer features.
In the case of an image, each one of these features corresponds to a particular pixel
in that image. However, a CNN can also use sensor data as input, making it a very
versatile system.
Convolution
This is where CNNs get their name. The idea of convolution is to extract features
from the input image in a methodical and efficient manner. The key benefit of this
process is that it considers the spatial relationships of the pixels. This is
accomplished by using a small square (aka the “filter”) that traverses the image
matrix in pixel-long steps. Figure 11 below demonstrates the convolution operator:
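The sliding-filter computation itself can be sketched in a few lines of Python (a “valid” convolution with no padding; as is conventional in CNNs, the filter is applied without flipping):

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image in one-pixel steps and sum the
    # element-wise products at each position ("valid" mode: no padding).
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

# A toy 4x4 "image" and a 2x2 diagonal filter:
image = [[1, 1, 1, 0],
         [0, 1, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = convolve2d(image, kernel)  # a 3x3 feature map
```

Each entry of the resulting feature map summarizes one small neighborhood of the image, which is how the spatial relationships of the pixels are preserved.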
Non-linearity
Non-linearity is essential in all DL systems and since convolution is a linear
operation, we need to introduce non-linearity in a different way. One such way is
the ReLU function which is applied to each pixel in the image. Note that other non-
linear functions can also be used, such as the hyperbolic tangent (tanh) or the
sigmoid. Descriptions of these functions are in the glossary.
Pooling
Since the feature maps and the results of the non-linear transformations of the
original data are rather large, the next step is to make them smaller
through a process called pooling. This involves some summarization operation, such
as taking the maximum value (called “max pooling”), the average, or even the sum
of a particular neighborhood (e.g. a 3x3 window). Various experiments have
indicated that max pooling yields the best performance. Finally, the pooling process
is an effective way to prevent overfitting.
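In code, max pooling over non-overlapping 2x2 windows looks something like the following Python sketch (real frameworks also expose stride and padding options):

```python
def max_pool(feature_map, size=2):
    # Keep only the maximum of each non-overlapping size x size window,
    # shrinking the feature map by a factor of `size` in each dimension.
    h = len(feature_map) // size
    w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(w)]
            for i in range(h)]

fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 1],
      [0, 1, 3, 4]]
pooled = max_pool(fm)  # a 2x2 summary of the 4x4 input
```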
Classification
This final part of a CNN’s functionality is almost identical to that of an MLP which
uses softmax as a transfer function in the final layer. As inputs, the CNN uses the
meta-features created by pooling. Fully-connected layers in this part of the CNN
allow for additional non-linearity and different combinations of these high-level
features, yielding a better generalization at a relatively low computational cost.
Training process
When training a CNN, we can use various algorithms; the most popular is
backpropagation. Naturally, we must model the outputs using a series of binary
vectors, the size of which is the number of classes. Also, the initial weights in all the
connections and the filters are all random. Once the CNN is fully trained, it can be
used to identify new images that are related to the predefined classes.
Then we define the batch size as 128, the number of classes as 10 (the number of
digits from 0 to 9), the epochs to run the model as 12, and the input image
dimension as (28,28), since all of the corresponding images are 28 by 28 pixels:
batch_size = 128
num_classes = 10
epochs = 12
img_rows, img_cols = 28, 28
Next, we obtain the MNIST data and load it to variables, after splitting it into train
and test sets:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
It is time for some pre-processing—mostly reshaping the variables that hold the
data:
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
Then we convert the vectors that hold classes into binary class matrices:
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
After these steps, we are now ready to build our graph, using a sequential model.
We first add two convolutional layers on top of each other, then we apply the max-
pooling operation to the output of the second convolutional layer. Next, we apply
dropout. Before we feed the resulting output to the dense layer, we flatten our
variables, to comply with the input shapes of the dense layer. The output of this
dense layer is regulated with dropout; the resulting output is then fed into the last
dense layer for classification. The softmax function is used to turn the results into
something that can be interpreted in terms of probabilities. Here is the code snippet
of the model building part:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
We next compile our model using cross-entropy loss and the Adadelta optimization
algorithm. We use accuracy as the evaluation metric, as usual:
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
It is time to train our model on the training set that we separated from the original
MNIST dataset before. We just use the fit() function of the model object to train our
model:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
After 12 epochs, our model reaches 99% accuracy on the test set—a quite
satisfactory result for a simple model like this. This example demonstrates how
successful CNNs are in image classification tasks. Below is the output of all of
the code above:
Downloading data from https://s3.amazonaws.com/img-
datasets/mnist.npz
11493376/11490434 [==============================] - 15s
1us/step
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 105s 2ms/step
- loss: 0.2674 - acc: 0.9184 - val_loss: 0.0584 - val_acc:
0.9809
Epoch 2/12
60000/60000 [==============================] - 106s 2ms/step
- loss: 0.0893 - acc: 0.9734 - val_loss: 0.0444 - val_acc:
0.9863
Epoch 3/12
60000/60000 [==============================] - 108s 2ms/step
- loss: 0.0682 - acc: 0.9798 - val_loss: 0.0387 - val_acc:
0.9864
Epoch 4/12
60000/60000 [==============================] - 109s 2ms/step
- loss: 0.0565 - acc: 0.9835 - val_loss: 0.0365 - val_acc:
0.9889
Epoch 5/12
60000/60000 [==============================] - 110s 2ms/step
- loss: 0.0472 - acc: 0.9860 - val_loss: 0.0311 - val_acc:
0.9899
Epoch 6/12
60000/60000 [==============================] - 110s 2ms/step
- loss: 0.0418 - acc: 0.9878 - val_loss: 0.0343 - val_acc:
0.9893
Epoch 7/12
60000/60000 [==============================] - 109s 2ms/step
- loss: 0.0354 - acc: 0.9895 - val_loss: 0.0266 - val_acc:
0.9918
Epoch 8/12
60000/60000 [==============================] - 107s 2ms/step
- loss: 0.0341 - acc: 0.9897 - val_loss: 0.0306 - val_acc:
0.9910
Epoch 9/12
60000/60000 [==============================] - 102s 2ms/step
- loss: 0.0298 - acc: 0.9907 - val_loss: 0.0282 - val_acc:
0.9915
Epoch 10/12
60000/60000 [==============================] - 103s 2ms/step
- loss: 0.0290 - acc: 0.9911 - val_loss: 0.0273 - val_acc:
0.9915
Epoch 11/12
60000/60000 [==============================] - 108s 2ms/step
- loss: 0.0285 - acc: 0.9911 - val_loss: 0.0283 - val_acc:
0.9915
Epoch 12/12
60000/60000 [==============================] - 118s 2ms/step
- loss: 0.0253 - acc: 0.9920 - val_loss: 0.0249 - val_acc:
0.9918
Test loss: 0.024864526777043875
Test accuracy: 0.9918
RNN components
RNNs have "recurrent" in their name because they perform the same task for each
element of a sequence, with each output depending on the previous computations.
This takes the form of loops in an RNN's architecture, such as the one in Figure
14.
In theory, this architecture makes it possible for the network to consider an
unlimited number of previous states of the data. In practice, though, it usually
considers merely a few steps. This is enough to give the RNN system a sense of
"memory," enabling it to see each data point within the context of the other data
points preceding it.
Since the recurrent connections in an RNN are not always easy to depict or
comprehend (particularly when trying to analyze its data flow), we often “unfold”
them. This creates a more spread-out version of the same network, where the
temporal aspect of the data is more apparent, as you can see in Figure 15. This
process is sometimes referred to as “unrolling” or “unfolding”.
Data flow and functionality
The data in an RNN flows in loops, as the system gradually learns how each data
point correlates with some of the previous ones. In this context, the hidden nodes of
an RNN (which are often referred to as “states”) are basically the memory of the
system. As you would expect, these nodes have a non-linear activation function
such as ReLU or tanh. The final layer before the output, however, usually uses a
softmax activation, so as to approximate probabilities.
Contrary to a traditional DL system, which uses different weights at each layer, an
RNN shares the same parameters across all steps. This is because it is basically
performing the same task at every step, with the only difference being the inputs.
This significantly decreases the total number of parameters it must learn, making
the training phase significantly faster and computationally lighter.
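To make this weight sharing concrete, here is a minimal sketch (not code from the book) of an RNN forward pass in plain NumPy. The matrix names W_xh, W_hh, and W_hy are hypothetical; the key point is that the same three matrices are reused at every time step, while the hidden state h carries the "memory":

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Run a sequence through a simple RNN, reusing the SAME weight
    matrices at every step; the hidden state h acts as the memory."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)   # non-linear hidden activation
        outputs.append(softmax(W_hy @ h))  # softmax output, as probabilities
    return outputs, h
```

Note that a real framework would also learn these matrices via BPTT; this sketch only illustrates the shared-parameter forward pass.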
Training process
When training an RNN, we employ many of the same principles as with other DL
networks, with a key difference in the training algorithm. When calculating the
gradient of the error at each output node, RNNs demand an algorithm that accounts
for the number of steps traversed before reaching that node. This variant of
backpropagation is called Backpropagation Through Time (BPTT). Because the
gradient is unstable as it propagates through an RNN, BPTT is not good at helping
the RNN learn long-term dependencies among its data points. Fortunately, this
issue is resolved using a specialized architecture called the LSTM, which we'll
discuss in the next section.
RNN variants
When it comes to variants of RNNs, the ones that stand out are Bidirectional RNNs
(as well as their “deep” version), LSTMs, and GRUs.
RNNs in action
Here we provide an example of text classification using the LSTM variant of the
RNN. The code is implemented in Python and Keras. The dataset we use is from the
IMDB movie database and is available through the Keras datasets module. It
includes IMDB users' comments on movies and their associated sentiments. Our task
is to classify each comment as expressing positive or negative sentiment, a
binary classification task. The code below is taken from the official Keras
repository.25
First, we begin by importing the relevant libraries, as usual. Notice that we also
import the IMDB dataset from Keras’ datasets module:
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
Then, we set the maximum number of features to 20,000; the maximum number of
words in a text to 80; and batch size to 32:
max_features = 20000
maxlen = 80
batch_size = 32
Next, we load the dataset into some variables; it comes already split into train and test sets:
(x_train, y_train), (x_test, y_test) =
imdb.load_data(num_words=max_features)
We need to pad some text comments, as some of them are shorter than 80 words, and
our model only accepts inputs of the same length. In short, padding works by
adding a predefined value (zero, by default) to a sequence to bring it to the
desired length; by default, Keras pads (and truncates) at the beginning of each
sequence. The code below pads the sequences:
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
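For illustration only, the padding logic can be sketched in plain Python. This hypothetical pad_pre() helper mimics the default behaviour of pad_sequences, which pads and truncates at the start of each sequence:

```python
def pad_pre(seqs, maxlen, value=0):
    """Bring every sequence to exactly maxlen items: keep the last maxlen
    items of longer sequences, and pad shorter ones at the front."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                        # truncate from the front
        out.append([value] * (maxlen - len(s)) + s)  # pad at the front
    return out
```

For example, pad_pre([[1, 2], [1, 2, 3, 4]], 3) yields [[0, 1, 2], [2, 3, 4]].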
Now, we are all set to build our sequential model. First we add an embedding layer,
and then we add an LSTM. Last, we add the dense layer for classification:
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
After we build our model, we compile it. We use binary cross-entropy loss as our
loss function, the Adam algorithm as our optimizer, and accuracy as our
evaluation metric:
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
We then train the model with the fit() function, validating on the test set:
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=10,
validation_data=(x_test, y_test))
Last, we test the performance of our model using the test set:
score, acc = model.evaluate(x_test, y_test,
batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
The model achieves almost 82% accuracy on the test set after 10 epochs, which is a
satisfactory result for such a simple model. The output of the code above is:
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 [==============================] - 91s 4ms/step
- loss: 0.4555 - acc: 0.7836 - val_loss: 0.3905 - val_acc:
0.8279
Epoch 2/10
25000/25000 [==============================] - 90s 4ms/step
- loss: 0.2941 - acc: 0.8813 - val_loss: 0.3796 - val_acc:
0.8316
Epoch 3/10
25000/25000 [==============================] - 94s 4ms/step
- loss: 0.2140 - acc: 0.9178 - val_loss: 0.4177 - val_acc:
0.8311
Epoch 4/10
25000/25000 [==============================] - 96s 4ms/step
- loss: 0.1565 - acc: 0.9416 - val_loss: 0.4811 - val_acc:
0.8238
Epoch 5/10
25000/25000 [==============================] - 96s 4ms/step
- loss: 0.1076 - acc: 0.9614 - val_loss: 0.6152 - val_acc:
0.8150
Epoch 6/10
25000/25000 [==============================] - 92s 4ms/step
- loss: 0.0786 - acc: 0.9727 - val_loss: 0.7031 - val_acc:
0.8225
Epoch 7/10
25000/25000 [==============================] - 91s 4ms/step
- loss: 0.0513 - acc: 0.9831 - val_loss: 0.7056 - val_acc:
0.8166
Epoch 8/10
25000/25000 [==============================] - 91s 4ms/step
- loss: 0.0423 - acc: 0.9865 - val_loss: 0.8886 - val_acc:
0.8112
Epoch 9/10
25000/25000 [==============================] - 91s 4ms/step
- loss: 0.0314 - acc: 0.9895 - val_loss: 0.8625 - val_acc:
0.8140
Epoch 10/10
25000/25000 [==============================] - 91s 4ms/step
- loss: 0.0255 - acc: 0.9920 - val_loss: 1.0046 - val_acc:
0.8152
25000/25000 [==============================] - 14s
563us/step
Test score: 1.0045836124545335
Test accuracy: 0.8152
Before closing this chapter, we will bring to your attention several other
advanced DL models (e.g. GANs), contained in the appendices of this book. We
strongly encourage you to read those appendices to learn more about DL models.
After that, you can find many other useful resources to dig into the details of
DL models.
Summary
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs) are two of the most popular neural network architectures beyond
ANNs. CNNs are very good at image-related tasks like image recognition
and image captioning, while RNNs are quite efficient in sequential tasks like
machine translation.
CNNs achieve state-of-the-art performance on some computer vision
tasks, and have even surpassed human performance.
RNNs are very suitable for tasks that can be represented as sequences.
There are many variants of RNNs. The two most popular are LSTMs and
GRUs. LSTMs are comprised of four gates, while GRUs are comprised of two
gates. In this respect, GRUs are lightweight versions of LSTMs.
https://bit.ly/2qfAjPM.
http://bit.ly/2o175Da.
http://bit.ly/2o5B63Y.
http://bit.ly/2Hfxky2.
http://bit.ly/21HCl4h.
https://bit.ly/2FuKCog.
CHAPTER 10
The key strength of this setup is that the two optimizers work simultaneously, so
the whole system is quite fast. You can create an ensemble of more than two
optimizers, each with its own set of parameters, all sharing the same fitness
function—provided there are enough processing units to run them.
Note that if the optimizers in the ensemble have significantly different running
times, this will result in delays. This is because the ensemble wrapper function
must wait for all the workers to finish; it can never outpace its slowest
optimizer.
We can verify that we now have 4 processes (the main Julia process plus the
workers for the PSO systems in the ensemble) as follows:
nprocs()
Even though we only need 3 workers for this example, we strongly recommend
adding all the workers you will need for the entire project during this
initialization phase. Otherwise, you risk serious confusion among the functions
used by these workers.
Afterwards, we need to load a function called swi() to help us with sorting matrices,
available in a custom script called SortWithIndexes.jl:
include("SortWithIndexes.jl")
Next, we need to define the fitness function (let’s call it FF) for the optimizers, and
we’ll opt to minimize it:
@everywhere function FF(X::Array{Float64})
return y = X[1]^2 + abs(X[2]) + sqrt(abs(X[3]*X[4])) + 1.0
end
For this function to be accessible to all the workers, it must be preceded by the
@everywhere meta-command. The same prefix is added to the pso() function,
implementing the PSO algorithm for each available worker, as well as in the swi()
function for the matrix sorting process. If you bring additional workers onboard
later, they will not have access to the FF function—until you run the previous code
again. Similarly, if you plan to run different versions of an optimizer among these
workers, it is best to activate each worker and run its corresponding optimizer code
individually, without the @everywhere meta-command.
For the wrapper function driving the whole ensemble, we’ll use the following code.
It’s basically a meta-PSO function, encompassing all the optimizers in the ensemble,
through their corresponding workers:
function ppso(ff::Function, minimize::Bool = true, nv::Int64 = 4,
              ps::Int64 = 10*nv, ni::Int64 = 2000)
    np = nprocs()
    F = Any[ff for i = 1:np]   # one copy of the fitness function per process
    Z = pmap(pso, F)           # run a PSO instance on each process
    G = Array{Float64}(undef, np)
    for i = 1:np
        G[i] = Z[i][2]         # collect each member's best fitness value
    end
    if minimize
        ind = argmin(G)        # argmin/argmax replace indmin/indmax in Julia 0.7+
    else
        ind = argmax(G)
    end
    println(G)
    return Z[ind]
end
In this case we don’t need to use the @everywhere meta-command, since the function
doesn’t need to run on every single worker. Also, the println() command is not
entirely necessary, though it can be helpful if you’d like to see how the various
members of the ensemble perform.
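To show the idea outside Julia, here is a rough Python sketch of the same meta-optimizer pattern, using the same fitness function as FF above. The random_search() function is a hypothetical stand-in for a single pso() run, and the members run sequentially here rather than in parallel via pmap():

```python
import math
import random

def ff(x):
    # Same fitness function as FF in the text; its global minimum is 1.0
    return x[0] ** 2 + abs(x[1]) + math.sqrt(abs(x[2] * x[3])) + 1.0

def random_search(fitness, nv=4, iters=2000):
    """Hypothetical stand-in for one pso() run: a greedy stochastic
    search returning (best_solution, best_fitness)."""
    best_x = [random.uniform(-1, 1) for _ in range(nv)]
    best_f = fitness(best_x)
    for _ in range(iters):
        cand = [xi + random.gauss(0, 0.1) for xi in best_x]
        f = fitness(cand)
        if f < best_f:          # keep only improving moves
            best_x, best_f = cand, f
    return best_x, best_f

def ensemble(fitness, n_members=4, minimize=True):
    """Meta-optimizer mirroring ppso(): run every member, keep the best.
    (The Julia version runs the members in parallel via pmap().)"""
    results = [random_search(fitness) for _ in range(n_members)]
    pick = min if minimize else max
    return pick(results, key=lambda r: r[1])
```

The design choice mirrors ppso(): each member searches independently, and the wrapper simply selects the member with the best fitness.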
To run the PSO ensemble on the fitness function FF, just type:
ppso(FF, true, 4)
This ensemble’s solution is very close to the global optima of [0, 0, 0, R] and [0, 0, R,
0] (where R is any real number), having a fitness value of 1.0. The solution found in
this case is [-0.00527571, -1.66536e-6, 1.44652e-5, -3.29118e-6], corresponding to a
fitness value of about 1.00008. This is quite close to the actual optimal value,
and the whole process of finding this solution was not noticeably slower than
running a single PSO search.
Because of the default settings of the ppso() function, even if we typed ppso(FF), the
ensemble would still work. Also remember that to use this function for a
maximization problem, you must first set the second parameter to "false":
ppso(FF, false, 4)
This returns a solution very close to the global optimum, [-0.0103449, 6.77669e-8,
-7.33842e-7, -7.93017e-10], corresponding to the fitness value of about 1.0001. The
fitness values of the best solutions of the optimizers are: 1.01052, 1.00636, 1.0003,
and 1.00011, respectively. Although the final result is not as accurate as the previous
example’s, it is quite good overall. Considering that the PSO systems didn’t do so
well in this problem, the ensemble still managed to perform well, thanks to the
other two Firefly optimizers.
The co() function is versatile by design, in terms of the number of workers used. It
basically allocates half the available workers to one optimizer, and the remaining
workers to the other. You can experiment with it using different numbers of
workers, particularly if you have enough processing units at your disposal.
How optimization ensembles fit into the data science pipeline
At this point you may be wondering how all this optimization ensemble material is
relevant to the data science process. After all, you can get insights with more
conventional methods such as basic machine learning systems and statistical
models, or through the deep learning systems we examined in the first part of the
book. How do optimization ensembles improve upon these existing methodologies?
In short, most modern data science problems involve optimization in one way or
another, and conventional optimizers just won’t cut it. Older methods worked well
for the simple problems they were built for. Just a few decades ago, analysts (with
lots of patience and a knack for math) solved simple problems by hand. Data
science, though, is a huge realm with massive quantities of complex data,
involving many dimensions and variables.
What’s more, although the optimizers we studied may be able to handle these
problems, oftentimes we don’t have the time to fine-tune these systems. That’s
where optimization ensembles come in. From performing feature selection to
optimizing the boundaries of parameters in a 1-class SVM for anomaly detection,
optimization is essential to many data science projects.
More importantly, solving complex optimization problems efficiently through
ensembles will give you a better perspective on problem-solving in data science.
Many available methods are already optimized, but sometimes you may need to
break new ground, tackling a problem that hasn’t been tackled before (at least, not
in the open-source world). That’s where experience with ensembles (and other
topics just beyond the crash-course level) can come in handy, since not all problems
can or should be solved with off-the-shelf methods. The scope of a good data
scientist is no longer limited to traditional data science methods, expanding rapidly
to include AI methods like optimization ensembles.
Ensemble tips
The optimization ensembles covered in this chapter may seem straightforward, but
as usual, you must keep a few key things in mind. For example, you must ensure
that the optimizers are set up properly, by configuring the corresponding
parameters with values that make sense for the problem at hand. Otherwise, the
results may not be that useful, or a lot of computational resources will be
wasted (e.g. through delayed convergence).
Also, you are better off using optimizers you are quite comfortable with and that
you understand well. This way, you can obtain good results from such a system,
even if the optimizers you use are instances of the same algorithm. Besides, due
to their stochastic nature, they are bound to yield different solutions every
time you run them, so putting them in an ensemble can definitely have a positive
impact on the result of the optimization.
When building an optimization ensemble, pay attention to the computing power at
your disposal. Julia can create as many workers as requested, but if the number of
workers exceeds the number of processing threads available on your machine,
the ensemble will not be efficient. Some threads will need to take on more tasks
than others, leading to an imbalance of workload that is bound to delay the whole
system. Remember, the workers may correspond to individual CPUs or GPUs when
their number is small. If that number exceeds the total processing units accessible to
Julia, though, there is bound to be an issue, even if it doesn’t throw any errors or
exceptions.
What’s more, using an optimization ensemble does not guarantee that the solution
will be a global optimum. Sometimes, even a group of ensembles can become
trapped in local optima, yielding a sub-optimal solution at the end of the pipeline.
Still, the chances of that are smaller when using an ensemble than when using a
single optimizer.
Summary
Architectures of ELMs
The simplest Extreme Learning Machine is a feed-forward network with a single
hidden layer; it’s actually quite similar to a single-layer perceptron model.
However, if we use different ELMs for each layer, it is easy to extend this single-
hidden-layer architecture to multi-hidden-layer feed forward networks.
Let’s start with a single-hidden-layer feed forward architecture. Figure 18 shows the
basic architecture of an ELM, with an input layer, a hidden layer, and an output
layer. As mentioned earlier, the architecture is comparable to that of a single-layer
feed forward network. The difference is that when we optimize single-layer ELM,
we do not tune the weights of the hidden layer. We just tune the output layer’s
weights.
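As a minimal sketch of this idea (not code from the book), the following fits a single-hidden-layer ELM for regression in NumPy: the hidden-layer weights are drawn at random and never tuned, while the output weights are the only fitted parameters, obtained here via ordinary least squares:

```python
import numpy as np

def elm_train(X, y, n_hidden=100, seed=0):
    """Single-hidden-layer ELM sketch: hidden weights are random and are
    never tuned; only the output weights beta are fitted."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input-to-hidden weights
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # tune only the output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass: fixed random hidden layer, then the fitted output layer."""
    return np.tanh(X @ W + b) @ beta
```

Because training reduces to a single least-squares solve, an ELM trains far faster than a backpropagation-based network of the same size.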
The building blocks of multi-hidden-layer feed forward ELMs are simply single-
hidden-layer feed forward ELMs. In this case, we stack different single-layer
elements on top of one another. In effect, each hidden layer is just another ELM.
This essentially makes a multi-layer ELM an ensemble of ELMs, though a bit more
sophisticated than those we saw in the previous chapter.
Figure 19 demonstrates a multi-hidden-layer feed forward ELM as a combination of
two single layer ELMs. In this multi-hidden-layer architecture, consider each
hidden layer representative of an operation or role. For example, each single ELM
layer could deal with compression, feature learning, clustering, regression, or
classification, which are the basic operations or roles of any ELM.
In another alternative architecture for multi-hidden-layer ELMs, a hidden node
itself can be a sub-network formed by several nodes. As a result, some local parts of
a single ELM can contain multi-hidden layers.
Again, the basic tenet of ELMs is that hidden layers should not be tuned. Although
randomly assigning values to the weights of hidden layers is an option, it’s not the
only available option. Alternatively, a hidden node in any given layer might be a
linear transformation of the nodes in the previous layer. As a result, some nodes are
randomly generated and some are not, but none of them are tuned.29 One last thing
to note before closing the discussion of ELMs is that ELMs can be combined with
other learning models. The output of the other models can be fed into the ELMs, or
the output of the ELMs can be fed into other models as inputs. This is referred to as
stacked models, and is another form of ensembles.30
Architecture of CapsNets
CapsNet includes six layers. The first three form the encoder, and the last three
form the decoder:
1. Convolutional layer
2. PrimaryCaps layer
3. DigitCaps layer
4. Fully connected layer
5. Fully connected layer
6. Fully connected layer
Fuzzy sets
At the core of FL is the concept of “fuzzy sets”, which are entities that don’t have
crisp boundaries and therefore lend themselves to a potentially infinite number of
grades. Just as our eyes cannot identify one exact point where the color green ends
and the color yellow begins, a fuzzy set A has different levels of membership for the
various data points that may have A as an attribute.
For instance, if you want to know how far away a place is, you could represent its
distance as a series of fuzzy sets. Each fuzzy set would correspond to a certain
threshold or level of distance (like walking distance, a bus ride, or a day-trip; or
the more abstract terms close, medium, and far). Each one of these levels may be
expressed as a mathematical function (represented by the letter µ), referring to
different parts of the distance variable (e.g. far). These functions almost always
overlap in some way, and are often triangular or trapezoidal in shape, for lower
computational cost (see Figure 24).
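A triangular membership function like those in Figure 24 takes only a few lines of code. The sketch below is illustrative; the breakpoints for the hypothetical distance levels (in kilometres) are invented for the example:

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 outside [a, c], rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical, deliberately overlapping distance levels (in km)
def close(d):  return tri(d, -1, 0, 2)
def medium(d): return tri(d, 1, 3, 6)
def far(d):    return tri(d, 4, 10, 16)
```

Note that a distance of 1.5 km, for instance, has non-zero membership in both close and medium, which is exactly the kind of overlap the text describes.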
It is important to note that these levels are often selected in a way that makes sense
to us, since they tend to correspond to how we express these quantities in our own
natural language. The inherent ambiguity of these expressions is perfectly
acceptable, however, since it captures the uncertainty that is always involved in
these scenarios (e.g. “far” may mean different things to different people). This
uncertainty is modeled through the values of the membership functions of these
levels.
The values for each means of transport are derived from the rule application
process, which involves predefined mathematical operators like min, mean, or max,
that apply to the membership functions for the variables involved.
Combining the above values into a "crisp" number (i.e. one that has no
membership values attached to it) for each transport option is done through
defuzzification. This is usually performed using a method like "moment of area,"
which corresponds to the center point of the polygon derived from the merged
membership functions. So, we may end up with something like the following:
From this, we can predict with reasonable confidence that, in this particular
situation, it is best to walk. Note, however, that for a different city, the mappings
would be different, and the results would likely differ too.
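The "moment of area" (center-of-area) defuzzification step can be sketched as a simple numerical centroid over the merged membership function. The function below is an illustrative approximation, not the book's implementation:

```python
def centroid_defuzzify(mu, lo, hi, n=1000):
    """Return the crisp value: the x-coordinate of the centroid of the
    area under the merged membership function mu on [lo, hi]."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    ms = [mu(x) for x in xs]
    total = sum(ms)
    if total == 0:
        return (lo + hi) / 2  # no membership anywhere: fall back to midpoint
    return sum(x * m for x, m in zip(xs, ms)) / total
```

For a symmetric triangular membership function, the crisp value it returns is simply the location of the peak, as you would expect from the geometry.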
Fortunately, plenty of packages (in most data science languages) enable the use of
FL and FIS in various applications. Below are some of the most interesting ones.
Python
scikit-fuzzy: an extension of the well-known scikit-learn package, focusing on the
core FL operations. More at https://pypi.org/project/scikit-fuzzy.
fuzzywuzzy: an interesting package focusing on string matching using FL. More at
https://github.com/seatgeek/fuzzywuzzy.
pyfuzzy: an older package covering FL operations. More at
http://pyfuzzy.sourceforge.net/.
Peach: possibly the most relevant package out there on this topic, covering various
AI systems used in data science as predictive analytics systems. More at
https://code.google.com/archive/p/peach/.
Julia
Fuzzy.jl: a compact package containing various tools for both basic FL as well as
FIS. This is an unofficial package that's still under development. More at
https://github.com/phelipe/Fuzzy.jl.
Summary
http://en.wikipedia.org/wiki/Perceptron.
Minsky M, Papert S. Perceptrons: an introduction to computational geometry. Cambridge: MIT
Press; 1969.
Sosulski DL, Bloom ML, Cutforth T, Axel R, Datta SR. Distinct representations of olfactory
information in different cortical centres. Nature. 2011;472:213–6; Eliasmith C, Stewart TC, Choo X,
Bekolay T, DeWolf T, Tang Y, Rasmussen D. A large-scale model of the functioning brain. Science.
2012;338:1202–5; Barak O, Rigotti M, Fusi S. The sparseness of mixed selectivity neurons controls
the generalization–discrimination trade-off. J Neurosci. 2013;33(9):3844–56; Rigotti M, Barak O,
Warden MR, Wang X-J, Daw ND, Miller EK, Fusi S. The importance of mixed selectivity in complex
cognitive tasks. Nature. 2013;497:585–90.
Moreover, another kind of randomness can also be integrated into the ELMs. In this case, input
layers to hidden nodes can be fully or partially randomly connected according to different
continuous probability distribution functions.
A great resource for the properties of the ELMs can be found here: https://bit.ly/2JoHhcg.
https://bit.ly/2PP3NQ0.
Other minor differences between CNNs and CapsNets exist, but are not as significant. Consult
Footnote 22 for a link with further reading.
https://arxiv.org/abs/1710.09829.
CHAPTER 12
Next Steps
We’ve covered many things in this book. Most of the book is dedicated to the
popular and useful approaches that have broad applicability in today’s AI
challenges. We discussed basic deep learning concepts and models, as well as
several programming libraries that prove to be very convenient in implementing
deep learning models. The basic optimization algorithm that is mostly used in deep
learning today is backpropagation, and we’ve provided several examples that use
this optimization method.
However, modern AI tasks are vast in scope, and backpropagation is only
applicable to functions that are differentiable. Prominent AI researcher François
Chollet put it this way:34
“Backprop is a centralized answer, that only applies to differentiable functions
where “control” means adjusting some parameters and where the optimization
objective is already known. Its range of applicability is minuscule.”
With this in mind, we provided other optimization algorithms (such as Particle
Swarm Optimization, Genetic Algorithms, and Simulated Annealing) to cover the
optimization possibilities of many of the AI tasks, as well as several other
applications in both the industry and in scientific research. And finally, we
presented a general picture of alternative AI frameworks, so that you can have a
complete grasp of artificial intelligence in today’s data science world.
In this final chapter of the book, we’ll cover the “next steps” required to advance
beyond this book and enhance your understanding of the data science domain.
First, we briefly discuss big data, which has become essentially inescapable. After
that, we outline some of the prominent specialization areas of data science today.
We hope that this discussion will provide you with some perspectives regarding the
broad areas that can be involved in your future projects. Finally, we list some
useful, publicly-available datasets that you can use to practice and to research.
Big data
The term "Big Data" is probably one of the most commonly used terms in computer
science today. Although the definition of the term is somewhat blurry (as the name
includes the very subjective term "big"), we can succinctly define it as "the
amount of data that is big enough so that an average personal computer is unable
to process it." By this definition, the term "big" refers to terabytes, petabytes,
or even exabytes of data (to give you a sense of proportion, if you gathered all
the data a person would normally access throughout his or her lifetime, it's
unlikely to exceed 1 exabyte).
To further narrow the definition, consider the four Vs described by IBM.35 Big data is
constantly increasing in volume, arriving in growing velocity, sourced by ever-
increasing variety and with uncertainty around veracity. These so-called four Vs—
volume, velocity, variety, and veracity—are among the most important characteristics
of big data in today’s highly-connected digital world.
Why is big data important? The answer to this question lies within just two
statements:
The successes of the deep learning algorithms that we covered in the previous
chapters are also related to big data. The two main reasons that deep learning has
been so successful are the advances in the computational power of modern
computers, and the huge amount of data that is available to use.
To work with huge amounts of data, two recent technologies are quite important;
we dedicate the rest of the big data discussion to these technologies. One of them
is Hadoop and the other is Spark, both open-source projects of the Apache
Software Foundation. We encourage you to learn more about them, as they are
especially useful in tackling huge amounts of data.
Hadoop
If we want to store data just for backup purposes, then the amount of data may not
be a first-order concern for a data scientist. More often than not, though, we want to
analyze that data, gleaning useful insights from the information that is hidden
there. A distributed data storage technology called Hadoop was developed just to
meet this need.
Hadoop (usually referred to as “Hadoop cluster”) has four key components:
Commons, YARN, HDFS, and MapReduce. Commons is Hadoop’s utilities
structure, and YARN is a tool for resourcing and scheduling in Hadoop. HDFS is
the abbreviation for Hadoop Distributed File System; this is where the data is
actually stored. MapReduce is a data processing tool for distributed systems. Figure
26 illustrates the components of Hadoop in a simple way.
As a data scientist, you will probably use a larger set of tools than just the
above four. These tools make your life easier, abstracting away the architectural
and administrative aspects of working with a Hadoop cluster. What a data
scientist usually wants from data stored on a Hadoop cluster is to query it as if
it resided in a relational database on a single machine, rather than to
administer the cluster itself.
To this end, you can use Hive and its querying language HiveQL. Using HiveQL, one
can forget about the distributed nature of the data and write queries as if all the
data is on a single machine. Due to the huge volumes of data on HDFS, though, the
HiveQL queries usually take more time than traditional SQL queries! Other tools
(like Presto) provide faster access to the data on HDFS, but they are usually more
computationally expensive, and their requirements are more demanding.
Apache Spark
So far we’ve covered how to store and access big data in a clustered and distributed
manner. Since the ultimate objective of a data scientist is to build analytical models
upon that data, we need some other technology to conveniently work with the data
stored on HDFS.
Apache Spark is one of the most commonly-used technologies that provides
distributed computing. Spark simplifies some common tasks including resource
scheduling, job execution, and reliable communication between nodes of a cluster.
As of 2018, Spark supports five programming languages: Scala, Java, Python, R,
and SQL.
Spark uses a data representation called the Resilient Distributed Dataset (RDD),
which is very similar to the dataframes in Python, Julia, and R. RDD supports
distributed computing by working on multiple machines, where it uses the
memories of all these distributed computers in a cluster. Once an RDD is in place,
you can use Spark to interact with this distributed data as if it were in the
memory of a single machine. This way, Apache Spark hides the distributed nature
of the computing infrastructure from the user, which is very convenient when
working on a data science task. Figure 27 depicts the basic architecture of
Apache Spark.
The Python example below illustrates how a few lines of code in Apache Spark can
easily count words in a document stored on HDFS:
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
The first line reads distributed data from an HDFS. The flatMap() function separates
each word in the text file. The map() function combines each word with 1 as a key-
value tuple. The reduceByKey() function adds up the numbers for the same keys
(which are words, in this example). The last line just saves the results—that’s it!
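For readers unfamiliar with the map/reduce idiom, here is what that pipeline computes, written as plain single-machine Python (a sketch for intuition only, not the Spark API):

```python
from collections import Counter

def word_count(lines):
    """Single-machine equivalent of the Spark job: split each line into
    words (flatMap), then count occurrences per word (map + reduceByKey)."""
    counts = Counter()
    for line in lines:
        for word in line.split(" "):  # same naive space-splitting as the Spark code
            counts[word] += 1
    return dict(counts)
```

For example, word_count(["to be or", "not to be"]) returns {'to': 2, 'be': 2, 'or': 1, 'not': 1}. The Spark version performs exactly this computation, but distributed across the cluster.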
Specializations in data science
Lately data science has become a broad discipline, so specializing in a specific
domain is usually a necessity rather than a luxury. Here we discuss some different
niche areas within data science, in hopes of helping you decide which area you are
most interested in.
Data Engineering
Earlier in this chapter, we talked about big data and some related technologies. We
pointed out that large amounts of data require special treatment, in terms of storing
and computing environments. Today, managing huge amounts of data is a time-
consuming task that demands a specialization related to systems administration
and database management.
Data engineers are responsible for gathering, storing, and securing data. They are
also the ones who manage the infrastructure of a data science or machine learning
pipeline within a company or organization. In this respect, data engineering is
related more to system administration than to data science. However, for those of
you with interest in working with these aspects of technology, data engineering
positions might be the most relevant.
Computer Vision
Computer vision is another AI domain that aims to make machines understand the
visual aspects of the world we all live in. The models and methods used in
computer vision process the digital images or videos that are provided to them.
Image recognition, object detection, and object tracking are only a small subset of
the tasks tackled by computer vision today.
Some fantastic applications of computer vision expertise are motion recognition,
autonomous cars, robotics, image restoration, and image generation. Computer
vision also has a lot to offer in Augmented Reality applications, where virtual
objects are overlaid on real-world views.
Because of all that, we believe that computer vision will continue to be one of the
most promising areas of study within artificial intelligence and data science.
Internet of Things
This term (often abbreviated IoT) refers to the fact that in today’s world, billions of
gadgets and all kinds of machines are constantly connected to the Internet (chances
are that you are reading an electronic version of this book through one of them!). As
computers go from small to tiny (e.g. Raspberry Pi), they can fit into more and more
commonplace devices, leading to an explosion of affordable “smart” gadgets. The
data that all these connected devices generate is tremendous, and provides a very
vast playing field for data science. The big data infrastructure required to store and
process this huge quantity of data necessitates big data expertise.
Furthermore, analyzing this much data is also a special focus of data science.
Anomaly detection and predictive maintenance are just two examples of the
applications of data science within the IoT. With the continuing increase of the
number of devices that are connected to each other and to the Internet, IoT will
become one of the core areas of study in data science.
Social Sciences
This field is too large to begin to dissect in detail, but it is clear that analyzing
people (their habits, their spending, or their relationships, for instance) on a large
scale demands sophisticated methods. Time series methods in particular have huge
applicability in social sciences, especially given modern advances in computational
power and the availability of data.
Obviously, this is not an exhaustive list of niches within data science. These are
simply highlights that are relatively popular nowadays. Furthermore, many
interdisciplinary areas (such as robotics, for instance) can include different
components of artificial intelligence (like computer vision and NLP). Consider, for
example, how Augmented Reality combines computer vision with many other
processes as we briefly saw earlier.
It takes a lot of practice to become an expert in just about anything, including any
one of these data science sub-domains. Of course, you won’t be able to practice
without access to useful tools and datasets. To this end, in the next section, we
provide some resources you may find useful.
https://bit.ly/2wsdhIC.
We want to warn our readers that they will find all sorts of additional “Vs” in the literature and on
the Internet. Here we use IBM’s definition as it is brief and concise. Source: https://ibm.co/2AFln5v.
https://www.kaggle.com/.
Closing Thoughts
Our hope is that you have received a very broad yet practical perspective on AI
from this book. Since this field changes so rapidly, with advances in AI coming
from both older and newer players around the world, keep your eyes out for the
upcoming second edition!
What will not require any updates for a second edition, however, is the general
framework of thinking about data science and AI from a practitioner’s lens. After
all, no matter how much the tools change, core methodologies are bound to remain
relevant. We made a deliberate effort to focus on principles and approaches,
without getting lost in the hype that surrounds this field.
We hope that you will use this book as a basis to further your understanding and
aptitude in AI, so that you make good use of this promising technology. You don’t
need to be an AI expert to benefit from AI systems, especially when it comes to
data science related projects. However, if you really want to maximize your benefit,
we recommend dedicating at least 6 months of your time to serious study of this
field. No book can replace a hands-on course with individualized instruction; such
classes can be found at universities or other specialized learning institutions. After
all, AI for data science is a field that involves problem-solving and creativity, rather
than the mechanical application of some “recipe” geared towards a particular
programming language.
Speaking of programming, with the rapid growth of functional programming
languages these days, we thought it would be useful to expose you to both a
conventional OOP language (like Python) and a newer one (like Julia), which is
geared more towards the functional paradigm. This way, whatever happens with
the constantly changing programming landscape in the AI field, you’ll be ready.
Thank you for taking the time to read this book. We hope you have found it useful
and we look forward to contributing to this field through additional materials on its
various relevant topics. We believe that curiosity is the main driver of the successes
of the AI and data science today. If this book has provoked your curiosity about this
field, then we count ourselves successful.
Glossary
A
Activation function: see transfer function.
Agents: also known as bots or intelligent agents, they are autonomous software
programs that aim to “sense” through sensors and “act” through actuators,
according to their target function. Often they are leveraged as a way to mimic
human behavior as assistants in a variety of functions. There are five classes of
agents according to Russell & Norvig: simple reflex agents, model-based reflex
agents, goal-based agents, utility-based agents, and learning agents.
Algorithm: a step-by-step procedure for calculations and logical operations. In an
AI setting, algorithms can be designed to facilitate machine learning and acquire
knowledge by themselves, rather than relying on hard-coded rules.
Amalgamation: the process of putting a system into a single file that can be ported
to different computer platforms. The executable file it yields has a very limited
number of dependencies (sometimes it’s even self-sufficient), making it a flexible
option for deploying a system.
Apache Spark: a unified analytics engine for large scale data processing. It is
designed for distributed computing and supports distributed data storage like
HDFS.
API (Application Programming Interface): a set of definitions, protocols, tools, and
routines, for interacting with a program, usually via the Internet. APIs are essential
for linking programming languages to various frameworks like deep learning ones.
Architecture (of a Deep Learning system): the way the neurons of an artificial neural
network are organized and the connections among them.
Artificial creativity: a novel methodology of AI whereby the AI system emulates
human creativity in a variety of domains, including painting, poetry, music
composition, and even problem-solving. Artificial creativity has an important role
in data science also.
Artificial Intelligence (AI): a field of computer science dealing with the emulation
of human intelligence using computer systems and its applications on a variety of
domains. AI’s application to data science has been noteworthy and an important
factor in the field since the 2000s.
Artificial Neural Network (ANN): a graph-based artificial intelligence system,
implementing the universal approximator idea. Although ANNs have started as a
machine learning system, focusing on predictive analytics, they have expanded over
the years to include a large variety of tasks. ANNs comprise a series of nodes
called neurons, which are organized in layers. The first layer corresponds to all the
inputs, the final layer to all the outputs, and the intermediary layers to a series of
meta-features the ANN creates, each having a corresponding weight. ANNs are
stochastic in nature, so each time they are trained on the same set of data, the
resulting weights are noticeably different.
Augmented Reality (AR): a sub-field / application of artificial intelligence that
combines real-world data, such as a video feed of what’s in front of you, with a
computer-generated data stream related to what is observed, in a single video
feed. The augmented view can have an informative role or increase functionality in
some other way. AR requires specialized hardware as well as software combining
computer vision and other processes.
Autoencoder: an artificial neural network system designed to represent codings in a
very efficient manner. Autoencoders are a popular artificial intelligence system that
are used for dimensionality reduction, as well as a few other unsupervised learning
applications.
Automated Machine Learning (AutoML): Google’s AI project responsible for
creating an AI that designs and implements its own AI, for computer vision
purposes.
B
Backpropagation: a very popular training algorithm for deep learning systems and
artificial neural networks in general. It involves moving the errors of the network
backwards (towards the inputs), and changing the weights of the various neurons
based on the partial derivatives of the error function, as well as their location on the
network.
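As a minimal illustration of the idea (not the full multi-layer algorithm), here is a Python sketch of one gradient step for a single sigmoid neuron with a squared-error function; the names, learning rate, and training values are illustrative choices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(w, b, x, target, lr=0.5):
    """One gradient step for a single sigmoid neuron with squared error."""
    # Forward pass
    y = sigmoid(w * x + b)
    # Partial derivatives of the error, propagated backwards (chain rule)
    dE_dy = y - target     # derivative of 0.5 * (y - target)^2 w.r.t. y
    dy_dz = y * (1 - y)    # derivative of the sigmoid
    dE_dw = dE_dy * dy_dz * x
    dE_db = dE_dy * dy_dz
    # Move the weight and bias against the gradient
    return w - lr * dE_dw, b - lr * dE_db

w, b = 0.5, 0.0
for _ in range(100):
    w, b = backprop_step(w, b, x=1.0, target=1.0)
print(sigmoid(w * 1.0 + b))  # the output moves toward the target of 1.0
```

In a real network the same chain-rule logic is applied layer by layer, from the outputs back toward the inputs.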
Big data: an area of computer science that is interested in the efficient processing
and storage of very large amounts of data. Although the definition of the term
varies from person to person, big data can be succinctly described as any amount of
data big enough that an average personal computer is unable to process it.
Binarization: the data engineering process of turning a discrete variable into a
binary feature.
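A minimal Python sketch of this process, assuming a one-binary-feature-per-category (one-hot) encoding; the function name is illustrative:

```python
def binarize(values):
    """Turn a discrete variable into one binary feature per category."""
    categories = sorted(set(values))
    # Each row has a 1 in the column of its category, 0 elsewhere
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

cats, rows = binarize(["red", "green", "red"])
print(cats)  # ['green', 'red']
print(rows)  # [[0, 1], [1, 0], [0, 1]]
```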
C
Chatbot: an artificial intelligence system that emulates a person on a chat
application. A chatbot takes text as its input, processes it in an efficient manner,
and yields a reply in text format. A chatbot may also carry out simple tasks, based
on its inputs and it can reply with a question in order to clarify the objective
involved.
Chromosome: a potential solution to a problem, modeled using Genetic Algorithms.
Classification: a very popular data science methodology, under the predictive
analytics umbrella. Classification aims to solve the problem of assigning a label (aka
class) to a data point, based on pre-existing knowledge of categorized data,
available in the training set.
Classifier: a predictive analytics system geared towards classification problems.
Cloud (computing): a model that enables easy, on-demand access to a network of
shareable computing resources that can be configured and customized to the
application at hand. The cloud is a very popular resource in large-scale data analytics
and a common resource for data science applications.
Clustering: a data science methodology involving finding groups in a given dataset,
usually using the distances among the data points as a similarity metric.
Cognitive computing (CC): a set of processes and methods that involves self-
learning systems that use data mining, pattern recognition, natural language
processing and speech recognition to mimic the way the human brain works. CC
can be viewed as a special kind of artificial intelligence.
Computational Intelligence (CI): a subclass of artificial intelligence, geared
towards computational problems, such as optimization.
Computer cluster: a collection of computers sharing resources and working
together, usually as a single machine. Computer clusters are very useful for tackling
big data problems in-house, though more often than not they are found in data centers,
forming public computer clouds.
Computer Vision: an application of artificial intelligence, where a computer is able to
discern a variety of visual inputs and effectively “see” a lot of different real-world
objects in real-time. Computer vision is an essential component of all modern
robotics systems.
Context: a characteristic of an NDArray, whereby the data is assigned to a particular
processing unit (a CPU or a GPU), to better utilize the available computing resources.
Convolutional Neural Networks (CNNs or ConvNets): a type of deep learning ANN
that involves a series of specialized layers when processing the (usually high-
dimensionality) data, for various predictive analytics applications (mainly
classification). CNNs are closely tied to computer vision and Natural Language
Processing.
Crossover: a process in the Genetic Algorithms framework, whereby two chromosomes
merge, resulting in a new pair of chromosomes that are candidates for the next
generation.
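A minimal Python sketch of single-point crossover on bit-string chromosomes (the function name and signature are illustrative):

```python
import random

def crossover(parent_a, parent_b, point=None):
    """Single-point crossover: swap the tails of two chromosomes."""
    if point is None:
        # Pick a random cut point strictly inside the chromosome
        point = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

print(crossover([1, 1, 1, 1], [0, 0, 0, 0], point=2))
# ([1, 1, 0, 0], [0, 0, 1, 1])
```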
D
Data analytics: a general term to describe the field involving data analysis as its
main component. Data analytics is more general than data science, although the two
terms are often used interchangeably.
Data engineering: the part of the data science pipeline where data is acquired,
cleaned, and processed, so that it is ready to be used in a data model. Most artificial
intelligence systems handle a large part of the data engineering once they are given
the data that we want them to model.
Data exploration: the part of the data science pipeline where the various variables are
examined using statistics and data visualization, in order to understand the data better
and work out the best ways to tackle it in the stages that follow.
Data model: a data science module processing and/or predicting some piece of
information, using existing data, after the latter has been pre-processed and made
ready for this task. Data models add value and are comprised of non-trivial
procedures. In AI, data models are usually sophisticated systems making use of
several data-driven processes under the hood.
Data science: the interdisciplinary field undertaking data analytics work on all
kinds of data, with a focus on big data, for the purpose of mining insights and/or
building data products.
Data visualization: the process of creating visuals based on the original data, or the
data stemming from the data model built using the original data.
Dataset: the data available to be used in a data analytics project, in the form of a table
or a matrix. A dataset may need some work before it is ready to be used in a
data model, though in many artificial intelligence models, you can use it as is.
Data structure: a collection of data points in a structured form, used in
programming as well as various parts of the data science pipeline.
Deep belief network (DBN): several Restricted Boltzmann Machines stacked together
in a deep learning network fashion.
Deep Learning (DL): an artificial intelligence methodology, employing large artificial
neural networks, to tackle very complex problems. DL systems require a lot of data in
order to yield a real advantage in terms of performance.
Dimensionality reduction: the process of reducing the number of features in a
dataset, usually through the merging of the original features in a more compact form
(feature fusion), or through the discarding of the less information-rich features
(feature selection). Dimensionality reduction can be accomplished in many ways,
usually using some specialized artificial intelligence systems such as autoencoders and
Restricted Boltzmann Machines.
Docker: a container software geared towards creating programming environments
on a computer, containing all the required programs and data, so that an
application can run on that computer smoothly, even if it was developed on a
machine with a completely different configuration.
E
Elitism: an aspect of Genetic Algorithms, according to which, the best performing
chromosome or chromosomes are preserved as they are, for the next generation.
Ensemble: “The process by which multiple models, such as classifiers or experts,
are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the
(classification, prediction, function approximation, etc.) performance of a model, or
reduce the likelihood of an unfortunate selection of a poor one.” (Dr. Robi Polikar).
Ensembles may also involve AI systems, such as optimizers, in order to attain a
better performance than a single such system.
Embedding: a low-dimensional representation of a given set of data. Embeddings
are quite common in deep learning systems, particularly in autoencoders and in
representing words in Natural Language Processing tasks.
Epoch: an iteration in the training phase of an Artificial Neural Network.
Error function: the function used for assessing the deviation of predicted values of a
machine learning model from the actual values (target variable). In artificial neural
network models, the error function needs to be continuous.
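As a minimal illustration, here is one common continuous error function, the mean squared error, sketched in Python (the function name is illustrative):

```python
def mean_squared_error(predicted, actual):
    """Average squared deviation of the predictions from the actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # (0 + 4) / 2 = 2.0
```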
ETL (Extract, Transform and Load): a process in all data related pipelines, having
to do with pulling data out of the source systems (usually databases) and placing it
into a data warehouse or a data governance system. ETL is an important part of
data acquisition, preceding any data modeling efforts.
Extreme Learning Machines (ELMs): a relatively new type of artificial neural
networks that are very fast to train and exhibit decent performance in predictive
analytics problems. Their key characteristic is that most of the connections have
random weights, apart from those of the last layer (outputs), which are optimized
during the training process.
F
Feature: a processed variable capable of being used in a data science model. Features
are generally the columns of a dataset.
Feature engineering: the process of creating new features, either directly from the
data available, or via the processing of existing features. Feature engineering is part
of data engineering, in the data science process.
Feature fusion: see fusion.
Feature selection: the data science process according to which the dimensionality of
a dataset is reduced through the selection of the most promising features and the
discarding of the less promising ones. How promising a feature is depends on how
well it can help predict the target variable and is related to how information-rich it
is.
Feed-forward network: see Multi-Layer Perceptron.
Filter: a process in convolutional neural networks whereby features are created
from an image by scanning it through a moving window (e.g. a 3x3 matrix).
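A minimal Python sketch of this moving-window operation, as a plain 2-D convolution over the valid positions only (no padding or stride options; names are illustrative):

```python
def apply_filter(image, kernel):
    """Slide a small kernel over a 2-D image, one valid position at a time."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            # Element-wise multiply the window by the kernel and sum the result
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each cell with its lower-right neighbor
print(apply_filter(image, kernel))  # [[6, 8], [12, 14]]
```

In a CNN the kernel values are not fixed like this; they are weights learned during training.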
Fitness function: an essential part of most artificial intelligence systems, particularly
optimization related ones. It depicts how close the system is getting to the desired
outcome and helps it adjust its course accordingly. In most AI systems the fitness
function represents an error or some form of cost, which needs to be minimized,
though in the general case it can be anything and depending on the problem, it may
need to be maximized.
Framework: a set of tools and processes for developing a certain system, testing it,
and deploying it. Most AI systems today are created using a framework. A
framework is usually accompanied by a library/package in the programming
languages it supports. In the deep learning case, for example, a framework can be a
programming suite like MXNet, that enables a variety of DL related processes and
classes to be utilized.
Fusion: usually used in conjunction with feature (feature fusion), this relates to the
merging of a set of features into a single meta-feature that encapsulates all, or at
least most, of the information in these features. This is a popular method of
dimensionality reduction and it is an integral part of every deep learning system.
Fuzzy Inference System (FIS): an AI system based on Fuzzy Logic, geared towards
making predictions using inference rules. A FIS is quite useful particularly when
interpretability is a concern. However, it is limited to lower dimensionality datasets.
Fuzzy Logic: a term coined by Lotfi Aliasker Zadeh in 1965, referring to a different
way of processing information which, unlike classical logic, also involves partial
truths (instead of just the conventional black-and-white logical paradigm). Fuzzy
logic uses degrees of truth as a mathematical model of vagueness and allows for all
intermediate possibilities between digital values of YES and NO, much like how a
human will assess the nature of a situation in full color and multi-polar fashion,
rather than a bi-polar, monochrome way. Fuzzy logic is a well-established part of
artificial intelligence.
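As a minimal illustration, here is a Python sketch of a membership function expressing such partial truths, using an illustrative “warm temperature” ramp (the name and thresholds are arbitrary choices):

```python
def warm_membership(temp_c, low=15.0, high=25.0):
    """Degree (0..1) to which a temperature counts as 'warm': a simple ramp."""
    if temp_c <= low:
        return 0.0  # definitely not warm
    if temp_c >= high:
        return 1.0  # fully warm
    # Partial truth: linearly interpolate between the two thresholds
    return (temp_c - low) / (high - low)

print(warm_membership(10))  # 0.0
print(warm_membership(20))  # 0.5 - partially warm
print(warm_membership(30))  # 1.0
```

Classical logic would force the answer to be 0 or 1; the intermediate values are what make the logic “fuzzy.”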
G
Gene: an element of a chromosome, in a Genetic Algorithms optimization model. Genes
are coded as bits and they represent a characteristic, referred to as a trait.
Generalization: a key characteristic of a data science model, where the system is able
to handle data beyond its training set in a reliable way. A proxy to good
generalization is similar performance between the training set and a testing set, as
well as consistency among different training-testing set partitions of the whole
dataset.
Generation: an iteration in the Genetic Algorithms framework.
Generative Adversarial Networks (GANs): a family of AI systems, each of which
comprises two ANNs, one geared towards learning and one geared towards
“breaking” the first ANN, by creating data that makes the first ANN get its
predictions wrong. As a GAN is trained, each one of its components becomes better
at its task, resulting in a highly effective AI system that can yield good
generalization, even with a limited amount of data.
Genetic Algorithms (GAs): a family of optimization algorithms resembling the
process of gene selection, combining, and mutation. They are well-suited for
problems involving discrete variables, though they can be applied to continuous
variables also. GAs are a well-established artificial intelligence methodology that
finds many applications in data science, such as feature selection.
Genetic Programming (GP): an artificial intelligence methodology based on genetic
algorithms, but geared towards finding an optimal mathematical function, to
approximate a mapping of a particular variable.
Genome: the set of all the chromosome data in a Genetic Algorithms system. Genome
and population are often used interchangeably in this context.
Gluon: a component of the MXNet deep learning framework, acting as an interface
for the framework’s various functions.
GPU (Graphics Processing Unit): a specialized component of a computer’s
hardware, designed for processing data related to the computer’s display such as
image-like data. GPUs can be leveraged to obtain additional computing power for
resource-demanding tasks, such as AI systems, as in the case of Deep Learning.
Graph: a kind of dimensionless structure that is an abstraction of the objects
involved in a process as well as their relationships (connections). It is characterized
by nodes and arcs, representing the objects and the connections respectively. The
latter also carry other characteristics, such as weights, corresponding to the
strength of each connection.
Graph analytics: a data science methodology making use of Graph Theory to tackle
problems through the analysis of the relationships among the entities involved.
GRU (Gated Recurrent Units): a simpler version of an LSTM network.
H
Hadoop: a distributed data storage technology developed by Apache Foundation.
HDFS: abbreviation for Hadoop Distributed File System, a framework for storing
and accessing data over a computer cluster, in an effective and efficient manner.
Heuristic: an empirical metric or function that aims to provide some useful tool or
insight, to facilitate a method or project of data science or artificial intelligence.
Hive: an open source project of Apache Foundation. It is software that facilitates
reading, writing, and managing large datasets residing in distributed storage using
a variant of SQL called HiveQL.
HiveQL: query language of Hive.
I
IDE (Integrated Development Environment): a system designed for facilitating the
creation and running of scripts as well as their debugging. Jupyter is a popular IDE
for data science applications.
Interpretability: the ability to more thoroughly understand a data model’s outputs
and derive how they relate to its inputs (features). Lack of interpretability is an issue
for deep learning systems.
J
Julia: a modern programming language of the functional programming paradigm,
comprising characteristics for both high-level and low-level languages. Its ease of
use, high speed, scalability, and sufficient amount of packages, make it a robust
language well-suited for data science. Recently v. 1.0 of the language was released,
making it officially production-ready.
Jupyter: a popular browser-based IDE for various data science languages, such as
Python and Julia.
K
Kaggle: a web site that hosts data science competitions and provides many useful
datasets.
Keras: a very popular high-level deep learning API that can run on top of
TensorFlow, CNTK, and Theano.
Kohonen’s Maps: see Self-organizing Maps.
L
Layer: a set of neurons in an artificial neural network. Inner layers are usually
referred to as hidden layers and consist mainly of meta-features created by the
system.
Logistic function: see sigmoid function.
LSTM (Long Short-Term Memory) network: a kind of Recurrent Neural Network
that specializes in long-term dependencies among the data points. Comprising
four distinct artificial neural networks, an LSTM network is ideal for natural
language processing applications.
M
Machine Learning (ML): a set of algorithms and programs that aim to process data
without relying on statistical methods. ML is generally faster than statistics, and some
of its methods are significantly more accurate than the corresponding statistical ones,
while making fewer assumptions about the data. There is a noticeable overlap
between ML and artificial intelligence systems designed for data science.
Mapping: the process of connecting a variable or a set of variables, to a variable we
are trying to predict (aka target variable). Mappings can be analytical using a
mathematical function, or not, such as employing a set of rules, or a network of
functions, as in the case of an artificial neural network. Mappings are inherent in
every data model.
Meta-features (aka super features or synthetic features): high quality features that
encapsulate larger amounts of information, usually represented in a series of
conventional features. Meta-features are either synthesized in an artificial intelligence
system, or created through dimensionality reduction.
Methodology: a set of methods and the theory behind those methods, for solving a
particular kind of problem in a certain field. Methodologies of data science include
classification, regression, etc. while for artificial intelligence, we have methodologies
like deep learning, autoencoders, etc.
Model Maintenance: the process of updating or even upgrading a data model, as
new data becomes available or as the assumptions of the problem change.
Multi-Layer Perceptron (MLP): a deep learning system that comprises a series of
layers of neurons, much like a normal ANN, but larger. It is often referred to as a feed-
forward network and it’s the first system in the deep learning family to have been
developed. MLPs are great for various standard data science problems, such as
classification and regression.
Mutation: a process in the Genetic Algorithms framework, according to which a
random gene changes its value at random, with a given probability.
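A minimal Python sketch of this process on a bit-string chromosome (the function name and default probability are illustrative):

```python
import random

def mutate(chromosome, probability=0.05, rng=random):
    """Flip each gene (bit) independently with the given probability."""
    return [1 - g if rng.random() < probability else g for g in chromosome]

# With probability 1.0 every bit flips; with 0.0 none do
print(mutate([0, 1, 0, 1], probability=1.0))  # [1, 0, 1, 0]
print(mutate([0, 1, 0, 1], probability=0.0))  # [0, 1, 0, 1]
```

Mutation keeps the population diverse, helping the search escape local optima.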
MXNet: a deep learning framework developed by Apache. MXNet is linked to
Amazon, although it can run on any cloud computing service. Its main API is called
Gluon and it’s part of the main package of MXNet. There are several such packages
in different programming languages, each one an API for that language. MXNet can
support more programming languages than any other AI framework.
N
Natural Language Processing (NLP): a text analytics methodology focusing on
categorizing the various parts of speech for a more in-depth analysis of the text
involved.
Narrow AI: see weak AI.
NDArray: a data structure that is prominent in the MXNet framework. NDArrays are
flexible like conventional data arrays, but also allow for context in the data stored in
them, for more efficient processing of it.
Network: a collection of nodes, forming a data structure often used in AI systems.
In computer science, networks are what are known as graphs in mathematics.
Neuron: a fundamental component of an artificial neural network, usually
representing an input (feature), a meta-feature, or an output. Neurons in a network-
based AI system are organized in layers.
O
Objective function: see fitness function.
Optimization: an artificial intelligence process, aimed at finding the best value of a
function (usually referred to as the fitness function) given a set of restrictions.
Optimization is key in all modern data science systems. Although there are
deterministic optimization algorithms out there, most of the modern algorithms are
stochastic.
Optimizer: an artificial intelligence system designed to perform optimization.
Overfitting: the case whereby a model is too specialized to a particular dataset. Its
main characteristic is great performance on the training set and poor performance
on any other dataset. Overfitting is a characteristic of an overly complex model.
P
Package: a set of programs designed for a specific set of related tasks, sharing the
same data structures, and freely available to the users of a given programming
language. Packages may require other packages in order to function, which are
called dependencies. Once installed, the package can be imported in the
programming language and used in scripts.
Parallelizable: an attribute of many systems and algorithms, whereby different parts
of them can be split among various CPUs or GPUs, working in parallel. This brings
about a boost in performance that is often essential for AI processes. A
parallelizable system is also more easily scalable.
Particle Swarm Optimization (PSO): a fundamental optimization algorithm, with a
number of variants. Some claim that all swarm intelligence optimizers are based on
PSO, since it is the simplest optimization algorithm of this category. PSO is geared
towards continuous variables, though there is a variant of it for discrete ones. PSO
has a variety of applications in all sorts of problems, involving a large number of
variables (very large search space).
Perceptron: a rudimentary AI model and a fundamental component of an artificial
neural network. When it comes to classification, a single perceptron can only handle
very simple problems as it fails to generalize non-linear class boundaries.
Personally Identifiable Information (PII): information that can be used to pinpoint
a particular individual, thereby violating his/her privacy. PII is an important ethical
concern in data science and may not be so easy to tackle, since it often relies on
combinations of variables.
Pipeline: also known as workflow, it is a conceptual process involving a variety of
steps, each one of which can comprise several other processes. A pipeline is
essential for organizing the tasks needed to perform any complex procedure (often
non-linear) and is very applicable in data science (this application is known as the
data science pipeline).
Population: the totality of the elements involved in an optimization system,
involving a number of solutions used at the same time. In some systems it is
referred to as a swarm. Alternatively, the totality of the data describing a given
phenomenon. Since this is often not available to the data scientist, samples of it are
used instead.
Possibilistic modeling: a particular modeling paradigm sometimes used in
artificial intelligence systems, making use of membership functions instead of
probabilities, in order to model uncertainty in Fuzzy Logic.
Predictive analytics: a set of methodologies of data science, related to the prediction
of certain variables. It includes a variety of techniques such as classification,
regression, time-series analysis, and more. Predictive analytics are a key part of data
science.
Pruning: the process of trimming a model or search structure so that unwanted
solutions can be eliminated early. However, with this process, the number of
decisions that can be made by machines is restricted.
Python: a widely used object-oriented programming language, typically used for
data science, as well as artificial intelligence applications geared towards data analytics.
Q
Quantum computing: a new computing paradigm which leverages quantum effects
and the use of qubits, for carrying out computing tasks potentially faster. Not all
problems can be solved efficiently with quantum computing, but it is believed that
artificial intelligence has a lot to benefit from it.
Qubits: short for quantum bits, a new kind of information unit, in the quantum
computing paradigm. Unlike conventional bits, which can be either 0 or 1 only,
qubits can take the value 0, the value 1, or a combination of both at the same time
(leveraging the superposition concept from Quantum Physics).
R
Recurrent neural network (RNN): a deep learning network that employs feedback
(recurrent) connections in the flow of the data it processes, resulting in an improved
analysis of complex datasets through the modeling of the temporal aspect of the
data at hand. RNNs are ideal for natural language processing applications, as well
as speech analysis projects.
Regression: a very popular data science methodology, under the predictive analytics
umbrella. Regression aims to solve the problem of predicting the values of a
continuous variable corresponding to a set of inputs, based on pre-existing
knowledge of similar data, available in the training set.
Regressor: a predictive analytics system geared towards regression problems.
Reinforcement learning: a type of machine learning in which machines are “taught”
to achieve their target function through a process of experimentation and reward.
The machine receives positive reinforcement when its processes produce the
desired result and negative reinforcement when they do not.
ReLU function: a minimalistic transfer function used in deep learning. Applied to a
neuron, it is defined as f(x) = max(w*x + b, 0) and takes values between 0 and infinity.
Being computationally cheaper than other transfer functions and yielding sparser
activations, it is often preferred when creating a deep learning network.
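As a quick plain-Python sketch of this function (the weight and bias values below are arbitrary, for illustration only):

```python
def relu(z):
    """ReLU transfer function: passes positive inputs through, zeroes out the rest."""
    return max(z, 0.0)

# Applied to a neuron's pre-activation w*x + b:
w, b = 0.5, -1.0  # hypothetical weight and bias
outputs = [relu(w * x + b) for x in (-2.0, 2.0, 6.0)]  # [0.0, 0.0, 2.0]
```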
Resilient Distributed Dataset (RDD): the data representation of Apache Spark that
supports distributed computing.
Restricted Boltzmann Machine (RBM): a type of artificial neural network, geared
towards learning the probability distributions of a set of features, in order to
reconstruct them using a small set of meta-features. RBMs are similar to autoencoders
in the sense that they too are often used for dimensionality reduction, though RBMs
are used in other contexts, such as predictive analytics.
S
Sample: a limited portion of the data available, useful for building a model, and
(ideally) representative of the population it belongs to.
Sampling: the process of acquiring a sample of a population using a specialized
technique. Sampling is very important to be done properly, to ensure that the
resulting sample is representative of the population studied. Sampling needs to be
random and unbiased.
Scala: a functional programming language, very similar to Java, that is used in data
science. The big data framework Spark is based on Scala.
Selection: a process of figuring out which chromosomes get to cross over, in the
Genetic Algorithms framework. Selection is a stochastic process related to the fitness of
the chromosomes involved.
Self-organizing Maps (SOMs): a special kind of artificial neural network, able to
represent a feature space on a 2-dimensional plane, without the need of a target
variable. In data science, SOMs are useful for data visualization and data exploration.
Sentiment analysis: a natural language processing method involving the classification
of a text into a predefined sentiment, or the figuring out of a numeric value that
represents the sentiment polarity (how positive or how negative the overall
sentiment is).
Sigmoid function: a mathematical function of the form f(x) = 1 / (1 + exp(-(w*x +
b))). Sigmoids are used in various artificial neural networks, such as Deep Learning
networks, as transfer functions. Sigmoid takes values between 0 and 1, not inclusive.
This is sometimes referred to as the logistic function as it features in logistic
regression.
Simulated Annealing (SA): an optimization method in the nature inspired family of
algorithms, based on the annealing process of liquids. SA is ideal for complex
search spaces and it is very robust against local optima. A classical application of
SA is the Traveling Salesman Problem.
Softmax function: a transfer function often used in the output layer of a deep learning
network. It generalizes the sigmoid function to a vector of values, normalizing them into
a probability distribution: f(x_i) = exp(x_i) / Σ_j exp(x_j). Each output takes values
between 0 and 1, not inclusive, and all the outputs sum to 1.
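A minimal Python sketch of softmax (the input scores are arbitrary; subtracting the maximum is a standard numerical-stability trick, not something required by the definition):

```python
from math import exp

def softmax(z):
    """Softmax: turns a vector of raw scores into a probability distribution."""
    m = max(z)                       # subtract the max for numerical stability
    exps = [exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # largest score gets the largest probability
```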
Spark: see Apache Spark.
Stochastic: something that is probabilistic in nature. That is, not deterministic.
Stochastic processes are common in most artificial intelligence systems and other
advanced machine learning systems.
Strong AI: an area of AI development that is working toward the goal of making AI
systems that are as useful and skilled as the human mind. It is often referred to as
Artificial General Intelligence (AGI).
Supervised learning: a set of data science methodologies where there is a target variable
that needs to be predicted. The main parts of supervised learning are classification
and regression.
Swarm: a set of potential solutions, evolving through a swarm intelligence
framework. In the general case of different optimization systems, a swarm is referred
to as population.
Swarm intelligence: a concept based on the idea that when individual agents come
together, the interactions between them lead to the emergence of a more ‘intelligent’
collective behavior – such as a swarm of bees. Swarm intelligence is an essential part
of modern optimization methods, such as particle swarm optimization and it is
stochastic by nature.
T
Tanh function: the hyperbolic tangent function. It is of the same family as the
sigmoid and it is sometimes used as a transfer function for deep learning networks. It
is defined as f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)) and takes values between -1
and 1, not inclusive.
Target variable: the variable of a dataset that is the target of a predictive analytics
system, such as a classification or a regression system.
Tensor Processing Unit (TPU): a special type of proprietary processor designed by
Google in 2016 to accelerate the training of the neural network models implemented
in the TensorFlow framework.
TensorFlow: a deep learning and high performance numerical computation library.
Initiated by Google and improved by a very large open source community,
TensorFlow is by far the most popular deep learning framework today.
Testing set: the part of the dataset that is used for testing a predictive analytics model
after it has been trained and before it is deployed. The testing set usually
corresponds to a small portion of the original dataset.
Training algorithm: the algorithm used for training a deep learning system (or a
predictive analytics model in general). It entails figuring out which nodes to keep
and what weights their connections have, so as to obtain a good generalization for
the problem at hand. Back-propagation is an established training algorithm,
suitable for various kinds of artificial neural networks, including deep learning
systems.
Training set: the part of the dataset that is used for training a predictive analytics
model before it is tested and deployed. The training set usually corresponds to the
largest portion of the original dataset.
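A minimal plain-Python sketch of forming these two sets (the 80/20 split ratio and the fixed seed below are common conventions chosen for illustration, not rules):

```python
import random

def split_dataset(data, train_fraction=0.8, seed=42):
    """Randomly partition a dataset into a training set and a testing set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)))  # 80 training rows, 20 testing rows
```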
Trait: a characteristic of a problem, expressed as a gene, in the Genetic Algorithms
framework.
Transfer function: a component of an artificial neural network, corresponding to the
function applied on the output of a neuron before it is transmitted to the next layer.
A typical example is the sigmoid function, though ReLU is often used in practice too.
Transfer functions are sometimes referred to as activation functions.
Transfer learning: a machine learning methodology where a model trained for a task is
reused in a second task without retraining. Taking the outputs of the pre-trained
model as input to another model, this methodology tries to increase the
performance of the second model as well as to reduce the training time.
Traveling Salesman Problem (TSP): a classical problem in graph analytics, whereby
we aim to find the shortest route. That is, the one with the smallest overall cost in
terms of time or distance, from one node of the graph back to itself, after passing
through all the other nodes exactly once. TSP is one of the core problems in logistics
and one of the most challenging optimization problems out there.
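For illustration, a brute-force solver over a tiny, made-up distance matrix shows why TSP explodes combinatorially: from a fixed start node there are (n-1)! candidate tours to check, so this approach is only viable for very small graphs:

```python
from itertools import permutations

def tsp_brute_force(dist):
    """Exhaustively search all tours starting and ending at node 0.
    Returns the best tour and its total cost."""
    n = len(dist)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost

# Hypothetical symmetric distance matrix for 4 cities:
d = [[0, 1, 4, 6],
     [1, 0, 2, 5],
     [4, 2, 0, 3],
     [6, 5, 3, 0]]
tour, cost = tsp_brute_force(d)
```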
U
Unfolding / unrolling: in recurrent neural networks, the process of depicting the
network’s architecture in such a way that all the time steps are visible, enabling a
better understanding of its functionality. This is done by expanding the folded
connections in the hidden layer part of the network, giving it a linear form.
Unsupervised learning: a set of data science methodologies where there is no target
variable that needs to be predicted.
V
Variable: a column in a dataset, be it in a matrix or a dataframe. Variables are
usually turned into features, after some data engineering is performed on them.
Virtual Machine (VM): a collection of computing, storage resources, and software,
taking the form of a computer, accessible via the Internet. VMs usually live in the
cloud though there is software enabling you to create a VM on your own computer
or computer cluster. Cloud-based VMs are very useful for AI applications.
W
Weak AI: also known as narrow AI, weak AI refers to a non-sentient computer
system that operates within a predetermined range of skills and usually focuses on
a singular task or small set of tasks. All AI in use today is weak AI.
Workflow: see pipeline.
APPENDIX A
Transfer Learning
It is not unusual for a complex deep learning model to take days or even weeks to
train on today’s most performant hardware. Hence, it is impractical for many
practitioners to re-train those useful models in order to make use of their outputs in
other related tasks. Instead, using the outputs or knowledge of pre-trained models
without re-training them is a more practical approach.
Transfer learning is a machine learning method that addresses this issue. More
formally, transfer learning is:
the improvement of learning in a new task through the transfer of knowledge from
a related task that has already been learned.37
In simpler terms, using the outputs of a pre-trained model as inputs to another
model to solve a different but somehow related task is known as transfer learning.
Although it is not specific to deep learning models, the long training times of these
models make transfer learning a beneficial method for certain tasks. Transfer
learning can bring three possible benefits:
Higher start: The initial skill (before refining the model) on the source
model is higher than it otherwise would be.
Higher slope: The rate of improvement of skill during training of the
source model is steeper than it otherwise would be.
Higher asymptote: The converged skill of the trained model is better than
it otherwise would be.
That being said, applying transfer learning is usually a trial-and-error process, as it
might be hard to guess whether the approach will yield any benefits. However, any
domain knowledge and expertise on the original task are useful guides on when to
use transfer learning.
Many deep learning frameworks are already bundled with some of the state-of-the-
art models so that you can use these pre-trained models as part of your transfer
learning method. Keras also has a large set of pre-trained models integrated in it.
You can use these pre-trained models by just importing Keras’ applications module.
You can read the documentation here39 to learn more about Keras’ applications
module.
Chapter 11, Transfer Learning: https://amzn.to/2LlIxyB.
See the reference in footnote 1.
https://keras.io/applications/.
APPENDIX B
Reinforcement Learning
We briefly cover Reinforcement Learning40 (RL), which has applicability in a wide
range of areas like game playing and finance. Resting upon a basic observation on
how humans learn from their experiences, RL is one of the most important learning
paradigms in machine learning and to some extent, to AI too.
Let’s first look at how a baby learns whether fire is a good thing or a bad thing.
Since fire illuminates the environment, the baby learns that fire is something
shining and colorful. When she gets closer to the fire, she gets warm, which on cold
nights is also very beneficial. Eventually, when she tries to touch the fire, she
realizes that fire burns, which is a very bad thing for a baby. Hence, from these
experiments the baby learns that fire is sometimes good and sometimes bad. As
long as she keeps her distance at some level, she can enjoy the fire.
Similar to the learning in our example, RL is a learning paradigm in machine
learning where learning from experiments is the key tenet. More formally, RL is a
type of machine learning where agents learn how to behave in environments by
observing the outcomes of their actions. In a sense, RL can be thought of as a special
form of trial and error in machine learning.
Applications of RL are especially popular in game playing. Also, understanding the
terminology of RL in a game context is quite easy. Because of this, in the following
discussion we’ll give our examples from an old but still popular game: Super Mario!
Key terms
In order to speak about RL, the following four key terms should be understood:
1. Agent: the entity that learns and takes the actions (in Super Mario, Mario
himself).
2. Environment: the world the agent interacts with, which responds to the
agent’s actions (the game level).
3. Action: a move the agent can make at any given state of the environment
(running, jumping, and so on).
4. Reward: the feedback the environment returns after each action (collecting
coins, losing a life, etc.).
Let’s summarize what we’ve learned from the definitions of these four terms. In RL,
our task is to make our agents learn the best set of actions from the environments
they are given by experimenting. The term “best” means that the agent should
maximize its total rewards and hence the best set of actions should result in the
maximum possible total rewards. More formally, the goal of the agent is to
maximize its expected cumulative rewards.
Reward hypothesis
The central optimization idea in RL rests upon the “Reward Hypothesis”.
According to this idea, tasks can be represented as the maximum expected
cumulative rewards so that the analytical RL methods can be applied. Even if not all
tasks fall into this category, there are many important tasks that fall within. If we
can represent a problem along the lines of this reward hypothesis, then RL
techniques can be applied to it.
Mathematically, the cumulative reward function is the sum of the rewards that
agent gets after each action:
G(t) = R(t+1) + R(t+2) + … + R(t+n)
In some cases, however, the near future should be given more weight than the distant
future. To do that, the rewards are multiplied by a preference parameter, or discount
factor, γ, which lies between 0 and 1:
G(t) = R(t+1) + γ·R(t+2) + γ²·R(t+3) + …
By doing this, we make our agents care more about the near future than the distant
future.41
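The discounted cumulative reward can be sketched in a few lines of Python (the reward values and discount factor below are made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative reward with discounting:
    G(t) = R(t+1) + gamma*R(t+2) + gamma^2*R(t+3) + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5 the agent values near rewards far more than distant ones:
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```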
Types of tasks
A task is an instance of an RL problem. The formula above includes “n” rewards.
However, some tasks continue forever. In that respect, we have two types of tasks:
1. Episodic Tasks: These tasks have some ending point. So, these are tasks
that are limited in terms of the number of actions and the states of the
environment. In the Super Mario game, a level corresponds to an
episode.
2. Continuous Tasks: These are the tasks that have no ending points.
Hence, these tasks continue forever. For this kind of task, the reward
formula we gave above should be adjusted such that the number of
reward terms in the summation goes to infinity. A good example of
these tasks is stock trading in finance.
You may also have heard of “Deep Reinforcement Learning”. It is a special kind of Reinforcement
Learning where deep neural networks are used in the modeling. In this appendix, we just use
Reinforcement Learning to emphasize its general applicability to many kinds of machine learning
methods.
You can also find similar functions in economics. The consumption behavior of households is usually
modeled in a similar fashion in economic models.
APPENDIX C
Autoencoder Systems
Autoencoder systems are a special kind of Deep Learning network that aims to
perform a summarization of the original data into a lower-dimensionality feature
space. In essence, they are like PCA, ICA, and other statistical processes for creating
a more compact feature set, but with one key difference: autoencoders provide a
mapping that is highly non-linear and therefore able to capture the non-linear
aspects of the data. This results in a better and more robust representation (aka
embedding), something particularly useful when it comes to highly complex
datasets - all that while keeping the resource requirements to a manageable level.
Components
The architecture of an autoencoder is very similar to that of other DL systems,
comprising a series of layers of neurons. However, autoencoders have the same
number of neurons in the output layer as in the input layer, both corresponding to
the features of the data. The innermost layer (one of the hidden layers of the
network) contains the optimal embedding of these features, and it always has fewer
neurons than the input layer. You can view a minimalistic autoencoder system in
Figure 29.
Data flow and functionality
The data in an autoencoder initially flows from the input to the output, through all
the hidden layers in-between. Once it reaches the end of the network, the outputs
are compared to the original feature values as the latter take the role of the targets.
The inevitable differences (errors) travel backwards and are used to modify the
weights of the connections among the neurons and the whole process is repeated
until the errors are below a predefined limit.
Moreover, the process of turning the input data into embedding is called encoding
(or convolution), while the reverse process, whereby the data in the embedding
process is turned into an approximation of the original features, in the output layer,
is called decoding (or deconvolution). Finally, the data corresponding to the hidden
layer is usually referred to as latent variables.
Error function
The error function of an autoencoder is usually the squared error between the inputs
and the outputs, though depending on the problem at hand it can involve different
kinds of norms of the aforementioned differences. Just like multi-level perceptrons,
once computed, the error is propagated to the different parts of the network.
Extensions of conventional autoencoder models
Denoising autoencoder
This is a special kind of autoencoder, first explored in 2008 by Vincent et al. It
basically works with corrupt data as its inputs and attempts to reconstruct the noisy
signal. It manages that by abstracting what is called the manifold of the data, which
is the shape we wish to capture when the dimensionality of the data is reduced.
Once a form of the manifold is abstracted based on the input data, the autoencoder
rebuilds that signal and outputs what it believes to be a more accurate
representation of the original inputs.
Note that the manifold approach is common among conventional autoencoders too,
since without an abstraction of the signal at hand, it would not be possible to
express it in a different form, particularly a more information-rich one.
Sparse autoencoder
Just like in regression systems, a regularization parameter is introduced to ensure a
simple and therefore more robust model. This is accomplished by introducing a
parameter β corresponding to how much weight we put on a sparser (simpler)
coding: ES = E + β·Penalty.
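As a minimal sketch of this regularized error, assuming (for illustration only) the L1 norm of the latent code as the Penalty term, which is one common choice for encouraging sparsity:

```python
def sparse_autoencoder_loss(reconstruction_error, code, beta=0.1):
    """Regularized error ES = E + beta * Penalty, with an L1 penalty
    on the latent code (the values of which are hypothetical)."""
    penalty = sum(abs(v) for v in code)
    return reconstruction_error + beta * penalty

es = sparse_autoencoder_loss(0.5, code=[0.0, 2.0, -1.0], beta=0.1)  # ≈ 0.8
```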
Such a system would be also somewhat easier to interpret, while also more resistant
to instability. Also, despite the seemingly convoluted processes related to sparsity
and how it applies to a more robust representation, in practice it is quite
straightforward, as it is quite easy to implement through the various DL
frameworks.
Variational autoencoder
Variational AutoEncoders (or VAE for short) are specialized autoencoder systems
that are designed for calculating the parameters of a statistical model that describes
the distribution of the input data. VAEs are like conventional autoencoders but with
a twist in the latent variables they generate. The idea is to create such latent
variables that approximately follow a unit Gaussian distribution, instead of being
arbitrary. After this is done, it’s just a matter of generating new points following
that distribution and passing them as inputs to the decoding part of the VAE
network. For be er performance, we can apply a trick of sorts, involving a process
called KL divergence, which needs to be optimized.
VAEs not only produce new data that closely resembles the original dataset, but
they also create data that is free of noise and looks more realistic. This is the
main reason why they are better at this task compared to Generative Adversarial
Networks (GANs).
Components
The architecture of a GAN is fairly straightforward, once you understand its general
premise. A comprehensive view of it can be found in Figure 30. Note that the
discriminator network D(x) can be any kind of DL network that can be employed in
predictive analytics.
Data flow and functionality
The data in a GAN flows towards the discriminator network, which is then asked to
decide whether it belongs to a particular class (usually 1 for real data and 0 for
artificial data). If the data is coming from the real dataset, the chance of it passing is
D(x), which corresponds to an entropy term E1[log(D(x))]. Naturally, the
discriminator network tries to maximize this quantity by making D(x) as close to 1
as possible. On the other hand, there is a chance that the data is coming from the
artificial dataset that the generative network is putting together. In this case the
chance of it passing is 1 - D(G(z)), with the corresponding entropy term being
E2[log(1 - D(G(z)))]. The discriminator tries to maximize this too, by making D(G(z))
as close to 0 as possible. So, overall, the objective it is pursuing (through its
training) is to maximize the quantity V(D, G) = E1 + E2.
On the other hand, the generative network is trying to minimize the same quantity
V(D, G), since this is how it can fool its counterpart into accepting its artificial data
as real. This whole process can be summarized in the following minimax expression,
where G minimizes what D maximizes:
min over G of max over D of V(D, G) = E1[log(D(x))] + E2[log(1 - D(G(z)))]
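As a rough numerical illustration of this objective (a plain-Python sketch; the discriminator output values below are made up, not part of the book's code), V(D, G) can be estimated from mini-batches of discriminator outputs on real and generated samples:

```python
from math import log

def gan_value(d_real, d_fake):
    """Estimate V(D, G) = mean(log D(x)) + mean(log(1 - D(G(z))))
    from discriminator outputs on real (d_real) and generated (d_fake) samples."""
    e1 = sum(log(p) for p in d_real) / len(d_real)
    e2 = sum(log(1.0 - p) for p in d_fake) / len(d_fake)
    return e1 + e2

# A confident, correct discriminator yields V close to 0 (its maximum);
# a clueless one (outputting 0.5 everywhere) yields a much lower value:
v_good = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
v_bad = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```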
Training process
When training a GAN, for a problem you have defined and collected data for, you
need to apply the following process:
Perspective issues
Since perspective is something that only relatively recently has been understood by
our species when it comes to visual representations of the world (one of the
innovations of Leonardo Da Vinci), it is no surprise that GANs often fail to deliver
images that exhibit this characteristic. After all, GANs tend to view an image as a
flat surface having various colors on it, much like an ancient painting.
AI resources
Resources are important in every project. When it comes to AI-based data science,
the ones that are most relevant are computing and data resources, though there is
also the expertise of the specialized professionals handling these systems. The latter
are covered in the final section of this appendix.
Computing resources
The computing resources of an AI-based project depend on the project’s scale. They
involve a lot of memory, though the main bottleneck is usually the computing
power, which is generally covered through a lot of GPUs. When scaling up a
system, it’s best to do that gradually though, since sometimes the trade-off between
computing resources and time required by an AI system (mainly for its training,
when it comes to deep learning systems) is manageable. Cloud computing systems
are a great way to experiment with this trade-off, though for more long-term
projects, it makes more sense to have your own computing infrastructure, in the
form of a private cloud, aka a computer cluster.
Data resources
Data resources are something oftentimes ignored since many people have a warped
perception about AI’s relationship to data. Although there are AI systems that can
generate data similar to what they have been trained on such as VAEs and GANs, in
order for them to provide good quality data they need to have a good
generalization of the dataset they are going to expand. For this to happen though,
sufficient data is required, just like any other AI project.
Data science systems (particularly machine learning ones) have evolved over the
years and can handle all sorts of datasets. However, for AI systems to truly offer an
edge over these systems, there needs to be a lot of data available. So, getting a
number of data streams in a data science pipeline and having processes in place for
maintaining them (so that the data doesn’t become stale) is paramount for making
an AI-based system bring about sufficient value to justify its existence and the
potential training of the people who use it regularly.
Industries and applications benefiting the most from AI
Naturally, industries where a lot of data is involved are more suitable for adopting
AI in their data science pipelines. That’s why the financial and telecommunications
sectors were the first ones to adopt data science in general. Also, industries like
retail can benefit a lot from AI, as they have dynamic data that is suitable for an
RNN system, for example.
What’s more, if you are considering an application involving lots of text, sound or
image data, such as through the ingestion of social media feeds, you can get a lot
out of an AI system, particularly if you have a lot of data like that. If all your data is
binary and continuous variables though, you may get by with a conventional data
science system too. Make sure you consult a data scientist with AI know-how when
making this decision.
That’s all. You can access the files inside the Docker image by going to 0.0.0.0:8888/
in your browser, where a Jupyter notebook will already be running for you.
Docker tips
Generally, you don’t need an internet connection when running the Docker image,
but for the updating stage (which you only need to run once), you’ll need to be
connected to the internet. Also, when accessing the Jupyter application, you’ll need
to provide the aforementioned password again. Although port 8888 is not used by
any popular program, make sure that no other program is running on that port
before running the Docker image. Finally, make sure that you run the Docker
image on a computer with at least 4 GB of RAM.
Index
AI frameworks
alternative, 193–210
AI methodologies, 35–37
alternatives to, 27–28
AI resources, 273–74
AI systems
comparing, 39–40
the future of, 37–38
AI, and deep learning systems, 9
ANN (Artificial Neural Network), 11
Apache Spark, 215–16
Apache Spark architecture, 216
Architecture, 12
Artificial creativity (AC), 27, 33–34, 41
Artificial Emotional Intelligence (AEI), 37, 41
Artificial Neural Networks (ANNs), 11, 159
Autoencoder systems, 20
Automated Machine Learning (AutoML), 38
Automated translation, 178
Backpropagation, 194, 211
Bias node, 14
Bidirectional RNN, 173
Big data, 212–13, 223
Billion Words, 222
Biostatistics, 219
Capsule Networks (CapsNets), 35, 193, 198–203
Chollet, François, 211
Classification, 13, 16, 51–57, 63
and CNNs, 163
and Keras, 92–95
and TensorFlow, 72–77
Clustering, 20
CNNs (Convolutional Neural Networks), 11
Common Crawl, 222
Computer vision, 217–18, 221
Conventional data science systems, 24
Convolution and CNNs, 161–62
Convolutional Neural Networks (CNNs), 11, 159, 160–70, 164–70, 179, 199
components of, 160–61
uses of, 169–70
Crossover, 122–23, 145
Darwinian evolution, and genetic algorithm, 121
Data denoising, 263
Data engineering, 10, 216–17
Data flow
and CNNs, 161
and Recurrent Neural Networks (RNNs), 172
Data generation, 263
Data models
building, 19
deployment of, 20
Data science
AI in, 271–75
and AI, 225
specializations, 216–19
Data science education, 275
Data science methodologies
predictive analytics as, 11
Dataset representation, 49
Datasets
and NDArray, 49–51
synthetic, 48–49
Deep Gimble I, 34
Deep learning (DL)
programming languages and, 17–18
Deep learning (DL) systems
alternatives to, 27–28
and AI, 9, 11–12
methodologies and applications and, 20–22
understanding, 12–16
vs. conventional data science systems, 24
Deep learning frameworks, 9
assessing, 22–24
how to leverage, 18–20
Deep Learning Machines (DLM), 194–98
Deep learning networks, 13
Defuzzification, 31
Deng, Jack, 43
Denoising autoencoder, 261
Deterministic optimizers, 29
Dimensionality reduction, 12, 20, 262
Docker system, 277–78
Docker tips, 278
Elitism, 123
Emscripten, 44
Ensembles
building of, 181–92
Epochs
training, 14
Estimators API, and TensorFlow, 84–87
ETL processes, 19, 22
Exponential expression
and particle Swarm Optimization (PSO), 115–16
Extract, Transform, and Load (ETL) processes, 11
Extreme Learning Machine (ELM), 35, 193, 209
architecture of, 196
Feature fusion, 39
Firefly, 119
Firefly optimizer, 107–8, 187–89
FIS, 32, 39, 194
Fitness, 123
Frameworks, 9–10
Main deep learning, 16–17
meta-feature of, 15
Functional programming languages, 225
Fuzzification, 31
Fuzzy Inference System (FIS), 32, 194, 203–7
downside of, 32
Fuzzy logic (FL), 30–33, 41, 194
Fuzzy logic systems, 27–28
relevance of, 32
Fuzzy logic tips, 208
Generative Adversarial Networks (GANs), 36, 262, 265–70
Genetic Algorithm (GA), 121–46, 144
activities in, 132–41, 141–43
in Julia, 127–31
tips, 143
Genetic Programming (GP), 142–43
Gluon interface, 44–45, 51, 55, 57, 61, 62
GRUs, 173–74, 179
Hadoop, 213–14, 214
Healthcare, 219
Hinton, Geoffrey, 194, 198, 199
Holland, John, Adaptation in Natural and Artificial Systems, 121
Image caption generation, 178
IMF, 223
Internet of Things, 218
Interpretability, 23–24
Julia, 208
Particle Swarm Optimization (PSO) implementation, 109–12
programming language, 18
Simulated Annealing in, 151–53
Julia programming language, 30
Kaggle, 220, 224
Keras, 16, 17, 174
activities in, 91–101
and TensorFlow, 89–101
converting to TensorFlow estimators, 100
core components of, 91
model graph, 99
Kirkpatrick, S., 147
Knet framework, 16
Kohonen (Professor), 36
Kohonen Maps, 36
LeCun, Yann, 160
Library, 9
LibriSpeech, 223
LSTMs, 173–74, 179
Machine learning methodologies, 11
Machine learning systems
vs. deep learning systems, 12
Main deep learning
programming languages, 17–18
Main deep learning frameworks, 16–17
Matlab, 18
Meta-features, 20, 35
of the framework, 15
Methodologies, 10
data science, 11
machine learning, 11
MNIST dataset, 164, 165
Model, 10
Model maintenance, 24
Multi-Layer Perceptrons, 55
Mutation, 122, 145
MXNet, 16, 17, 62
and checkpoints, 60–61
and deep learning frameworks, 43–44
and NDArray, 45–47, 49–51
and Python, 48–49
checkpoints, 63
Gluon interface and, 44–45
package in Python, 47–48
tips, 61
Natural language, 221
Natural Language Processing (NLP), 21, 217
NDArrays, 45–47, 63
Neuro-fuzzy systems, 31
Neurons, 11, 12
NLP, 178
Non-linearity and CNNs, 163
Novel AI methodology, 41
Novel AI systems, 40
OpenAI Gym, 257
Optimization, 27, 35, 39, 41
definition, 28
importance of, 28
programming languages and, 30
Optimization ensemble, 181–82, 183, 191
and data science, 189–90
framework of, 183
tips, 190–91
Optimization methods
and Particle Swarm Optimization, 109
Optimization systems, 28, 29, 41
Parallelization, 182
Particle Swarm Optimization (PSO), 103–19, 118
activities in, 112–16
algorithms, 104–6
implementation in Julia, 109–12
tips, 116–18
variants of, 106–8
vs. alternatives, 109
Perceptron system, 15, 194
Personally Identifiable Information (PII), 34
Pipeline, 10
Polynomial expression
and Particle Swarm Optimization, 114–15
Pooling, and CNNs, 163
Predictive analytics system, 11, 20, 39
Programming languages, 9–10, 16, 43
main deep learning, 17–18
Programming resources, 207
PSO optimizer, 181
PSO systems
and Firefly ensemble, 187–89
in an ensemble, 184–86
Publicly available datasets, 220–23
Python, 18, 43, 63, 174, 207
and MXNet, 48–49
MXNet package in, 47–48
PyTorch, 16
Quantum Annealing, 156
Quora Question Pairs, 222
Rechenberg, Ingo, 121
Recurrent Neural Networks (RNNs), 159, 170–78
activities of, 174–78
components of, 171–72
variants of, 173–74
Regression, 63
and Keras, 95–98
and TensorFlow, 77–81
Regression MLP system, 57–59
Reinforcement learning (RL), 20, 253–57
ReLU function, 52
Resilient Distributed Dataset (RDD), 215
RL-Glue, 257
Rosenblatt, Frank, 194
Self-Organizing Genetic Algorithm (SOGA), 141
Self-Organizing Maps (SOMs), 36
Sentiment analysis, 21
Sigmoid function, 14
Simulated Annealing (SA), 147–58, 149
activities and, 154–56
variants of, 156–57
Social sciences, 219
and time series, 223
Softmax function, 52, 172
Speech in English, 222
Speech recognition, 178
SQuAD, 222
Standard Genetic Algorithm, 125–27
Stanford Sentiment Treebank, 222
Stochastic optimizers, 29
Supervised learning, 20
Synthetic datasets, 48–49
System, 10
Target variable, 14
TensorBoard dashboard, 83
TensorFlow, 16, 17
and Keras, 89–101
architecture of, 67
TensorFlow estimators
and Keras, 100
Testing, 10
Text synthesis, 178
Theano, 16
TIMIT, 223
Training, 10
Training algorithms, 12, 15
Training epochs, 14
Training process
and CNNs, 163
and RNNs, 173
Transfer function, 14
Transfer learning, 247–51
Unsupervised learning methodologies, 21
Variable Selective Pressure Model (VSPM), 141
Variational autoencoder, 262
Visualization
and CNNs model, 164–70
and Keras, 98–100
and TensorFlow, 81–87
WikiText, 221
World Bank, 223