Machine Learning with Python: Theory and Applications

G. R. Liu
University of Cincinnati, USA

World Scientific
Contents
1 Introduction
  1.1 Naturally Learned Ability for Problem Solving
  1.2 Physics-Law-based Models
  1.3 Machine Learning Models, Data-based
  1.4 General Steps for Training Machine Learning Models
  1.5 Some Mathematical Concepts, Variables, and Spaces
    1.5.1 Toy examples
    1.5.2 Feature space
    1.5.3 Affine space
    1.5.4 Label space
    1.5.5 Hypothesis space
    1.5.6 Definition of a typical machine learning model, a mathematical view
  1.6 Requirements for Creating Machine Learning Models
  1.7 Types of Data
  1.8 Relation Between Physics-Law-based and Data-based Models
  1.9 This Book
  1.10 Who May Read This Book
  1.11 Codes Used in This Book
  References

2 Basics of Python
  2.1 An Exercise
  2.2 Briefing on Python
  2.3 Variable Types
    2.3.1 Numbers
    2.3.2 Underscore placeholder
    2.3.3 Strings
    2.3.4 Conversion between types of variables
    2.3.5 Variable formatting
  2.4 Arithmetic Operators
    2.4.1 Addition, subtraction, multiplication, division, and power
    2.4.2 Built-in functions
  2.5 Boolean Values and Operators
  2.6 Lists: A diversified variable type container
    2.6.1 List creation, appending, concatenation, and updating
    2.6.2 Element-wise addition of lists
    2.6.3 Slicing strings and lists
    2.6.4 Underscore placeholders for lists
    2.6.5 Nested list (lists in lists in lists)
  2.7 Tuples: Value preserved
  2.8 Dictionaries: Indexable via keys
    2.8.1 Assigning data to a dictionary
    2.8.2 Iterating over a dictionary
    2.8.3 Removing a value
    2.8.4 Merging two dictionaries
  2.9 Numpy Arrays: Handy for scientific computation
    2.9.1 Lists vs. Numpy arrays
    2.9.2 Structure of a numpy array
    2.9.3 Axis of a numpy array
    2.9.4 Element-wise computations
    2.9.5 Handy ways to generate multi-dimensional arrays
    2.9.6 Use of external package: MXNet
    2.9.7 In-place operations
    2.9.8 Slicing from a multi-dimensional array
    2.9.9 Broadcasting
1 Introduction

1.1 Naturally Learned Ability for Problem Solving
We are constantly dealing with all kinds of problems every day, and we would like to solve these problems to make timely decisions and take actions. We may notice that for many daily-life problems, our decisions are made spontaneously and swiftly, without much consciousness. This is because we have been constantly learning to solve such problems since we were born, and the solutions have already been encoded in the neurons in our brain. When facing similar problems, our decision is spontaneous.
For many complicated problems, especially in science and engineering,
one would need to think harder and even conduct extensive research and
study on the related issues before we can provide a solution. What if we want
to give spontaneous reliable solutions to these types of problems as well?
Some scientists and engineers may be able to do this for some problems, but
not many. Those scientists are intensively trained or educated in specially
designed courses for dealing with complicated problems.
What if a normal layman would also like to be able to solve these challenging
types of problems? One way is to go through a special learning process.
The alternative may be through machine learning, to develop a special
computer model with a mechanism that can be trained to extract features
from experience or data to provide a reliable and instantaneous solution for
a type of problem.
1.2 Physics-Law-based Models

Problems in science and engineering are usually much more difficult to solve.
This is because we humans can only experience or observe the phenomena
associated with the problem. However, many phenomena are not easily
observable and have very complicated underlying logic. Scientists have been
trying to unveil the underlying logic by developing some theories (or laws or
principles) that can help to best describe these phenomena. These theories
are then formulated in the form of algebraic, differential, or integral system
equations that govern the key variables involved in the phenomena. The
next step is then to find a method that can solve these equations for these
variables varying in space and with time. The final step is to find a way
to validate the theory by observation and/or experiments to measure the
values of these variables. The validated theory is used to build models to
solve problems that exhibit the same phenomena. This type of model is
called physics-law-based model.
The above-mentioned process is essentially what humans on earth have
been doing in trying to understand nature, and we have made tremendous
progress so far. In this process, we have established a huge number of areas of study, such as physics, mathematics, and biology, which are now referred to as sciences.
Understanding nature is only a part of the story. Humans want to
invent and build new things. A good understanding of various phenomena
enables us to do so, and we have practically built everything around us,
buildings, bridges, airplanes, space stations, cars, ships, computers, cell
phones, internet, communication systems, and energy systems. Such a list is
endless. In this process, we humans established a huge number of areas of development, which are now referred to as engineering.
Understanding biology helped us to discover medicines, treatments for
illnesses of humans and animals, treatments for plants and the environment,
as well as proper measures and policies dealing with the relationships
between humans, animals, plants, and environments. In this process, we
humans established a huge number of areas of studies, including medicine,
agriculture, and ecology.
In the relentless quest by humans in history, countless theories, laws,
techniques, methods, etc., have been developed in various areas of science,
engineering, and biology. For example, in the study of a small area of compu-
tational mechanics for designing structural systems, we have developed the
finite element method (FEM) [1], smoothed finite element method (S-FEM)
[2], meshfree methods [3, 4], inverse techniques [5], etc., just to name a few
that the author has been working on. It is neither possible nor necessary to list all of these kinds of methods and techniques. Our discussion here is just to
provide an overall view of how a problem can be solved based on physics laws.
Note that there are many problems in nature, engineering, and society
for which it is difficult to describe and find proper physics laws to accurately
and effectively solve them. Alternative means are thus needed.
1.3 Machine Learning Models, Data-based

This book will cover most of these algorithms, but our focus will be
more on neural network-based models because rigorous theory and predictive
models can be established.
Machine learning is a very active area of research and development. New
models, including the so-called cognitive machine learning models, are being
studied. There are also techniques for manipulating various ML models. This
book, however, will not cover those topics.
1.4 General Steps for Training Machine Learning Models

1. Obtaining the dataset for the problem, by your own means of data
generation, or imported from other existing sources, or computer
syntheses.
2. Clean up the dataset if there are objectively known defects in it.
3. Determine the type of hypothesis for the model.
4. Develop or import a proper module for the algorithm needed for the problem. The learning ability (number of learning parameters) of the model and the size of the dataset shall be properly balanced, if possible. Otherwise, consider the use of regularization techniques.
5. Randomly initialize the learning parameters, or import known pre-trained learning parameters.
6. Perform the training with proper optimization techniques and monitor-
ing measures.
7. Test the trained model using an independent test dataset. This can also
be done during the training.
8. Deploy the trained and tested model to the same type of problems from which the training and testing datasets were collected/generated. A minimal end-to-end sketch of these steps is given below.
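To make the steps concrete, here is a minimal, illustrative sketch using scikit-learn on a synthetic dataset; the package, dataset, and model choices are assumptions for illustration only, not the book's own example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: obtain a dataset (here, synthesized; 3 features, as in the fruit toy example)
X, y = make_classification(n_samples=8000, n_features=3,
                           n_informative=3, n_redundant=0, random_state=0)
# Step 2: cleaning is skipped for synthetic data
# Steps 3-4: choose a hypothesis; C controls the regularization strength
model = LogisticRegression(C=1.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Steps 5-6: parameters are initialized and trained inside fit()
model.fit(X_train, y_train)
# Step 7: test on an independent test dataset
print('test accuracy:', model.score(X_test, y_test))
# Step 8: deploy, e.g., model.predict(new_X) on new data of the same type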
1.5 Some Mathematical Concepts, Variables, and Spaces

We shall define variables and spaces often used in this book for ease of
discussion. We first state that this book deals with only real numbers, unless
specified when geometrically closed operations are required. Let us introduce
two toy examples.
randomly selected fruits of these two types from the market, and create a
dataset with 8,000 paired data-points. Each data-point records the values
of these three features and pairs with two labels (ground truth) of yes-or-no
for apple or yes-or-no for orange. The dataset is also called labeled dataset
for model training.
With an understanding of these two typical types of examples, it should
be easy to extend this to many other types of problems for which a machine
learning model can be effective.
Figure 1.1: Data-points in a 2D feature space $X^2$ with blue vectors: $x_i = [x_{i1}, x_{i2}]$; and the same data-points in the augmented feature space $\bar{X}^2$, called affine space, with red vectors: $\bar{x}_i = [1, x_{i1}, x_{i2}]$; $i = 1, 2, 3, 4$.
A $\bar{X}^p$ space can be created by first spanning $X^p$ by one dimension to $X^{p+1}$ via the introduction of a new variable $x_0$ as

$[x_0, x_1, x_2, \ldots, x_p]$   (1.5)

and then setting $x_0 = 1$. The 4 red vectors shown in Fig. 1.1 live in the affine space $\bar{X}^2$.
Note that the affine space $\bar{X}^p$ is neither $X^{p+1}$ nor $X^p$, and is quite special. A vector in $\bar{X}^p$ is in $X^{p+1}$, but the tip of the vector is confined to the "hyperplane" of $x_0 = 1$. For convenience of discussion in this book, we say that an affine space has a pseudo-dimension of $p + 1$. Its true dimension is $p$, but it is a hyperplane in a $X^{p+1}$ space.
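A minimal numpy sketch of this augmentation (the array values are illustrative assumptions):

import numpy as np
X = np.array([[0.2, 0.5],
              [0.7, 0.1]])                       # data-points in the feature space X^2
Xbar = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1: points now in the affine space
print(Xbar)                                      # each row is [1, x1, x2]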
In terms of function approximation, the linear bases given in Eq. (1.3) can be used to construct an arbitrary linear function in the feature space. A proper linear combination of these complete linear bases is still
in the affine space. Such a combination can be used to perform an affine
transformation, which will be discussed in detail in Chapter 5.
These data-points $\bar{x}_i$ $(i = 1, 2, \ldots, m)$ are stacked to form an augmented dataset $\bar{X} \in \bar{X}^p$, which is the well-known moment matrix in function approximation theory [1-4]. Again, this is for convenience in formulation. We may not form such a matrix in computation.
$y = [y_1, y_2, \ldots, y_k], \quad y \in Y^k \subseteq \mathbb{R}^k$   (1.6)
For the toy example-1, $y_{ij}$ $(i = 1, 2, \ldots, 8000;\ j = 1, 2)$ are 8,000 pairs of real numbers in the 2D space $Y^2$. For the toy example-2, each label, $y_{i1}$ or $y_{i2}$, has a value of 0 or 1 (or $-1$ or 1), but the labels can still be viewed as living in $Y^2$.
These labels $y_i$ $(i = 1, 2, \ldots, m)$ can be stacked to form a label set $Y \in Y^k$, although we may not really do so in computation.
$\hat{w} = [w_0, w_1, \ldots, w_P] \in W^P$   (1.8)
We will discuss in later chapters the details of $W^P$ for various models, including estimation of the dimension $P$.
It reads as follows: the ML model M uses a given dataset X with labels Y to train its learning parameters $\hat{w}$, and produces a map (a set of giant functions) that makes a prediction in the label space for any point in the feature space.
The ML model shown in Eq. (1.9) is in fact a data-parameter
converter: it converts a given dataset to learning parameters during training
and then converts the parameters back in making a prediction for a given set
of feature variables. It can also be mathematically viewed as a giant function
with k components in the feature space Xp and controlled (parameterized)
by the training parameters in WP . When the parameters are tuned, one gets
a set of k giant functions over the feature space.
On the other hand, this set of k giant functions can also be viewed as
continuous (differentiable) functions of these parameters for any given data-
point in the dataset, which can be used to form a loss function that is also
differentiable. Such a loss function can be the error between these k giant
functions and the corresponding k labels given in the dataset. It can be
viewed as a functional of prediction functions that in turn are functions of
ŵ in the vector space WP . The training is to minimize such a loss function
for all the data-points in the dataset, by updating the training parameters
to become minimizers. This overview picture will be made explicitly into
a formula in later chapters. The success factors for building a quality ML model include: (1) the type of hypothesis, (2) the number of learning parameters in $W^P$, (3) the quality of the dataset in $X^p$ (its representativeness of the underlying problem to be modeled, including correctness, size, data-point distribution over the feature space, and noise level), and (4) the techniques used to find the minimizing learning parameters that best reproduce the labels in the dataset.
We will discuss this in detail in later chapters for different machine learning
models.
Concepts on spaces are helpful in our later analysis of the predictive
properties of machine learning models. Readers may find it difficult to comprehend these concepts at this stage, and thus are advised to just have a rough idea for now and to revisit this section when reading the relevant
chapters. Readers may jump to Section 13.1.5 and take a look at Eq. (13.13)
there just for a quick glance on how the spaces evolve in a deepnet.
Note also that there are ML models for discontinuous feature variables,
and the learning parameters may not need to be continuous. Such methods
are often developed based on proper intuitive rules and techniques, and
we will discuss some of those. The concepts on spaces may not be directly
applicable but can often help.
1.7 Types of Data

Data are the key to any data-based model. There are many types of data available for different types of problems that one may make use of.
Note that the quality and the sampling domain of the dataset play
important roles in training reliable machine learning models. Use of a trained model beyond the data sampling domain requires special caution, because the model can go wrong unexpectedly and hence be very dangerous.
1.9 This Book

The author has made a substantial effort to write Python codes to demonstrate
the essential and difficult concepts and formulations, which allows readers to comprehend each chapter more readily. Based on the learning experience of the author, this can make the learning more effective.
The chapters of this book are written, in principle, to be readable independently, which requires allowing some duplication. The necessary cross-references between chapters are kept to a minimum.
1.10 Who May Read This Book

The book is written for beginners interested in learning the basics of machine
learning, including university students who have completed their first
year, graduate students, researchers, and professionals in engineering and
sciences. Engineers and practitioners who want to learn to build machine
learning models may also find the book useful. Basic knowledge of college
mathematics is helpful in reading this book smoothly.
This book may be used as a textbook for undergraduates (3rd year or
senior) and graduate students. If this book is adopted as a textbook, the
instructor may contact the author (liugr100@gmail.com) directly for some
homework and course projects and solutions.
Machine learning is still a fast-developing area of research. There still exist many challenging problems, which offer ample opportunities for developing new methods and algorithms. Currently, it is a hot topic of research
and applications. Different techniques are being developed every day, and
new businesses are formed constantly. It is the hope of the author that this
book can be helpful in studying existing and developing machine learning
models.
1.11 Codes Used in This Book

The book has been written using Jupyter Notebooks with codes.
Readers who purchased the book may contact the author directly (liugr100@gmail.com) to request a softcopy of the book with codes
(which may be updated), free for academic use after registration. The
conditions for use of the book and codes developed by the author, in both
hardcopy and softcopy, are as follows:
1. Users are entirely at their own risk using any part of the codes and techniques.
2. The book and codes are only for your own use. You are not allowed to further distribute them without permission from the author of the code.
3. There will be no user support.
4. Proper reference and acknowledgment must be given for the use of the
book, codes, ideas, and techniques.
Note that the handcrafted codes provided in the book are mainly for
studying and better understanding the theory and formulation of ML
methods. For production runs, well-established and well-tested packages should be used, and there are plenty out there, including but not limited to Scikit-learn, PyTorch, TensorFlow, and Keras. Also, our codes provided
are often run with various packages/modules. Therefore, care is needed when
using these codes, because the behavior of the codes often depends on the
versions of Python and all these packages/modules. When the codes do not
run as expected, version mismatch could be one of the problems. When this
book was written, the versions of Python and some of the packages/modules
were as follows:
For example,
import keras
print('keras version',keras.__version__)
import tensorflow as tf
print('tensorflow version',tf.version.VERSION)
If the version is indeed an issue, one would need to either modify the code to fit the version or install the correct version on your system, perhaps by creating an alternative environment. It is very useful to query the web using the error message; solutions or leads can often be found. This is the approach the author often takes when encountering an issue in running
a code. Finally, this book has used materials and information available on
the web with links. These links may change over time, because of the nature
of the web. The most effective way (and one often used by the author) of dealing with this matter is to use keywords to search online, if a link is lost.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course,
Butterworth-Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor
and Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space
Theory and Weakened Weak Forms, World Scientific, New Jersey, 2013.
[5] G.R. Liu and X. Han, Computational Inverse Techniques in Nondestructive Evalua-
tion, Taylor and Francis Group, New York, 2003.
[6] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, New York, 1962. https://books.google.com/books?id=7FhRAAAAMAAJ.
[7] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations
by Error Propagation, 1986.
[8] G.R. Liu, FEA-AI and AI-AI: Two-way deepnets for real-time computations for both
forward and inverse mechanics problems, International Journal of Computational
Methods, 16(08), 1950045, 2019.
[9] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., TubeNet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
2 Basics of Python
This chapter discusses the basics of the Python language for coding machine learning models. Python is a very powerful high-level programming language without the need for explicit compiling, yet with some level of the efficiency of a machine-level language. It has become the most popular tool for the development of tools and applications in the general area of machine learning. It has rich libraries with open access, and new libraries are constantly being developed. The language itself is powerful in terms of functionality. It is an excellent tool for effective and productive coding and programming. It is also fast, and the structure of the language is well built for making use of bulky data, which is often the case in machine learning.
This chapter is not formal training on Python; it is just to help readers have a smoother start in learning and practicing the materials in the later chapters. Our focus will be on some useful simple tricks that are familiar to the author, and on some behavior subtleties that often affect our coding in ML. Readers familiar with Python may simply skip this chapter. We will
use the Jupyter Notebook as the platform for the discussions, so that the
documentation and demonstration can be all in a single file.
You may go online and have the Jupyter Notebook installed from, for example, https://www.anaconda.com/distribution/, where you can have the Jupyter Notebook and Python installed at the same time, possibly along with another useful Python IDE (Integrated Development Environment) called Spyder. On my laptop, I have all three pieces ready to use.
A Jupyter Notebook consists of "cells" of different types: cells for codes and cells for text called "markdown". Each cell is framed with color borders,
and the color shows up when the cell is clicked on. A green color border
indicates that this cell is in the input mode, and one can type and edit the
contents. Pressing “ctrl + Enter” within the cell, the green border changes
to blue color, indicating that this cell is formatted or executed, and may
produce an outcome. Double clicking on the blue framed cell sets it back to
the input mode. The right vertical border is made thicker for better viewing.
This should be sufficient for us to get going. One will get more skills (such
as adding cells, deleting cells, and converting cell types) by playing and
navigating among the menu bars on the top of the Notebook window.
Googling the open online sources is excellent for getting help when one
has a question. The author does this all the time. Sources of the reference
materials include the following:
• https://docs.python.org/3.7/
• https://docs.scipy.org/doc/numpy/reference/
• https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
• https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
• https://www.python.org/about/
• https://www.learnpython.org/
• https://en.wikipedia.org/wiki/Python_(programming_language)
• https://www.learnpython.org/en/Basic_Operators
• https://www.python.org/
• https://jupyter.org/
• https://www.youtube.com/watch?v=HW29067qVWkb
• https://pynative.com/
!jupyter --version
jupyter core : 4.7.1
jupyter-notebook : 6.3.0
qtconsole : not installed
ipython : 7.16.1
ipykernel : 5.3.4
jupyter client : 6.1.12
jupyter lab : not installed
nbconvert : 6.0.7
ipywidgets : 7.6.3
nbformat : 5.1.3
traitlets : 4.3.3
2.1 An Exercise
Let us have a less conventional introduction here. Unlike other books on computer languages, we start the discussion with how to make use of our own codes that we may develop during the course of study.
First, we “import” the Python system library or module from external or
internal sources, so that functions (also called methods) there can be used
in our code for interaction with your computer system. The most important
environment setting is the path.
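The cell itself is not reproduced in this extraction; it was presumably similar to the following sketch (the folder name grbin comes from the discussion below):

import sys
sys.path.append('grbin')   # add our own code folder to the module search path
#print(sys.path)           # uncomment to list all search paths
import grcodes as gr       # our own module, used below via its alias gr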
Note that “#” in a code cell starts a comment line. It can be put anywhere
in a line in a code. Everything in the line behind # becomes comments, and
is not treated as a part of the code.
One may remove the "#" in front of print(sys.path) and execute the cell, and a number of paths will be printed. Many of them were set during the installation of the system and various codes, including Anaconda and Python. "grbin" in the current working directory has just been added using sys.path.append().
When multiple lines of comments are needed, we use “doc-strings” as
follows:
'''Inside here are all comments with multiple lines. It is \
a good way to convey information to users, co-programmers. \
Use a backslash to break a line.'''
The following cell contains the Python code “grcodes.py”. Readers may
create the “grcodes.py” file and put it in the folder “grbin” (or any other
folder), so that the cell above can be executed and “gr.printx()” can be used.
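The file content is not reproduced in this extraction; a minimal, assumed version that matches the usage below is:

# grcodes.py -- a minimal, assumed version of the author's helper module
import inspect

def printx(name):
    '''printx(name)
    This prints out both the name and its value together.
    usage: name = 88.0; printx('name')'''
    frame = inspect.currentframe().f_back                        # the caller's frame
    value = frame.f_locals.get(name, frame.f_globals.get(name))  # look the variable up by name
    print(name, '=', value)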
Let us try to use a function in the imported module grcodes, using its
alias gr.
printx(name)
This prints out both the name and its value together.
usage: name = 88.0; printx('name')
Nice. I have actually completed a simple task of printing out “x” using
Python, and in two ways. The gr.printx function is equivalent to doing the
following:
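A sketch of the equivalent plain-Python cell (assumed):

x = 1.0
print('x =', x)   # the name is typed once in the string and once as the variable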
Notice in this case that you must type the same x twice, which leaves room for error. A good code should have as little repetition as possible.
gr.printx('x')
printx('x') # Notice that "gr." is no longer needed.
x = 1.0
x = 1.0
2.2 Briefing on Python

Now, what is Python? Python was created by Guido van Rossum and first
released in 1991. Python’s design philosophy emphasizes code “readability”.
It uses an object-oriented approach aiming to help programmers to write
clear, less repetitive, logical codes for small- and large-scale projects that
may have teams of people working together.
Python is open source, and its interpreters are available for many
operating systems. A global community of programmers develops and
maintains CPython, an open-source reference implementation. A non-
profit organization, the Python Software Foundation, manages and directs
resources for Python and CPython development.
The language’s core philosophy is summarized in the document The Zen
of Python (PEP 20), which includes aphorisms such as the following:
This would mark the misspelled words for you (but will not provide
suggestions). Other necessary modules with add-on functions may also be
installed in a similar manner.
This book covers, in a brief manner, only a tiny portion of Python.

2.3 Variable Types
2.3.1 Numbers
my_int = 48          # an integer
print(my_int)

48

type(my_int)

int

type(48.0)           # a decimal number is of type float

float

my_complex = (8+5j)  # a complex number
print(my_complex)

(8+5j)
if my_string == "Hello!": # comparison operators: ==, !=, <, <=, >, >=
print("A string: %s" % my_string) # Indented 4 spaces
A string: Hello!
print("This is a float, and it is %f" % 10.0)    # assumed cell; output below
print("This is an integer, and it is: %d" % 20)

This is a float, and it is 10.000000
This is an integer, and it is: 20
To list all variables, functions, and modules currently in memory, one may use the %whos magic command in a Jupyter cell. Conversions between number types work as sketched below (the original cell is not shown):

print(float(5))                # 5.0
print(float(6))                # 6.0
print(int(7.0))                # 7
print('int(7.0) =', int(7.0))  # int(7.0) = 7
a = 1.0
print('a=',a, 'at memory address:',id(a))
b = a
print('b=',b, 'at memory address: ',id(b))
a, b = 2.0, 3.0
print('a=',a, 'at memory address: ',id(a))
print('b=',b, 'at memory address: ',id(b))
n1=100000000000
n_1=100_000_000_000 # for easy reading
print('Yes, n1 is same as n_1') if n1==n_1 else print('No')
# Ternary if Statement
n2=1_000_000_000
print('Total=',n1+n2)
print('Total=',f'{n1+n2:,}') # f-string (Python3.6 or later)
total=n1+n2
print('Total=',f'{total:_}')
Yes, n1 is same as n_1
Total= 101000000000
Total= 101,000,000,000
Total= 101 000 000 000
2.3.3 Strings
Strings are bits of text, which are very useful in coding in generating labels
and file names for outputs. Strings can be defined with anything between
quotes. The quote can be either a pair of single quotes or a pair of double
quotes.
Although both single and double quotes can be used, when there are apostrophes in a string, one should use double quotes; otherwise, the apostrophes would terminate the string if single quotes are used, and vice versa. For example,
One shall refer to the Python documentation when needing to include things such as carriage returns, backslashes, and Unicode characters. Below are some more handy and clean operators applied to numbers and strings. You may try them out and get some experience.
summation= 6
Summation= 6.0
3 3 3
You can split the string to a list of strings, each of which is a word.
Do not like the white-spaces between "4" and "th", and between "2" and "nd"? Use string concatenation:
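The cell producing the output below is not shown in this extraction; it presumably combined find() with string concatenation, e.g. (assumed):

my_string = "Hello world!"
pos_o = my_string.find('o')   # 4: index of the first "o"
pos_l = my_string.find('l')   # 2: index of the first "l"
print('The position of the letter "o" is right after the ' + str(pos_o) + 'th letter.')
print('The 1st letter "l" is right after the ' + str(pos_l) + 'nd letter.')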
The position of the letter "o" is right after the 4th letter.
The 1st letter "l" is right after the 2nd letter.
my_string = "ABCDEFG"
reversed_string = my_string[::-1]
print(reversed_string)
GFEDCBA
Use of repetitions
n = 8
my_list = [0]*n
print(my_list)
[0, 0, 0, 0, 0, 0, 0, 0]
abcdefg abcdefg
my_list = [1,2,3,4,5]
print(my_list*2) #concatenate 2 times and then print out
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
all_numbers = [3, 1, 4, 2]   # assumed; the original defining cell is not shown
print(sorted('BACD'), sorted('ABCD', reverse=True))
print(sorted(all_numbers), sorted(all_numbers, reverse=True))
original_list, n = [1,2,3,4], 2
new_list = [n*x for x in original_list]
# list comprehension for element-wise operations
print(new_list)
[2, 4, 6, 8]
The output below came from iterating over characters with their indices; a sketch of such a cell (assumed):

for i, letter in enumerate('abc', start=1):
    print(f'{i}: {letter}')

1: a
2: b
3: c
a, b = 1,2
try:
print(a/b) # exception raised when b=0
except ZeroDivisionError:
print("division by zero")
else:
print("no exceptions raised")
finally:
print("Regardless of what happened, run this always")
0.5
no exceptions raised
Regardless of what happened, run this always
num = 21099
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of integer
num = 21099.0
print('The memory size is %d'%sys.getsizeof(num),'bytes')
# memory size of float
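The cell being described below is not shown in this extraction; it was presumably similar to (assumed):

my_string = "Hello World!"
print(my_string.startswith("Hello"))   # True
print(my_string.endswith("asdf"))      # False
print(my_string.endswith("!"))         # True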
The first one printed True, as the string starts with “Hello”. The second one
printed False, as the string certainly does not end with “asdf”. The third
printed True, as the string ends with “!”. Their boolean values are useful
when creating conditions. More such functions:
my_string="Hello World!"
my_string1="HelloWorld"
my_string2="HELLO WORLD!"
print (my_string.isalnum()) #check if all char are numbers
print (my_string1.isalpha()) #check if all char are alphabetic
print (my_string2.isupper()) #test if string is upper case
False
True
True
False
True
True
False
n, x, s = 8888,8.0, 'string'
print (type(n), type(x), type(s)) # check the type of an object
print (len(s),len(str(n)),len(str(x)))
a = 2
print('a=',a, ' type of a:',type(a))
b = 3.0; a = a + b
print('a=',a, ' type of a:',type(a))
print('b=',b, ' type of a:',type(b))
n, x, s = 8888,8.5, 'string'
sfn = str(n) #integer to string
print(sfn,type(sfn))
sfx = str(x) #float to string
print(sfx,type(sfx))
85 <class 'int'>
16.0
0 <class 'int'>
However, operators mixing numbers and strings are not permitted; such an operation triggers a TypeError:

----------------------------------------------------------
TypeError                 Traceback (most recent call last)
<ipython-input-67-8c4f7138852e> in <module>
----> 1 my_mix = my_float + my_string
name = "John"
print("Hello, %s! how are you?" % name) # %s is for string
# for two or more argument specifiers, use a tuple:
name, age = "Kevin", 23
print("%s is %d years old." % (name,age)) # %d is for digit
Hello, John! how are you?
Kevin is 23 years old.
Any object that is not a string (a list for example) can also be formatted
using the %s operator. The %s operator formats the object as a string using
the “repr” method and returns it. For example:
list1, list2, x = [1, 2], [3, 4], 1.2345  # assumed values; the defining cell is not shown
print(f"List1:{list1};List2:{list2};x={x},x={x:.2f}, x={x:.3e}")
# powerful f-string
number = 1 + 2 * 3 / 4.0
print(number)
2.5
The modulo (%) operator returns the integer remainder of the division: dividend % divisor = remainder, while // returns the integer quotient. The cell producing the output below was presumably similar to (assumed):

print('11//2=', 11//2)
print('11%2=', 11%2)
print(11//2 * 2 + 11%2)  # recovers the dividend

11//2= 5
11%2= 1
11
squared, cubed = 7 ** 2, 2 ** 3
print('7 ** 2 =', squared, ', and 2 ** 3 =',cubed)
7 ** 2 = 49 , and 2 ** 3 = 8
bwlg_XOR = 7^2
print(bwlg_XOR)  # 5: ^ is the XOR (bitwise logic gate) operator, not power!
a, b = 100, 200
print('a=',a,'b=',b)
a, b = b, a # swapping without using a "mid-man"
print('a=',a,'b=',b)
a= 100 b= 200
a= 200 b= 100
2.4.2 Built-in functions

Python provides a number of built-in functions and types that are always available; more details can be found at https://docs.python.org/3/library/functions.html.
2.5 Boolean Values and Operators

Boolean values are two constant objects: True and False. When used as an argument to an arithmetic operator, they behave like the integers 0 and 1, respectively. The built-in function bool() can be used to cast any value to a Boolean. The definitions are as follows:
print(bool(5),bool(-5),bool(0.2),bool(-0.1),bool(str('a')),
bool(str('0')))
# True True True True True True
print(bool(0),bool(-0.0)) # These are all zero
# False False
print(bool(''),bool([]),bool({}),bool(())) # all empty (0)
# False False False False
bool() returns False only if the value is zero or the container is empty; otherwise, it returns True. Note that str('0') is neither zero nor empty.
2.6 Lists: A diversified variable type container

We have already seen lists a few times. This section gives more details. A list is a collection of variables, and it is very similar to an array (see the Numpy Arrays section for more details). A list may contain any type of variables, and as many variables as one likes. These variables are held in a pair of square brackets [ ]. Lists can be iterated over for operations when needed. A list is one of the "iterables". Let us look at the following examples.
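The cells producing the output below are not shown in this extraction; a sketch (assumed):

x_list = []                 # create an empty list
print('x_list=', x_list)
print(hex(id(x_list)))      # its memory address
x_list.append(1)
x_list.append(2)
x_list.append(3.0)          # mixed types are allowed in a list
print(x_list[2])            # 3.0
print(x_list)               # [1, 2, 3.0]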
x_list= []
0x1ae44e9f548
3.0
[1, 2, 3.0]
print('\n')
x_list2 = x_list*2
# concatenation of 2 x_list (not element-wise
# multiplication!) this creates an independent new x_list2
print(x_list2)
1,2,3.0,
1847992120648 1847993139592
1594536160
1847992120648 1847992120648
1847992120648 1847993186760
[4.0, 2, 3.0]
num = 19345678
list_of_digits=list(map(int, str(num))) #list iterable
print(list_of_digits)
list_of_digits=[int(x) for x in str(num)] #list comprehension
print(list_of_digits)
[1, 9, 3, 4, 5, 6, 7, 8]
[1, 9, 3, 4, 5, 6, 7, 8]
Element-wise addition of lists needs a little trick. The best ways, including
the use of numpy arrays, will be discussed in the list comprehension section.
Here, we use a primitive method to achieve this.
1594536896
id(list1[0])
1594536736
list1, list2 = [1, 2, 3], [10, 20, 30]  # assumed; the defining cells are not shown
add_list = []
for i1,i2 in zip(list1,list2):  # for-loop and zip() to add element-wise
    add_list.append(i1+i2)
print(add_list)
0123456789TE
Hello world!
5th= o
7-11th= world
print('[6:-1]=',my_string[6:-1])
# "-1" for the last slice from the 6th to (last-1)th
print('[:]=',my_string[:]) # all characters in the string
print('[6:]=',my_string[6:]) # slice from 7th to the end
print('[:-1]=',my_string[:-1]) # to (last-1)th
[6:-1]= world
[:]= Hello world!
[6:]= world!
[:-1]= Hello world
[3:9:2]= l o
Using a negative step, we can easily reverse a string, as we have seen earlier:
In summary, if we have just one number in the brackets, the slicing takes the character at the (number + 1)th position, because Python counts from zero. A colon stands for "all available": if it is used alone, the slice is the entire string. If there is a number on its left, the slice runs from that position to the right end, and vice versa. A negative number counts from the right end: -3 means "the 3rd character from the right end". One can also use the step option for skipping.
Note that when accessing a string with an index which does not exist, it
generates an exception of IndexError.
-----------------------------------------------------------
IndexError                Traceback (most recent call last)
<ipython-input-102-024f17c69a4f> in <module>
----> 1 print('[14]=',my_string[14]) # gives an index out of range error

[14:]=
The very similar rules detailed above for strings apply also to a list, by
treating a variable in the list as a character.
When accessing a list with an index that does not exist, it generates an
exception of IndexError.
[]
nlist = [10, 20, 30, 40, 50, 6.0, '7H'] # Mixed variables
_, _, n3,_,*nn = nlist # when only the 3rd is needed,
# skip one, and then the rest
print ('n3=',n3, 'the rest numbers',*nn)
nested_list = [[11, 12], ['2B', 22], [31, [32, 3.2]]]  # assumed definition; the original cell is not shown
print(nested_list[0])        # [11, 12]
print(nested_list[1])        # ['2B', 22]
print(nested_list[2])        # [31, [32, 3.2]]
print(nested_list)           # [[11, 12], ['2B', 22], [31, [32, 3.2]]]
print(nested_list[0][0])     # 11
print(nested_list[0][1])     # 12
print(nested_list[1][0])     # Try this: what would this be? '2B'
print(nested_list[2][1])     # [32, 3.2]
print(nested_list[2][1][0])  # 32
2.7 Tuples: Value preserved

After the discussion about lists, discussing tuples becomes straightforward. They are essentially the same, with one major difference: a tuple is immutable, so its values cannot be changed after creation.
ttuple = (10, 20, 30, 40, 50, 6.0, '7H') # create a Tuple
gr.printx('ttuple') # print(ttuple)
aa = ttuple[0]
print('aa=',aa)
print(ttuple[1], ' ',ttuple[6],' ',ttuple[-1])
A tuple can also be iterated over; the cell producing the output below was presumably similar to (assumed):

for i, v in enumerate(ttuple[:3]):
    print(i, ':', v)

0 : 10
1 : 20
2 : 30
The above may be all we need to know about Tuples. We now discuss
another useful data structure in Python.
2.8 Dictionaries: Indexable via keys

phonebook0 = {
"John" : [788567837,788347278],
# this is a list, John has 2 phonenumbers
'Mark': 513683222,
'Joanne': 656477456
}
print(phonebook0)
Like lists, dictionaries can be iterated over. Because keys and values are
recorded in pairs, we may use for-loop to access them.
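The iterating cells are not shown in this extraction; they were presumably similar to the following sketch, assuming a dictionary named phonebook with the entries seen in the output:

phonebook = {'Kevin': 513476565, 'Richard': 513387234,
             'Jim': 513682762, 'Mark': 513387234}    # assumed
for name, number in phonebook.items():   # iterate over key-value pairs
    print(name, number)
for name in phonebook.keys():            # iterate over keys only
    print(name)
for number in phonebook.values():        # iterate over values only
    print(number)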
Kevin 513476565
Richard 513387234
Jim 513682762
Mark 513387234
Kevin
Richard
Jim
Mark
513476565
513387234
513682762
513387234
To delete a pair of records, we use the built-in function del or pop, together with the key.
Now, we use a simpler means called double-star. This allows one to create a third new dictionary that is a combination of two dictionaries, without affecting the original two.
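A sketch of such a cell (the dictionary names are assumed):

phonebook3 = {**phonebook0, **phonebook}   # double-star unpacking; the two originals are unaffected
print(phonebook3)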
2.9 Numpy Arrays: Handy for scientific computation

Numpy arrays are similar to lists, but much easier to work with for scientific computations. Operations on numpy arrays are usually much faster for bulky data.

(1) Similarities:
• Both are mutable (elements can be added and removed after creation. A mutating operation is also called "destructive", because it modifies the list/array in place instead of returning a new one).
• Both can be indexed.
• Both can be sliced.
(2) Differences:
• To use arrays, one needs to import the Numpy module; lists are built-in.
• Arrays support element-wise operations; lists do not (some coding is needed for that).
• Data types in an array must all be the same, but a list can hold different types of data (part of the reason why element-wise operations are not generally supported).
• Numpy array can be multi-dimensional.
• Operations with arrays are, in general, much faster than those on lists.
• Storing arrays uses less memory than storing lists.
• Numpy arrays are more convenient to use for mathematical operations
and machine learning algorithms.
x1 = [28 3 28 0]
x1 = array([28, 3, 28, 0])
list_w= [57.5, 64.3, 71.6, 68.2] ; list_h= [1.5, 1.6, 1.7, 1.65]
narray_w= [57.5 64.3 71.6 68.2] ; narray_h= [1.5 1.6 1.7 1.65]
Let us create a function that prints out the information of a given numpy array.
def getArrayInfo(a):
'''Get the information about a given array:
getArrayInfo(array)'''
print('elements of the first axis of the array:',a[0])
print('type:',type(a))
print('number of dimensions, a.ndim:', a.ndim)
print('a.shape:', a.shape)
print('number of elements, a.size:', a.size)
print('a.dtype:', a.dtype)
print('memory address',a.data)
getArrayInfo(a)
Get the information about a given array: getArrayInfo(array)
We see here that the doc-string ''' ''' is useful for providing a simple instruction on the use of a created function. Let us now use it to get the information for narray_w.
getArrayInfo(narray_w)
[64.3, 71.6]
[64.3 71.6]
Let us now append an element to both the list and the numpy array.
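A sketch of such a cell, with the appended value 59.8 inferred from a later output (assumed):

list_w.append(59.8)                    # lists have a built-in append() method
narray_w = np.append(narray_w, 59.8)   # numpy returns a new, longer array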
print(list_w,' ',narray_w)
print(type(list_w),' ',type(narray_w))
print(len(list_w),' ', narray_w.ndim) # Use len() to get the length
To form a multi-dimensional array, we may use the following (more on this later):
arr = np.array([narray_w,narray_h])
arr
getArrayInfo(arr)
elements of the first axis of the array: [57.5 64.3 71.6 68.2]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (2, 4)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
We note that arr is of dimension 2 and has a shape of (2, 4), meaning it has two entries along axis 0 and 4 entries along axis 1. We see again that the shape of a numpy array is given in a tuple. A multi-dimensional numpy array can be transposed:
arrT = arr.T
print(arrT)
getArrayInfo(arrT) # see the change in shape from (2,4) to (4,2)
[[57.5 1.5 ]
[64.3 1.6 ]
[71.6 1.7 ]
[68.2 1.65]]
elements of the first axis of the array: [57.5 1.5]
type: <class 'numpy.ndarray'>
number of dimensions, a.ndim: 2
a.shape: (4, 2)
number of elements, a.size: 8
a.dtype: float64
memory address <memory at 0x000001AE44FB12D0>
It is seen that the dimension remains 2, but the shape is changed from (2,4) to
(4,2). The value of an entry in a numpy array can be changed.
Notice the behavior of the array created via an assignment: changes in an array
will affect the other. This behavior was observed for lists. To create an independent
array, use copy() function.
arr=np.stack([narray_w,narray_h],axis=0)
#stack up 1D arrays along axis 0
print(arr)
rarr=np.ravel(arr)
print(rarr)
getArrayInfo(rarr)
Figure 2.1: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
It is seen that the dimension is changed from 2 to 1, and the shape is changed
from (2,4) to (8,).
In machine learning computations, we often perform summation of entries of an
array along an axis of the array. This can be done easily using the np.sum function.
print(arr)
print('Column-sum:',np.sum(arr,axis=0),np.sum(arr,axis=0).shape)
print('row-sum:',np.sum(arr,axis=1),np.sum(arr,axis=1).shape)
listwh= [57.5, 64.3, 71.6, 68.2, 59.8, 1.5, 1.6, 1.7, 1.65]
print('narraywh=',narray_w+narray_h)
# + is element-wise addition for numpy arrays.
Let us compute the Body Mass Index or BMI using these narrays.
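A sketch of such a cell (assuming the weight and height arrays have equal length):

bmi = narray_w / narray_h**2   # element-wise: weight (kg) / height (m) squared
print('BMI =', bmi)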
We discussed element-wise operations for lists earlier, and used special operations
(list comprehension) and special functions such as zip(). The alternative is the
“numpy-way”. This is done by first converting the lists to numpy arrays, then
performing the operations in numpy using these arrays, and finally converting the
results back to a list. When the lists are large in size, this numpy-way can be much
faster, because all these operations can be performed in bulk in numpy, without
element-by-element accessing of the memories.
import numpy as np
list1 = [20, 30, 40, 50, 60]
list2 = [4, 5, 6, 2, 8]
(np.array(list1) + np.array(list2)).tolist()   # [24, 35, 46, 52, 68]
The results are the same as those we obtained before using special list element-
wise operations.
np.linspace(1., 4., 6)
# an array with a specified number of elements, equally spaced in value
a = np.array([1.,2.,3.])
a.fill(9.9) # all entries with the same value
print(a)
[9.9 9.9 9.9]
Note that if an error message like "No module named 'xyz'" is encountered, which is likely during this course using our codes, one shall install the "xyz" module in a similar way, so that all the functions and variables defined there can be made use of. Note also that there is a huge number of modules/libraries/packages openly available, and it is not possible to install all of them. The practical way is to install a module only when it is needed. One may encounter issues in installations, many of which are related to compatibility of the versions of the involved packages. Searching online for help can often resolve these issues, because the chance is high that someone has already encountered similar issues earlier, and the huge online community has already provided some solution.
After mxnet module is installed, we import it to our code.
import mxnet as mx
mx.__version__
'1.7.0'
We often create arrays with randomly sampled values when working with neural
networks. In such cases, we initialize an array using a standard normal distribution
with zero mean and unit variance. For example,
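The cell is not shown in this extraction; a sketch, with the API call and the shape (3, 4) assumed from the outputs below:

from mxnet import nd
y = nd.random_normal(0, 1, shape=(3, 4))   # zero mean, unit variance
y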
x=nd.exp(y)
x
[[ 0.506 0.873 1.458 1.507]
[ 1.771 0.063 2.934 0.541]
[ 6.239 0.318 1.055 0.081]]
<NDArray 3x4 @cpu(0)>
y.T
[[-0.681 0.571 1.831]
[-0.135 -2.758 -1.147]
[ 0.377 1.076 0.054]
[ 0.410 -0.614 -2.507]]
<NDArray 4x3 @cpu(0)>
Note that nd arrays behave differently from np arrays. They do not usually work together without proper conversion. Therefore, special care is needed. When strange behavior is observed, one may print out the variable to check the array type.
The same is generally true when numpy arrays work with arrays in other external
modules, because the array objects are, in general, different from one module to
another. One may use asnumpy() to convert an nd-array to an np-array, when so
desired. Given below is an example (more on this later):
np.dot(x.asnumpy(), y.T.asnumpy())
# convert nd array to np array and then use numpy np.dot()
array([[ 0.705, -1.476, -3.776],
[ 0.114, 3.662, 1.970],
[-3.860, 3.774, 10.910]], dtype=float32)
If we do not plan to reuse x, then the result can be assigned to x itself. We may
do this in MXNet:
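The cell is not shown in this extraction; a sketch of an in-place update in MXNet (assumed):

x += y   # in-place: the result is written back into x's memory
# equivalently: x[:] = x + y, or nd.elemwise_add(x, y, out=x)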
print(x)
x[1:3] # read the second and third rows from x
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 12.811 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
x[1,2] = 88.0 # change the value at the 2nd row, 3rd column
print(x)
x[1:2,1:3] = 88.0
# change the values in the 2nd row, 2nd to 3rd columns
print(x)
[[ 1.343 3.358 6.210 6.438]
[ 7.653 -2.504 88.000 1.550]
[ 26.785 0.124 4.275 -2.182]]
<NDArray 3x4 @cpu(0)>
2.9.9 Broadcasting
What would happen if one adds a vector (1D array) y to a matrix (2D array) X? In
Python, this can be done, and is often done in machine learning. Such an operation
is performed using a procedure called “broadcasting”: the low-dimensional array
is duplicated along any axis with dimension 1 to match the shape of the high-
dimensional array, and then the desired operation is performed.
import numpy as np
y = np.arange(6) # y has an initial of shape (6), or (1,6)
print('y = ', y,'Shape of y:', y.shape)
x = np.arange(24)
print('x = ', x,'Shape of x:', x.shape)
y = [0 1 2 3 4 5] Shape of y: (6,)
x = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
20 21 22 23]
Shape of x: (24,)
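The cells producing the output below are not shown; they were presumably similar to (assumed):

X = x.reshape(4, 6)   # reshape the 24-element 1D array into 4 rows and 6 columns
X.shape

print('X = \n', X, 'Shape of X:', X.shape)
print('X + y = \n', X + y)   # y, of shape (6,), is broadcast across the 4 rows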
(4, 6)
X =
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]] Shape of X: (4, 6)
X + y =
[[ 0 2 4 6 8 10]
[ 6 8 10 12 14 16]
[12 14 16 18 20 22]
[18 20 22 24 26 28]]
z = np.reshape(X,(2,3,4))
print (z)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
a = np.arange(12).reshape(3,4)
a.fill(100)
a
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
a = np.arange(4)
a.fill(100)
a
array([100, 100, 100, 100])
z + a
array([[[100, 101, 102, 103],
[104, 105, 106, 107],
[108, 109, 110, 111]],
Two dimensions are compatible when they are equal, or when one of them is 1 (the shapes are compared from the trailing axes). If these conditions are met, the dimensions are compatible; otherwise they are not, and a ValueError is raised. Figure 2.2 shows some examples.
Broadcasting operations:
Figure 2.2: Picture modified from that in “Introduction to Numerical Computing with
NumPy”, SciPy 2019 Tutorial, by Alex Chabot-Leclerc.
import numpy as np
x = np.arange(24).reshape(4,6)
y = np.arange(6)
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
numpy.ndarray
npa = np.ones((2, 3))    # assumed; the original defining cell is not shown
npa
ndy = nd.array(npa)      # convert a numpy array to an MXNet NDArray
ndy
To figure out the detailed differences between the MXNet NDArrays and the NumPy arrays, one may refer to https://gluon.mxnet.io/chapter01_crashcourse/ndarray.html.
(vectors, matrices, etc.), and broadcasting is done over other dimensions. One can also produce custom ufunc instances using the frompyfunc factory function".

More details on ufuncs can be found in the SciPy documentation (https://docs.scipy.org/doc/numpy/reference/ufuncs.html). Two of the many ufuncs are:

• exp(x, /[, out, where, casting, order, ...]): Calculate the exponential of all elements in the input array.
• log(x, /[, out, where, casting, order, ...]): Natural logarithm, element-wise.

One may use help(np.exp), for example, for more details.
First, we note that a numpy array is far more than a 1D vector or a 2D matrix. By default, we can have as many as 32 dimensions, and even that can be changed. Thus, the numpy array is extremely powerful, and it is a data structure that works well for complex machine learning models that use multidimensional datasets and data flows. All the element-wise operations, broadcasting, handling of flows of large-volume data, etc., work very efficiently. It does not, however, follow precisely the concepts of vectors and matrices established in conventional linear algebra and most frequently used. This is essentially the root of confusion in many cases, in the author's opinion. Understanding the following key points is a good start toward mitigating the confusion.
A numpy 1D array is similar to the usual vector concept in linear algebra. The difference is that a 1D array does not distinguish between row and column vectors. It is just a 1D array with a shape of (n,), where n is the length of the array. It behaves largely like a row vector, but not quite. For example, transpose() has no effect on it, because what the transpose() function does is swap two axes of an array that has two or more axes.
The column vector of linear algebra should be treated in numpy as a special case of a 2D array with only one column. One can create an array like the column vector in linear algebra by adding an additional axis to the 1D array. See the following examples.
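The cells producing the outputs below are not shown in this extraction; a sketch (assumed):

a = np.array([1, 2, 3])
print(a, a.transpose(), a.shape, a.transpose().shape)  # transpose() has no effect on a 1D array
a_col = a[:, np.newaxis]                               # add an axis: a (3, 1) column vector
print(a_col, a_col.shape)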
[1 2 3] [1 2 3] (3,) (3,)
[[1]
[2]
[3]] (3, 1)
The axis-added array becomes a 2D array, and the transpose() function now works: it creates a "real" row vector that is a special case of a 2D array in numpy. The same can also be achieved using the following tricks (assumed cells matching the output below):
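print(a.reshape(-1, 1), a.reshape(-1, 1).shape)   # reshape into one column
print(a[:, None], a[:, None].shape)               # None is an alias for np.newaxis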
[[1]
[2]
[3]] (3, 1)
[[1]
[2]
[3]] (3, 1)
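Broadcasting the column vector against the original 1D array then produces the "outer sum" matrix shown below; the cell was presumably (assumed):

print(a_col + a)   # (3, 1) broadcast against (3,) gives a (3, 3) matrix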
[[2 3 4]
[3 4 5]
[4 5 6]]
Once knowing how to create arrays equivalent to the usual row and column
vectors in the conventional linear algebra, we shall be much more comfortable in
debugging codes when encountering strange behavior.
Another often encountered example is solving linear system equations. From conventional linear algebra, we know that the right-hand side (rhs) should be a column vector, and the solution should also be a column vector. When using numpy.linalg.solve() for a linear algebraic equation, we can feed in a 1D array as the rhs vector, and it will return a solution that is also a 1D array. Of course, we can also feed in a column vector (a 2D array with only one column); in that case, we get the solution in a 2D array with only one column. We shall see examples in Section 3.1.11, and many cases in later chapters. These behaviors are all expected and correct in numpy.
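A minimal sketch illustrating both forms (the matrix and rhs values are assumed):

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])              # 1D rhs
print(np.linalg.solve(A, b))        # 1D solution: [2. 3.]
b_col = b[:, np.newaxis]            # column-vector rhs, shape (2, 1)
print(np.linalg.solve(A, b_col))    # column-vector solution, shape (2, 1)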
A numpy 2D array is largely similar to the usual matrix in linear algebra, but not quite. For example, matrix multiplication is often done in numpy using the dot-product, such as np.dot(), or the "@" operator (Python 3.5 or later). The "*" operator is an element-wise multiplication, as shown in Section 2.9.4. We will see more examples in Chapter 3 and later chapters. Also, some operations on a numpy array can result in a dimension change. For example, when mean() is applied to an array, the dimension is reduced. Thus, care is required regarding which axis is collapsed. A minimal sketch (values assumed):
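M = np.array([[1., 2.],
              [3., 4.]])
print(M @ M)                                  # matrix multiplication, same as np.dot(M, M)
print(M * M)                                  # element-wise multiplication
print(M.mean(axis=0), M.mean(axis=0).shape)   # axis 0 is collapsed: shape (2,)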
Note that there is a numpy matrix class (see the definition of class in Section 2.14) that can be used to create matrix objects. These objects behave quite similarly to the matrix in linear algebra. We try not to use it, because it will be deprecated one day, as announced in the online document numpy-ref.pdf.
Based on the author's limited experience, once we are aware of these differences and behavior subtleties (with more discussion later when we encounter them), we can pay proper attention to them. It is often helpful to frequently check the shapes of the arrays. This allows us to work more effectively with powerful numpy arrays, including performing proper linear algebra analysis. It is quite difficult to discuss the theorems of linear algebra using 1D, 2D, or higher-dimensional array concepts. In the later chapters, we will still follow the general rules and principles, and use the terms vector and matrix of conventional linear algebra, because many theoretical findings and arguments are based on them. A vector refers generally to a 1D numpy array, and a matrix to a 2D numpy array. When examining the outcomes of numpy codes, we shall keep in mind the behavior subtleties of numpy arrays.
sentence1 = "His first name is Mark and Mark is his first name"
words1 = sentence1.split() # use split() to form a set of words
print(words1) # whole list of these words is printed
word_set1 = set(words1) # convert to a set
print(word_set1) # print. No duplication
Using a set to get rid of duplication is useful for many situations. Many other
useful operations can be applied to sets. For example, we may want to find the
intersection of two sets. To show this, let us create a new list and then a new set.
sentence2 = "Her first name is Jane and Jane is her first name"
words2 = sentence2.split()
print(words2) # whole list of these words is printed.
word_set2 = set(words2) # convert to a set
print(word_set2) # print. No duplication
print(word_set1.difference(word_set2))
{'his', 'His', 'Mark'}
This finds the words in word_set1 that are not in word_set2.

We may also want to find the words that are in either set but not in both (the symmetric difference):
print(word_set1.symmetric_difference(word_set2))
{'her', 'His', 'Jane', 'Her', 'Mark', 'his'}
We have used list comprehension in particular situations a few times. List comprehension is a very powerful tool for operations on all iterables, including lists and numpy arrays. When used on a list, it creates a new list based on another list, in a single, readable line.

In the following example, we would like to create a list of integers specifying the length of each word in a sentence, but only if the word is not "the". A natural way to do this is as follows:
sentence = "Raises the Sun and comes the light"  # assumed to match the output below
words = sentence.split()
word_lengths = [len(word) for word in words if word != "the"]
words_nothe = [word for word in words if word != "the"]
print(words_nothe, ' ', word_lengths)
['Raises', 'Sun', 'and', 'comes', 'light'] [6, 3, 3, 5, 5]
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
fx = []
x = np.arange(-2, 2, 0.5) # Numpy array with equally spaced values
fx = np.array([-(-xi)**0.5 if xi < 0.0 else xi**0.5 for xi in x])
# Creates a piecewise function (such as activation function).
# The created list is then converted to a numpy array.
print('x =',x); print('fx=',fx)
x = 2
print(x == 2)   # the comparison results in a boolean value: True
print(x == 3)   # the comparison results in a boolean value: False
print(x < 3)    # the comparison results in a boolean value: True
True
False
True
x = 2
if x == 2:
    print("x equals 2!")
else:
    print("x does not equal 2.")
x equals 2!
groupA = ["Mark", "Jane"]   # assumed sample data (the defining cell is not shown)
name2 = "John"              # assumed name to search for
if name2 in groupA:
    print("The person's name is either", groupA[0], "or", groupA[1])
else:
    print(name2, "is not found in group A")
x, y = ['a','b'], ['a','b']
z = y   # makes z point to the same object as y
print(x is y, 'x=', x, 'y=', y, hex(id(x)), hex(id(y)))
# False: x is NOT y (equal values but different objects)
y.append(x)
print('After change one value in y')
print(x == y, 'x=', x, 'y=', y, hex(id(x)), hex(id(y)))
# False: their values are no longer equal
print(x is y, 'x=', x, 'y=', y, hex(id(x)), hex(id(y)))
# False: x is still NOT y
print(y is z, ' y=', y, 'z=', z, hex(id(y)), hex(id(z)))
# True: y is z - they change together!
The “not” operator can also be used with “is” and “in”: “is not”, “not in”:
x, y = [1,2,3], [1,2,3]
z = y # makes z pointing to same object as y
print(x == y,' x=',x,'y=',y,hex(id(x)),hex(id(y))) # True,
# because the values in x and y are equal
Note that there is no limit in Python on how many elif blocks one can use in an if
statement.
For loops can iterate over a sequence of numbers using the range() function, a
built-in function which returns a range object: a sequence of integers. It generates
integers between the given start and stop values, and is generally used to iterate
over with for loops.
For a given list of five numbers, let us display each element doubled, using a
for loop and the range() function, as sketched below.
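A minimal sketch (the list values are hypothetical; the original cell is not reproduced in this extract):

numbers = [5, 10, 15, 20, 25]   # hypothetical list of five numbers
for i in range(len(numbers)):
    print(numbers[i] * 2)       # each element, doubled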
“break” and “continue” statements: break is used to exit a “for” loop or a “while”
loop, whereas continue is used to skip the rest of the current iteration and return
to the “for” or “while” statement.
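For instance, the two loops sketched below (re-created; the original cells are not shown) produce the outputs that follow: the first prints all integers in range(10), and the second uses continue to skip the odd ones.

for x in range(10):
    print(x, end=',')
print()
for x in range(10):
    if x % 2 == 1:
        continue        # skip odd numbers
    print(x, end=',')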
0,1,2,3,4,5,6,7,8,9,
0,2,4,6,8,
When the loop condition fails, the code in the “else” part is executed. If a break
statement is executed inside the loop, the “else” part is skipped. Note that
the “else” part is executed even if there is a continue statement before it.
# Prints out 0,1,2,3,4 and then prints "count value reached 5"
count = 0
nlimit = 5
while count < nlimit:
    print(count, end=',')
    count += 1
else:
    print("count value reached %d" % (nlimit))
condition = True
if condition:
    x = 1
else:
    x = 0
print(x)
1
condition=True
x=1 if condition else 0
print(x)
1
It is simple, readable, and DRY (it avoids repeating yourself). Thus, conditional expressions are frequently used in Python.
Functions offer a convenient way to divide code into useful blocks that can be called
as many times as needed. This can drastically reduce code repetition, and makes
code clean, more readable, and easy to maintain. In addition, functions are
a good way to define interfaces for easy sharing of code among programmers.
Functions may be created with return values to the caller, using the keyword
“return”.
def sum_two_numbers(a, b):   # re-created here; the original defining cell is not shown
    return a + b

x, y = 2.0, 8.0
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}')   # print('%f + %f = %f'%(x,y,apb))
x, y = 20, 80
apb = sum_two_numbers(x, y)
print(f'{x} + {y} = {apb}, perfect!')
# print('%d + %d = %d: perfect!'%(x,y,apb))
print(f'{x + y}, perfect!')   # f-strings allow expressions inside the braces
Variable scope: the LEGB rule (local, enclosing, global, built-ins) defines the order
in which Python searches for a variable. The search terminates when the variable is
found.
x, y = 2.0, 8.0
print('Before the function is called,x=',x,'y=',y)
apb = sum_two_numbers(x, y)
print('After the function is called, %f + %f = %f'%(x,y,apb))
Lambda functions are often used together with normal functions, especially for
returning a value where a single-line function comes in handy.
def func2nd_order(a, b, c):
    return lambda x: a*x**2 + b*x + c

f2 = func2nd_order(2, -4, 6)
print('f2(2)=', f2(2), 'or', func2nd_order(2,-4,6)(2))
f2(2)= 6 or 6
A class is a single entity that encapsulates variables and functions (or methods).
A class is essentially a template for creating class objects. One can create an unlimited
number of class objects with it, and each object gets its structure, variables/attributes,
and functions (or methods) from the class. A minimal definition consistent with the
help() output shown below might look like this:
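class C(object):
    ''' A simplest possible class named "C" '''
    ca = 'class attribute'

i1 = C()   # two instances, used further below (re-created)
i2 = C()
help(C)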
class C(builtins.object)
| A simplest possible class named "C"
|
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| --------------------------------------------------------------
| Data and other attributes defined here:
|
| ca = 'class attribute'
This example shows how a class is structured, how the docstring given in ''' ''' in the class
definition conveys a message to the user, and how an attribute can be
created in the class. We can now use it to observe some behaviors of class attributes
and instance attributes.
The change is only effective for the instance attribute that is changed.
Class attributes and object instance attributes are stored in separate dictionaries:
C.__dict__
mappingproxy({'__module__': '__main__',
              '__doc__': ' A simplest possible class named "C" ',
              'ca': "The 2nd changed the class attribute 'ca'",
              '__dict__': <attribute '__dict__' of 'C' objects>,
              '__weakref__': <attribute '__weakref__' of 'C' objects>})
i1.__dict__
i2.__dict__
{}
No instance dictionary entry has been created, because no change was made at the
instance level: the attribute stays with the class.
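Suppose we now change the attribute on the instance i2 (a sketch; the original cell is not reproduced):

i2.ca = "changed at the instance level"   # hypothetical new value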
i2.__dict__
A dictionary entry has now been created, because the change was made at the instance
level, which decouples it from the class level. Any future change to this attribute
at the class level will no longer affect the attribute at the instance level.
class Circle:
    ''' Class "Circle": Compute the area of a circle '''
    pi = 3.14159   # class attribute for a constant used class-wide,
                   # and class specific
    def __init__(self, radius):   # a special constructor
        # __init__ is executed when the class is called.
        # It is used to initialize a class instance. For this simple
        # task, we need only one variable: radius.
        # "self" reserves an argument place for the instance
        # (to be created) itself to pass along.
        # It is used for all the functions in a class.
        self.radius = radius   # This allows the instance to access
                               # the variable: radius.
    def circle_area(self):     # function computing the circle area
        return self.pi * self.radius**2   # pi gets in there via
                                          # the object instance itself

r = 10
c10 = Circle(r)   # create an instance c10. c10 is now passed to
                  # self inside the class definition;
                  # 10 is passed to self.radius
print('Circle.pi before re-assignment', Circle.pi)
# access pi via the class
print('Radius=',c10.radius)
# access via object c10.radius is the self.radius in __init__
print('c10.pi before re-assignment', c10.pi)
# The class attribute is accessed via instance attribute
Circle.pi before re-assignment 3.14159
Radius= 10
c10.pi before re-assignment 3.14159
It is seen that the class Circle works well. Let us now create a subclass.
class P_circle(Circle):
    # Subclass P_circle refers to the base (or parent) class
    # Circle in (). This establishes the inheritance.
    ''' Subclass "P_circle" based on class "Circle": Compute the
        area of a circle portion '''
    def __init__(self, radius, portion):
        # with 3 arguments: self, radius, and portion.
        super().__init__(radius)   # This brings in the attributes
                                   # from the base class Circle.
        self.portion = portion     # Subclass attribute.

    def pcircle_area(self):
        # function computing the area of a partial circle
        return self.portion * self.circle_area()   # New function in the
        # subclass. The base class method circle_area() is used here.
Readers may remove the “#” in the above cell, execute it, and take a moment to read
through the information, to see how the subclass is structured, its connection
with the base class, how self is used to prepare for connections with the future
objects to be assigned, and which attributes and functions are newly created and
which are inherited from the base class.
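An instance of the subclass can then be created; the following sketch is consistent with the two printed values below (the portion value is hypothetical):

pc10 = P_circle(10.0, 0.5)   # radius 10.0; the portion value is assumed
print(pc10.pi)               # class attribute inherited from Circle
print(pc10.radius)           # instance attribute set via the base __init__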
3.14159
10.0
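For the outputs further below to be consistent, a class-level re-assignment appears to have been made at this point (a sketch; the original cell is not shown):

Circle.pi = 3.14   # re-assign the class attribute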
pc10.pi = 3.14
pc10.pi
3.14
c10.pi
3.14
It remains unchanged by that action: setting pc10.pi only created an instance attribute on pc10. Actions on the subclass instance do not affect the base class.
2.15 Modules
We now touch upon modules. A module is a Python file that provides a specific
functionality. For example, when writing a finite element program, we may write
one module for creating the stiffness matrix and another for solving the system
equations. Each module is a separate Python file, which can be written and edited
independently. This helps a lot in organizing and maintaining large programs.
A module in Python is a Python file with the .py extension, and the file name is
the module name. Such a module can have a set of functions, classes, and variables
defined. In a module, one can import other modules using the procedure mentioned
at the beginning of this chapter.
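As a minimal sketch (the file and function names are hypothetical), a module and its use could look as follows:

# file: area_module.py
def circle_area(radius):
    '''Compute the area of a circle.'''
    return 3.14159 * radius**2

# file: main.py
import area_module
print(area_module.circle_area(10))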
Python is also very powerful in generating plots, by importing modules
that are openly available. Here, we shall present a simple demo plot of scattered
circles.
First, we import the modules needed.
import numpy as np
import matplotlib.pyplot as plt
# matplotlib.pyplot is a plot function in the matplotlib module
%matplotlib inline
# to have the plot generated inside the notebook
# Otherwise, it will be generated in a new window
We now generate sample data, and then plot 80 randomly generated circles.
n=80
x=np.random.rand(n) # Coordinates randomly generated
y=np.random.rand(n)
colors=np.random.rand(n)
areas=np.pi*(18*np.random.rand(n))**2
# circle radii from 0 to 18, randomly generated
plt.scatter(x,y,s=areas,c=colors,alpha=0.8)
plt.show()
Figure 2.3: Randomly generated circular areas filled with different colors.
# Plot a curve
x = range(1000)
y = [i ** 2 for i in x]
plt.plot(x,y)
plt.show();
x = np.linspace(0, 1, 1000)**1.5
plt.hist(x);
Performance assessment of code can be done in two ways. Typical example code is
given below. Readers may make use of it for assessing the computational
performance of their own code.
import time                       # needed for process_time()
g = list(range(10_000_000))
# print(g)
q = np.array(g, 'float64')
# print(q)
start = time.process_time()
sg = sum(g)                       # Python's built-in sum on a list
t_elapsed = time.process_time() - start
print(sg, 'Elapsed time=', t_elapsed)
start = time.process_time()
sq = np.sum(q)                    # numpy's sum on an array
t_elapsed = time.process_time() - start
print(sq, t_elapsed)
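The second way uses the %timeit magic of Jupyter, which runs a statement repeatedly and reports timing statistics; a sketch (the exact statement timed in the original cell is not shown):

%timeit sum(g)   # repeated timing of the list summation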
329 ms ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.18 Summary
With the basic knowledge of Python and its related modules, and the essential
techniques for coding, we are now ready to write code for computations and
machine learning techniques. In this process, one can also gradually improve his/her
Python coding skills.
Reference
[1] C.R. Harris, K.J. Millman, S.J. van der Walt et al., Array programming with NumPy,
Nature, 585(7825), 357–362, Sep 2020. http://dx.doi.org/10.1038/s41586-020-2649-2.
Chapter 3
Basic Mathematical Computations
Linear algebra is essential for any computation that involves big data,
such as in machine learning. We plan to briefly review basic linear algebra
operations through the use of Python programming, using modules that
have already been developed in the Python community at large. We shall go
through the basic concepts, the mathematical notation, the data structure, and
the computation procedure. Readers should feel free to skim or skip this chapter
if they are already confident in basic linear algebra computations. Our
discussion will start from the data structure. First, we import the necessary
modules and functions.
We will also use the MXNet package. If not yet installed, MXNet can be
installed using: pip install mxnet.
After the installation, we import MXNet.
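A minimal import consistent with the code used in this chapter (a sketch):

import mxnet as mx
from mxnet import nd   # the NDArray module used throughout this chapter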
3.1.2 Vectors
A vector refers to an object that has more than one component stacked along
one dimension. It can have a physical meaning depending on the type of the
physics problem. For example, a force vector in three-dimensional (3D) space
has three components, each representing the projection of the force vector
onto one of the three axes of the space. The number of components is also known
as the number of degrees of freedom (DoFs). Higher dimensions are encountered in
discretized numerical models, such as the finite element method or FEM (e.g., [1]).
In the FEM, the solid structure is discretized in space with elements and nodes.
The DoFs of such a discretized model depend on the number of nodes, which can be
very large, often in the order of millions. Therefore, we form vectors with
millions of components or entries. In machine learning models, the features
and labels can be written in vector form.
In this chapter, we will not discuss much about the physical problems.
Instead, we discuss general aspects of a vector in the abstract, and issues of
computational operations that we may perform on the vector for a given
coordinate system. The number of DoFs of a vector is also referred to as its length.
A vector of length p has a shape denoted in Python as (p,).
x = nd.arange(15)   # re-created: consistent with the printed output below
print(x.T)          # transpose of x
print(x.T.shape)
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
<NDArray 15 @cpu(0)>
(15,)
In mathematics, the transpose of a (column) vector becomes a
row vector, and vice versa. In MXNet NDArray, the transposed vector
is stored in the same place but marked as transposed, so that operations,
such as multiplication, can be performed properly. This saves the operations of
physically copying and moving the data, improving efficiency. To confirm
this, we just print out the addresses, as follows.
print(id(x))
print(id(x.T)) # Transpose of x
2212173184080
2212173184080
3.1.3 Matrices
A matrix refers to an object that has more than one dimension in its data
structure, with more than one component in each dimension. It can be
viewed as a stack of vectors of some length. It again can have a physical meaning
depending on the type of physics problem. For example, the stiffness (and
mass) matrix created based on a discretized numerical model for a solid structure,
such as in the FEM, has a two-dimensional (2D) structure. In each of the
dimensions, the number of components is the same as the number of DoFs. The
whole matrix is a kind of spatially distributed “stiffness” of the structure [1].
In machine learning models, the input data-points in the feature space and the
learning parameters in the hypothesis space may be written in matrix form.
Again, we will not discuss much about the physical problem here.
Instead, we discuss general aspects of a matrix in the abstract, and issues of
computational operations that we may perform on the matrix. Such an
abstract matrix can be represented as a multi-dimensional array, whose
shape in Python was defined in Chapter 2.
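A 3 × 5 NDArray is re-created here to be consistent with the printed output below:

A = nd.arange(15).reshape((3, 5))   # re-created from the printed values
print('A.T=', A.T)                  # transpose of A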
A.T=
[[ 0. 5. 10.]
[ 1. 6. 11.]
[ 2. 7. 12.]
[ 3. 8. 13.]
[ 4. 9. 14.]]
<NDArray 5x3 @cpu(0)>
print(A.shape, A.T.shape)
(3, 5) (5, 3)
It is found that, in MXNet, the transposed matrix has its own address.
3.1.4 Tensors
The term “tensor” requires some clarification. In mathematics and physics, a
tensor has a specific, well-defined meaning. It refers to structured data
(a single number, a vector, or a multi-dimensional matrix) that obeys
a certain tensor transformation rule under coordinate transformations.
Therefore, tensors form a very special group of structured data or objects, and
not all matrices can be called tensors; in fact, most of them cannot. So long
as the tensor transformation rules are obeyed, tensors can be classified by
order: scalars are 0th-order tensors, vectors are 1st-order tensors, 2D
matrices are 2nd-order tensors, and so on.
Having said that, in the machine learning (ML) community, any array with
dimension higher than 2 is called a tensor. It can be viewed as a stack of
matrices of the same shape. The ML tensor carries a meaning of big data
that needs to be structured in high dimensions, and it is used as a general
way of representing an array with an arbitrary dimension or an arbitrary
number of axes. ML tensors are convenient when dealing with, for example,
images, which can have 3D data structures, with axes corresponding to the
height, the width, and the three color (RGB) channels. In numpy, a tensor is
simply a multidimensional array.
Because there is usually no coordinate transformation performed in
machine learning, no confusion should arise in our discussion
in this book. From now onwards, we will call the ML tensor a tensor,
with the understanding that it may not obey the real-tensor transformation
rules and that we do not perform such transformations in machine learning
programming.
We now use nd.arange() and then reshape() to create a 3D nd-array.
X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)
X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[[12. 13. 14. 15.]
[16. 17. 18. 19.]
[20. 21. 22. 23.]]]
<NDArray 2x3x4 @cpu(0)>
A = nd.arange(8).reshape((2, 4))
B = nd.ones_like(A)*8 # get shape of A, assign uniform entries
print('A =', A, '\n B =', B)
print('A + B =', A + B, '\n A * B =', A * B)
A =
[[0. 1. 2. 3.]
[4. 5. 6. 7.]]
<NDArray 2x4 @cpu(0)>
B =
[[8. 8. 8. 8.]
[8. 8. 8. 8.]]
<NDArray 2x4 @cpu(0)>
A + B =
[[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
<NDArray 2x4 @cpu(0)>
A * B =
[[ 0. 8. 16. 24.]
[32. 40. 48. 56.]]
<NDArray 2x4 @cpu(0)>
x = nd.arange(5)
print(x)
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>
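The sum of all elements can be computed with nd.sum(), which returns a one-element NDArray; a sketch consistent with the output below:

print(nd.sum(x))   # 0+1+2+3+4 = 10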
[10.]
<NDArray 1 @cpu(0)>
X = nd.ones(15).reshape(3,5)
X
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
<NDArray 3x5 @cpu(0)>
a = nd.arange(5)
b = nd.ones_like(a) * 2 #This ensures the compatibility
print(f"a={a},a.shape={a.shape} \nb={b},b.shape={b.shape}")
printx('nd.dot(a, b)')
printx('nd.dot(a, b).shape')
print(f"np.dot(a, b)={np.dot(a.asnumpy(), b.asnumpy())}")
print(f"np.dot(a.T,b)={np.dot(a.asnumpy().T, b.asnumpy())}")
printx('np.dot(a.asnumpy(), b.asnumpy()).shape')
a=
[0. 1. 2. 3. 4.]
<NDArray 5 @cpu(0)>,a.shape=(5,)
b=
[2. 2. 2. 2. 2.]
<NDArray 5 @cpu(0)>,b.shape=(5,)
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b).shape = (1,)
np.dot(a, b)=20.0
np.dot(a.T,b)=20.0
np.dot(a.asnumpy(), b.asnumpy()).shape = ()
Note that applying transpose() to a 1D vector has no effect, because
transpose() in numpy swaps the axes of a 2D array. A numpy 1D array
has a shape of (n,), and hence no action can be taken. A numpy 1D array is
not treated as a matrix, as discussed in Section 2.9.13. When b is a column
vector, a special case of a 2D array, it has two axes like a matrix. The
dot-product a · b is the same as the matrix-product ab (where a is defined as
a row vector and b a column vector), in terms of the resulting scalar value.
Thus, in our formulation, we do not distinguish them mathematically, and
we often use the following equality.
a · b = ab (3.1)
In numpy computations, however, there are some subtleties. The dot-product
of two (1D array) vectors gives a scalar, and the dot-product of a
(1D array) vector with a column vector gives a 1D array with the same
scalar as its sole element. In NDArray, such subtleties are not observed.
Readers may examine the following code carefully to make sense of this.
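A column-vector version of b is re-created here (a sketch consistent with the printed shape below):

b_c = b.reshape((5, 1))   # column vector: a special case of a 2D array
print(b_c, 'b_c.shape=', b_c.shape)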
[[2.]
[2.]
[2.]
[2.]
[2.]]
<NDArray 5x1 @cpu(0)> b_c.shape= (5, 1)
np.dot(a.asnumpy(),b.asnumpy()) = 20.0
np.dot(a.asnumpy(),b_c.asnumpy()) = array([20.], dtype=float32)
[20.]
nd.dot(a, b) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c) =
[20.]
<NDArray 1 @cpu(0)>
nd.dot(a, b_c).shape = (1,)
As seen, all these give the same scalar value, but in different data
structures.
The dot-product of two column vectors (special matrices) a and b of
equal length is written in linear algebra as aᵀb or bᵀa, which gives the same
scalar (but stored in a 2D array: a matrix with only one element).
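A column-vector version of a is re-created similarly (a sketch):

a_c = a.reshape((5, 1))   # column vector version of a
print(a_c)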
[[0.]
[1.]
[2.]
[3.]
[4.]]
<NDArray 5x1 @cpu(0)>
np.dot(a_c.asnumpy().T,b_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()) = array([[20.]], dtype=float32)
np.dot(b_c.asnumpy().T,a_c.asnumpy()).shape = (1, 1)
To access the scalar value in a 2D array of shape (1, 1), simply use:
print(np.dot(a_c.asnumpy().T,b_c.asnumpy())[0][0])
20.0
One may use flatten() to turn a column vector back into a 1D array (row vector).
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())')
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()) = array([20.], dtype=float32)
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy()).shape = (1,)
printx('np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0]')
np.dot(a_c.asnumpy().flatten(),b_c.asnumpy())[0] = 20.0
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()')
printx('np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0]')
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel() = array([20.], dtype=float32)
np.dot(a_c.asnumpy().T,b_c.asnumpy()).ravel()[0] = 20.0
a = np.arange(3)
b = np.ones(5) * 2
print(a, b)
print('np.outer=\n',np.outer(a, b))
[0 1 2] [2. 2. 2. 2. 2.]
np.outer=
[[0. 0. 0. 0. 0.]
[2. 2. 2. 2. 2.]
[4. 4. 4. 4. 4.]]
The same outer product can also be computed as a matrix product, in which case
a needs to be a column vector with shape (n, 1) and b needs to be a row vector
with shape (1, m). Readers may try this as an exercise. Note that although we may
get the same results, using the built-in np.outer() is recommended, because it is
usually much faster and does not need additional operations. This recommendation
applies to all other similar situations.
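Matrix-vector products work similarly. The arrays below are re-created to be consistent with the printed outputs:

A35 = nd.arange(15).reshape((3, 5))   # re-created from the printed values
b5 = nd.ones(5)
print(A35, A35.shape)
print(b5, b5.shape)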
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[10. 11. 12. 13. 14.]]
<NDArray 3x5 @cpu(0)> (3, 5)
[1. 1. 1. 1. 1.]
<NDArray 5 @cpu(0)> (5,)
d3 = nd.ones(A35.shape[0])
nd.dot(d3,A35) # this works:[3]X[3,5]-> vector of length 5
[15. 18. 21. 24. 27.]
<NDArray 5 @cpu(0)>
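Matrix-matrix products can also be computed with nd.dot(); a sketch consistent with the output below:

A23 = nd.ones((2, 3))      # re-created: consistent with the all-3 result
B35 = nd.ones((3, 5))
print(nd.dot(A23, B35))    # [2,3] x [3,5] -> [2,5]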
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
<NDArray 2x5 @cpu(0)>
import numpy as np
print('np.dot():\n',np.dot(A23.asnumpy(),B35.asnumpy()))
print('numpy @ operator:\n',A23.asnumpy() @ B35.asnumpy())
np.dot():
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
numpy @ operator:
[[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
3.1.10 Norms
A norm measures how “big” a vector or matrix is. There
are several types of norms, but they all produce a non-negative value for
the measure. The most often used (and the default) one is the L2-norm: the
square root of the sum of the squared elements in the vector, matrix, or
tensor. For matrices, it is often called the Frobenius norm. The computation
is done by calling a norm() function, as sketched below.
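A sketch consistent with the outputs below, computing row-wise L2-norms with both MXNet and numpy (the axis argument is an assumption):

c = nd.ones((2, 3))
print(nd.norm(c, axis=1))                    # L2-norm of each row: sqrt(3)
print(np.linalg.norm(c.asnumpy(), axis=1))   # numpy equivalent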
[1.7320508 1.7320508]
<NDArray 2 @cpu(0)>
[1.7320508 1.7320508]
The L1-norm of a vector is the sum of the absolute values of its elements.
The L1-norm of a matrix can be defined as the maximum of the
L1-norms of the column vectors of the matrix. For computing the L1-norm,
we use the following:
print(np.linalg.norm(A23.asnumpy(),1)) # np.linalg.norm()
# for matrix
2.0
KD = F (3.2)
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
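The system below is re-created from the printed values and solved with numpy.linalg.solve():

K = np.array([[1.5, 1.0],
              [1.5, 2.0]])    # coefficient matrix (re-created)
F = np.array([1, 1])          # right-hand side vector
D = np.linalg.solve(K, F)     # solve KD = F
print('K:', K); print('F:', F); print('D:', D)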
K: [[ 1.500 1.000]
[ 1.500 2.000]]
F: [1 1]
D: [ 0.667 0.000]
If one looks carefully, one should see that the input F is a 1D numpy array, and the
result is also a 1D array, which is not the convention of linear algebra, as
discussed in Section 2.9.13. One can also purposely define a column vector
(a 2D array with only one column) following the convention of linear algebra,
and get the solution. In this case, the returned solution will be the
same, but in a column vector. Readers may try this as an exercise.
Note that solving linear algebraic system equations numerically can be
very time consuming and expensive, especially for large systems. With
the development of computer hardware and software in the past decades,
numerical algorithms for solving linear algebraic systems are well developed.
The most effective solver for very large systems uses iterative methods.
It converts the problem of solving algebraic equations to a minimization
problem with a properly defined error residual function as a cost or
loss function. A gradient-based algorithm, such as the conjugate gradient
methods and Krylov methods, can be used to minimize the residual error.
These methods are essentially the same as those used in machine learning.
numpy.linalg.solve uses routines from the widely used and efficient Linear
Algebra PACKage (LAPACK).
For a matrix that is not square, we shall use the least-square solvers
for a best solution with minimized least-square error. The function in
numpy is numpy.linalg.lstsq(). We will see examples later when discussing
interpolations.
D = K⁻¹F (3.3)
where K⁻¹ is the inverse of K. Therefore, if one can compute K⁻¹, the
solution is simply a matrix-vector product. Indeed, for small systems, this
approach does work, and is used by many. We now use numpy.linalg.inv()
to compute the inverse of a matrix.
Kinv = np.linalg.inv(K)   # re-created: the inverse of K
D = np.dot(Kinv, F)
print('D:', D)
D: [ 0.667 0.000]
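numpy.linalg.inv() can also invert a stack of matrices in one call; the sketch below is re-created from the output fragments that follow:

Ks = np.array([[[1., 2.], [3., 4.]],
               [[1., 3.], [3., 5.]]])   # two 2x2 matrices (the first is re-created from its printed inverse)
print(Ks)
print(np.linalg.inv(Ks))                # inverse of each matrix in the stack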
[[ 1.000 3.000]
[ 3.000 5.000]]]
array([[[-2.000, 1.000],
[ 1.500, -0.500]],
[[-1.250, 0.750],
[ 0.750, -0.250]]])
A = VΛVᵀ (3.4)
VᵀV = I (3.5)
V⁻¹ = Vᵀ (3.6)
Because the matrix is positive definite (PD), its inverse exists and all the eigenvalues are
positive. The inverse of the diagonal matrix Λ is simply
the same diagonal matrix with the diagonal terms replaced by the reciprocals of
the eigenvalues.
Eigenvalue decomposition can be viewed as a special case of SVD. For
general matrices that we often encounter in machine learning, the SVD is
more widely used for matrix decomposition and will be discussed later in
this chapter, because it exists for all matrices.
In this section, let us see an example of how the eigenvalues and the
corresponding eigenvectors can be computed in Numpy.
import numpy as np
from numpy import linalg as lg # import linalg module
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) # Identity matrix
print('A=',A)
e, v = lg.eig(A)
print('Eigenvalues:',e)
print('Eigenvectors:\n',v)
A= [[1 0 0]
[0 1 0]
[0 0 1]]
Eigenvalues: [ 1.000 1.000 1.000]
Eigenvectors:
[[ 1.000 0.000 0.000]
[ 0.000 1.000 0.000]
[ 0.000 0.000 1.000]]
It is clearly seen that the identity matrix has three eigenvalues all of 1, and
their corresponding eigenvectors are three linearly independent unit vectors.
Let us look at a more general symmetric matrix.
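The matrix and the computation are re-created here, consistent with the outputs below:

A = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.0, 0.5],
              [0.0, 0.5, 1.0]])   # re-created from the printed matrix
print('A:\n', A)
e, v = lg.eig(A)
print('Eigenvalues:', e)
print('Eigenvectors:\n', v)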
A:
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
Eigenvalues: [ 1.539 1.000 0.461]
Eigenvectors:
[[-0.263 0.928 0.263]
[-0.707 -0.000 -0.707]
[-0.657 -0.371 0.657]]
print(np.dot(v,v.T))
[[ 1.000 0.000 0.000]
[ 0.000 1.000 -0.000]
[ 0.000 -0.000 1.000]]
This means that these three eigenvectors are mutually orthogonal, and the
dot-product of each eigenvector with itself is unity: they are orthonormal. We are
now ready to recover the original matrix A using these eigenvalues and eigenvectors.
print(A)
lamd = np.eye(3)*e
A_recovered = v@lamd@v.T
print(A_recovered)
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
[[ 1.000 0.200 0.000]
[ 0.200 1.000 0.500]
[ 0.000 0.500 1.000]]
lamd_inv = np.eye(3)/e
A_inv = v@lamd_inv@v.T
print(A_inv)
[[ 1.056 -0.282 0.141]
[-0.282 1.408 -0.704]
[ 0.141 -0.704 1.352]]
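For an asymmetric real matrix, the eigenvalues can be complex. The original matrix is not reproduced in this extract; the following is a hypothetical example exhibiting the same behavior:

B = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 2.]])   # hypothetical asymmetric matrix
e, v = lg.eig(B)
print('Eigenvalues:', e)       # one real eigenvalue, plus a complex-conjugate pair
print('Eigenvectors:\n', v)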
We now see one real eigenvalue, while the other two eigenvalues are
complex valued. These two complex eigenvalues are conjugates of each other.
Similar observations are made for the eigenvectors. We conclude that a real
asymmetric matrix can have complex eigenvalues and eigenvectors. Complex
valued matrices shall in general have complex eigenvalues and eigenvectors.
A special class of complex valued matrices called Hermitian matrices (self-adjoint
matrices) has real eigenvalues. This example shows that the complex
space is algebraically closed, but the real space is not. An n by n real matrix
should have n eigenvalues (and eigenvectors), but they may not all be in the
real space. Some of them get into the complex space (which contains the real space
as a special case).
The condition number of a matrix measures its level of singularity; the condition
number for a matrix with the highest level of singularity (a singular matrix) is
infinite. Let us see some examples. When the largest eigenvalue of a matrix is
very large or its smallest eigenvalue is very small, the matrix is likely
near-singular, depending on their ratio.
This finding implies that a normalization of a matrix (which is often
done in machine learning) will not, in theory, change its condition number.
It may, however, help in reducing the loss of significant digits (because of the
limited representation of floats in computer hardware).
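The rank computations discussed below can be done with np.linalg.matrix_rank(); a sketch (the singular matrix is hypothetical):

I4 = np.eye(4)
print(np.linalg.matrix_rank(I4))    # 4: full rank

As = np.array([[1., 2., 3.],
               [2., 4., 6.],        # hypothetical: second row = 2 x first row
               [1., 0., 1.]])
print(np.linalg.matrix_rank(As))    # 2: rank deficiency of 1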
It is seen that the identity matrix has a shape of 4 × 4. It has a full rank.
This singular matrix has two linearly independent columns, and hence
a rank of 2, i.e., a rank deficiency of 1. Thus, it should also have a zero
eigenvalue, as shown below. If a matrix has a rank deficiency of n, it shall
have n zero eigenvalues. This is easily checked using NumPy.
The matrix has only two rows, and a rank of 2. It has a full rank.
d = [u v] (3.9)
import numpy as np
theta = 45 # Degree
thetarad = np.deg2rad(theta)
c, s= np.cos(thetarad), np.sin(thetarad)
T = np.array([[c, -s],
[s, c]])
print('Transformation matrix T:\n',T)
Transformation matrix T:
[[ 0.707 -0.707]
[ 0.707 0.707]]
d = np.array([1.0, 0.0])   # re-created: a unit vector along x, consistent with the result below
T @ d                      # rotated by theta
array([ 0.707, 0.707])
T @ T # 2 theta rotations
array([[ 0.000, -1.000],
[ 1.000, -0.000]])
3.3 Interpolation
# Data available
xn = [1, 2, 3] # data: given coordinates x
fn = [3, 2, 0] # data: given function values at x
# Query/Prediction f at a new location of x
x = 1.5
f = np.interp(x, xn, fn) # get approximated value at x
print(f'f({x:.3f})≈{f:.3f}')
f(1.500)≈2.500
y = wx + b (3.10)
We shall determine the gradient w and the bias b using the data pairs [xi, yi]. In
this example, Eq. (3.10) can be rewritten as
y = X · w (3.11)
where X = [x, 1] and w = [w, b]. Now, we can use np.linalg.lstsq() to solve
for w:
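The data setup is re-created here (a sketch; the sample locations and the true parameters are assumptions consistent with the printed values):

x = np.array([0., 1., 2., 3.])           # assumed sample locations
X = np.vstack([x, np.ones(len(x))]).T    # X = [x, 1]
w_true, b_true = 1.0, -1.0               # assumed true gradient and bias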
y = w_true*x+b_true+np.random.rand(len(x))/1.0
# generate y data random noise added
print(y)
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
w, b
[-0.686 0.754 1.643 2.702]
(1.1055249814646126, -0.5550392342429096)
import numpy as np
from scipy import interpolate
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
x0, xL = 0, 18
x = np.linspace(x0, xL, num=11, endpoint=True)
# x data points
y = np.sin(-x**3/8.0) # y data points
print('x.shape:',x.shape,'y.shape:',y.shape)
f = interp1d(x, y) # linear interpolation
f2 = interp1d(x, y, kind='cubic')   # cubic interpolation
                                    # try also kind='quadratic'
xnew = np.linspace(x0,xL,num=41,endpoint=True)
# x prediction points
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
x.shape: (11,) y.shape: (11,)
Radial basis functions (RBFs) are useful basis functions for approximation
of functions. RBFs are distance functions, and hence work well for irregular
grids (even randomly distributed points), in high dimensions, and are often
found less prone to overfitting. They are also used for constructing meshfree
methods [2]. In using Scipy, the choices of RBFs are as follows:
• “multiquadric”: sqrt((r/self.epsilon)**2 + 1)
• “inverse”: 1.0/sqrt((r/self.epsilon)**2 + 1)
• “gaussian”: exp(-(r/self.epsilon)**2)
• “linear”: r
• “cubic”: r**3
• “quintic”: r**5
• “thin plate”: r**2 * log(r).
import numpy as np
from scipy.interpolate import Rbf, InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
# Generate data
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
x = np.linspace(0, 10, 9)
print('x=',x)
y = np.sin(x)
print('y=',y)
# fine grids for plotting the interpolated data
xi = np.linspace(0, 10, 101)
# use fitpack2 method
ius=InterpolatedUnivariateSpline(x,y) # interpolation
# function
yi = ius(xi) # interpolated values at fine grids
plt.subplot(2, 1, 1) # have 2 sub-plots plotted together
plt.plot(x, y, 'bo') # original data points in blue dots
plt.plot(xi, np.sin(xi), 'r') # original function, red line
Figure 3.6: Comparison of interpolation using spline and radial basis function (RBF).
import numpy as np
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
from matplotlib import cm
A = UΣV* (3.12)
where * stands for the Hermitian transpose (the transpose of the matrix with
complex conjugation of its entries).
• U is an m × m unitary matrix.
• Σ is an m × p rectangular diagonal matrix with non-negative real numbers
on the diagonal entries.
• V is a p × p unitary matrix.
• The diagonal entries σi in Σ are known as the singular values of A.
• The columns of U are called the left-singular vectors of A.
• The columns of V are called the right-singular vectors of A.
More detailed discussions on SVD can be found at Wikipedia (https://
en.wikipedia.org/wiki/Singular value decomposition).
B = AᵀA (3.13)
B = VₑΛVₑᵀ (3.14)
Because matrix A has a rank of p, the singular values in Σ shall all be positive
real numbers. Using Eq. (3.15), we have
AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣUᵀUΣVᵀ = VΣ²Vᵀ = B (3.16)
AV = UΣ (3.19)
U = AVΣ⁻¹ (3.20)
import numpy as np
a = np.random.randn(3, 6) # matrix with random numbers
print(a)
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
u, s, vh = np.linalg.svd(a, full_matrices=True)
print(u.shape, s.shape, vh.shape)
(3, 3) (3,) (6, 6)
print('u=',u,'\n','s=',s,'\n','vh=',vh)
u= [[-0.677 0.283 0.679]
[ 0.202 0.959 -0.198]
[-0.707 0.003 -0.707]]
s= [ 3.412 2.466 1.881]
vh= [[ 0.589 -0.318 -0.544 0.048 -0.158 0.479]
[-0.083 0.216 -0.652 0.227 0.606 -0.318]
[-0.128 -0.382 -0.180 -0.867 0.166 -0.159]
[-0.510 -0.736 -0.110 0.412 -0.115 -0.066]
[ 0.300 -0.343 0.476 0.115 0.723 0.171]
[-0.529 0.218 -0.083 -0.104 0.210 0.781]]
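To verify the decomposition, the rectangular diagonal matrix Σ is rebuilt from s (a sketch re-created for the check below):

smat = np.zeros((3, 6))       # rectangular diagonal matrix (Sigma)
smat[:3, :3] = np.diag(s)     # place the singular values on the diagonal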
print(a)
print(np.dot(u, np.dot(smat, vh)))
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
[[-1.582 0.398 0.572 -1.060 1.000 -1.532]
[ 0.257 0.435 -1.851 0.894 1.263 -0.364]
[-1.251 1.276 1.548 1.039 0.165 -0.946]]
We note here that the SVD of a matrix keeps the full information of
the matrix: using all these singular values and vectors, one can recover the
original matrix. What if one uses only some of these singular values (and the
corresponding singular vectors)?
We can use SVD to compress data, by discarding some (often many) of these
singular values and vectors of the data matrix. The following is an example
of compressing an m × n array of image data.
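A sketch of such a compression, consistent with the shape printouts below (the image file name is hypothetical):

A_img = plt.imread('photo.jpg')    # hypothetical image file
print(A_img.shape)                 # e.g., (405, 349, 3)
A = A_img.mean(axis=2)             # convert RGB to a grayscale matrix
r = len(A[:, 0]) / len(A[1])
print(r, len(A[1]), A.shape, A.size)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)
k = 20                             # keep only k singular values
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
plt.imshow(Ak, cmap='gray')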
(405, 349, 3)
1.160458452722063 349 (405, 349) 141345
(405, 405) (349,) (349, 349)
<matplotlib.image.AxesImage at 0x203185a7080>
Figure 3.8: Reproduced image using compressed data in comparison with the original
image.
It is clear that when k = 20 (out of 349) singular values are used, the
reconstructed image is quite close to the original one. Readers may estimate
how much the storage can be saved if one keeps 10% of the singular values
(and the corresponding singular vectors), assuming the reduced quality is
acceptable.
B = VΛVᵀ (3.22)
A_PCA = AV (3.23)
One may reconstruct A using the following formula, if using all the
eigenvectors:
A_r = A_PCA Vᵀ = AVVᵀ = A (3.24)
This is because the eigenvectors are orthonormal. It is often the case that the first
few eigenvectors (ranked by eigenvalue in descending order) contain most of the
overall information of the original matrix A. In this case, we can use only a small
number of eigenvectors to reconstruct the A matrix. For example, if we use
k ≪ p eigenvectors, we have
A_r = A_PCA[0:m, 0:k] Vᵀ[0:k, 0:p]
    = A[0:m, 0:p] V[0:p, 0:k] Vᵀ[0:k, 0:p] ≈ A (3.25)
This will, in general, not equal the original A, but it can often be very close
to it. In this case, the storage becomes m × k + k × p, which can be much
smaller than the original size of m × p. In Eq. (3.25), we used the Python
slicing syntax, and hence it is very close to that in the Python code.
Note that if matrix A has dimensions with m < p, we simply treat its
transpose in the same way as mentioned above.
One can also perform a similar analysis by forming a normal matrix B
using the following equation instead:
B = AAᵀ (3.26)
The reconstruction has the same shape as the original A, that is, m × p. One may
reconstruct A using a similar formula with all the (orthonormal) eigenvectors of this B.
Note that for large systems, we do not really form the normal matrix
B, perform eigenvalue decomposition, and then compute V numerically.
Instead, the QR transformation type of algorithms are used. This is because
of the instability reasons mentioned in the beginning of Section 3.4.2.
We show an example of a PCA code whose core is only three lines. It is from glowingpython
(https://glowingpython.blogspot.com/2011/07/principal-component-analysis-with-numpy.html),
with permission. It is inspired by the function princomp of MATLAB's statistics toolbox,
and is quite easy to follow. We modified the code to exactly follow the PCA formulation
presented above.
import numpy as np
from pylab import plot, subplot, axis, show, figure

def princomp(A):
    """ PCA on matrix A. Rows: m observations; columns:
        p variables. A will be zero-centered and normalized.
    Returns:
      coeff: eigenvectors of A^T A. Row-reduced observations;
             each column is for one principal component.
      score: the principal-component representation of A in
             the principal component space. Row-observations,
             column-components.
      latent: a vector with the eigenvalues of A^T A.
    """
    # eigenvalues and eigenvectors of the covariance matrix
    # modified. It was:
    #   M = (A-np.mean(A.T,axis=1))
    #   [latent,coeff] = np.linalg.eig(np.cov(M))
    #   score = np.dot(coeff.T,M)
    A = (A - np.array([np.mean(A, axis=0)]))   # subtract the mean
    [latent, coeff] = np.linalg.eig(np.dot(A.T, A))
    score = np.dot(A, coeff)                   # projection on the new space
    return coeff, score, latent
# A simple 2D dataset
np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
Data = np.array([[2.4,0.7,2.9,2.5,2.2,3.0,2.7,1.6,1.8,1.1,
1.6,0.9],
[2.5,0.5,2.2,1.9,1.9,3.1,2.3,2.0,1.4,1.0,
1.5,1.1]])
A = Data.T # Note: transpose to have A with m>p
print('A.T:\n',Data)
coeff, score, latent = princomp(A) # change made. It was A.T
print('p-by-p matrix, eig-vectors of A:\n',coeff)
print('A.T in the principal component space:\n',score.T)
print('Eigenvalues of A, latent=\n',latent)
figure(figsize=(50,80))
figure()
subplot(121)
# every eigenvector describe the direction of a principal
# component.
m = np.mean(A,axis=0)
plot([0,-coeff[0,0]*2]+m[0], [0,-coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(Data[0,:],Data[1,:],'ob') # the data points
axis('equal')
subplot(122)
# New data produced using the scores
plot(score.T[0,:],score.T[1,:],'*g') # Note: transpose back
axis('equal')
show()
A.T:
[[ 2.40 0.70 2.90 2.50 2.20 3.00 2.70 1.60 1.80
1.10 1.60 0.90]
[ 2.50 0.50 2.20 1.90 1.90 3.10 2.30 2.00 1.40 1.00
1.50 1.10]]
p-by-p matrix, eig-vectors of A:
[[ 0.74 -0.67]
[ 0.67 0.74]]
A.T in the principal component space:
[[ 0.82 -1.79 0.98 0.49 0.26 1.66 0.90 -0.11 -0.37
-1.16 -0.45 -1.24]
[ 0.23 -0.11 -0.33 -0.28 -0.08 0.27 -0.12 0.40 -0.18 -0.01
0.03 0.20]]
Eigenvalues of A, latent=
[ 11.93 0.58]
import numpy as np
from pylab import figure, subplot, imshow, title, gray, show   # plotting helpers
from matplotlib.ticker import NullLocator

def princomp(A, numpc=0):
    # computing eigenvalues and eigenvectors of the covariance matrix of A
    A = (A - np.array([np.mean(A, axis=0)]))
    # subtract the mean (along columns)
    [latent, coeff] = np.linalg.eig(np.dot(A.T, A))
    # was: A = (A-np.mean(A.T,axis=1)).T  # subtract the mean
    # was: [latent,coeff] = np.linalg.eig(np.cov(M))
    p = np.size(coeff, axis=1)
    idx = np.argsort(latent)     # sorting the eigenvalues
    idx = idx[::-1]              # in descending order
    # sorting eigenvectors according to the eigenvalues
    coeff = coeff[:, idx]
    latent = latent[idx]         # sorting eigenvalues
    if 0 <= numpc < p:
        coeff = coeff[:, range(numpc)]   # cutting some PCs
    score = np.dot(A, coeff)     # re-created: projection, as in the first version
    return coeff, score, latent  # re-created return (lost in extraction)

# A below is the grayscale image array (m-by-n matrix) loaded earlier
full_pc = np.size(A, axis=1)     # number of all the principal components
r = len(A[:, 0]) / len(A[1])
print(r, len(A[1]), A.shape, A.size)
i = 1
dist = []
figure(figsize=(11, 11*r))
for numpc in range(0, full_pc+10, 50):   # 0, 50, 100, ..., full_pc
    coeff, score, latent = princomp(A, numpc)
    print(numpc, 'coeff, score, latent \n',
          coeff.shape, score.shape, latent.shape)
    Ar = np.dot(score, coeff.T) + np.mean(A, axis=0)
    # was: Ar = np.dot(coeff,score).T + np.mean(A,axis=0)
    # difference in the Frobenius norm
    dist.append(np.linalg.norm(A - Ar, 'fro'))
    # showing the pics reconstructed with fewer PCs
    if numpc <= 250:
        ax = subplot(2, 3, i, frame_on=False)
        ax.xaxis.set_major_locator(NullLocator())
        ax.yaxis.set_major_locator(NullLocator())
        i += 1
        imshow(Ar)    # imshow(np.flipud(Ar))
        title('PCs # ' + str(numpc))
        gray()
figure()
imshow(A)    # imshow(np.flipud(A))
title('numpc FULL: ' + str(len(A[1])))
gray()
show()
Figure 3.10: Images reconstructed using reduced PCA components, in comparison with
the original image.
We can see that 50 principal components give a pretty good quality image,
compared to the original one.
To assess the quality of the reconstruction quantitatively, we compute
the distance of the reconstructed images from the original one in the
Frobenius norm, for a different number of eigenvalues/eigenvectors used in
the reconstruction. The results are plotted in Fig. 3.11, with the x-axis for
the number of eigenvalues/eigenvectors used. The sum of the eigenvalues is
plotted in the blue curve, and the Frobenius norm is plotted in the red curve.
The sum of the eigenvalues relates to the level of variance contribution.
Figure 3.12: The Newton Iteration: The function is shown in blue and the tangent line at
local xi is in red. We see that xi gets closer and closer to the root of the function when the
number of iterations i increases (https://en.wikipedia.org/wiki/Newton%27s method#/
media/File:NewtonIteration Ani.gif) under the CC BY-SA 3.0. (https://creativecommons.
org/licenses/by-sa/3.0/) license.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
def f(x):
return 9*x**3-8*x**2-7*x+6 # define a polynomial function
a, b, n = -1., 2, 400 # large n for plotting the curve
x = np.linspace(a, b, n) # x at n points in [a,b]
y = f(x) # compute the function values
Plot the function curve and the trapezoidal (shaded) areas below it.
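A sketch of this setup (the number of trapezoid sample points is an assumption): the reference value is computed with scipy.integrate.quad, and the trapezoid approximation with np.trapz().

from scipy.integrate import quad

integral, error = quad(f, a, b)              # reference result with an error estimate
xint = np.linspace(a, b, 9)                  # assumed trapezoid sample points
yint = f(xint)
integral_trapezoid = np.trapz(yint, xint)    # trapezoid approximation
plt.plot(x, y, lw=2)
plt.fill_between(xint, 0, yint, color='gray', alpha=0.4)   # shaded trapezoids
plt.show()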
from scipy.special import p_roots   # roots/weights of the Legendre polynomials

def gauss(f, n, a, b):
    [x, w] = p_roots(n+1)   # roots of the Legendre polynomial and the weights
    G = 0.5*(b-a)*sum(w*f(0.5*(b-a)*x + 0.5*(b+a)))
    # map to natural coordinates, sample the function values
    # at these roots, and form the weighted sum
    return G
def my_f(x):
    return 9*x**3 - 8*x**2 - 7*x + 6   # define a polynomial function

ng = 2
integral_Gauss = gauss(my_f, ng, a, b)
print("The results should be:", integral, "+/-", error)
print("The results by the trapezoid approximation with",
      len(xint), "points is:", integral_trapezoid)
print("The results by the Gauss integration with", ng,
      "Gauss points is:", integral_Gauss)
Finally, let us introduce techniques often used for the initial treatment of
datasets. Consider a given training dataset X ∈ Xm×p. In machine learning
models, m is the number of data-points in the dataset, and p is the number of
feature variables. The values of the data are often in a wide range for real-life
problems. For numerical stability reasons, we usually perform normalization
on the given dataset before feeding it to a model. There are mainly two
techniques: min-max feature scaling and standard scaling. Such a
scaling or normalization is also called a transformation in many ML modules.
Xscaled = (X − X.min(axis=0)) / (X.max(axis=0) − X.min(axis=0)) (3.32)
where X.min and X.max will be (row) vectors, and we used the Python
syntax of broadcasting rules and element-wise divisions. This brings
all values of each feature into the [0, 1] range. A more generalized formula that
can bring these values into an arbitrary range [a, b] is given as follows.
Xscaled = a + (b − a) · (X − X.min(axis=0)) / (X.max(axis=0) − X.min(axis=0)) (3.33)
Here, we again used the Python syntax so that scalars, vectors, and matrices
all appear in the same formula.
Once such a scaling transformation of the training dataset is done, X.min
and X.max can be used to perform exactly the same transformation on the
testing dataset, to ensure consistency for proper predictions.
The following is a simple code to perform min-max scaling using
Eq.(3.32).
np.set_printoptions(precision=4)
X = [[-1, 2, 8], # an assumed toy training dataset
[2.5, 6, 1.5], # with 4 samples, and 3 features
[3, 11, -6],
[21, 7, 2]]
print(f"Original training dataset X:\n{X}")
X = np.array(X)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(f"Scaled training dataset X:\n{X_scaled}")
print(f"Maximum values for each feature:\n{X.max(axis=0)}")
print(f"Minimum values for each feature:\n{X.min(axis=0)}")
[[0.     0.     1.    ]
 [0.1591 0.4444 0.5357]
 [0.1818 1.     0.    ]
 [1.     0.5556 0.5714]]
We can now perform the same transformation on the testing dataset, using
X.min and X.max of the training dataset.
It is clearly seen that the min-max scaling does no harm to the dataset.
One can get it back as needed.
The same min-max scaling can be done using Sklearn.
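A sketch using sklearn's MinMaxScaler (the testing dataset Xt is hypothetical):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()               # default feature range [0, 1]
X_scaled = scaler.fit_transform(X)    # fit on the training dataset
Xt = np.array([[1., 4., 0.],          # hypothetical testing dataset
               [10., 8., 5.]])
Xt_scaled = scaler.transform(Xt)      # the same transformation applied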
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
It is seen that the min-max scaling has not changed anything in the
one-hot dataset, as expected: one-hot entries carry no value significance.
Note that the values are not confined to [−1, 1]; they follow a (standardized)
normal-like distribution. We can now perform the same transformation on the
corresponding testing dataset, using the instance fitted on the training dataset.
It is clearly seen that the standard scaling does no harm to the dataset.
One can get it back as needed.
The same standard scaling can be done using Sklearn.
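A sketch using sklearn's StandardScaler, consistent with the printed means and standard deviations below:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # zero mean and unit variance per feature
Xt_scaled = scaler.transform(Xt)      # transform the testing dataset too
print(f"Scaled dataset X:\n{X_scaled}")
print(f"Mean values for each feature:\n{scaler.mean_}")
print(f"Standard deviations for each feature:\n{scaler.scale_}")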
Scaled dataset X:
[[-0.8592 -1.4056 1.3338]
[-0.4515 -0.1562 0.0252]
[-0.3932 1.4056 -1.4848]
[ 1.7039 0.1562 0.1258]]
Mean values for each feature:
[6.375 6.5 1.375]
Standard deviations for each feature:
[8.5832 3.2016 4.9671]
X_back = scaler.inverse_transform(X_scaled)
print(f"Back transformed training dataset:\n{X_back}")
Xt_back = scaler.inverse_transform(Xt_scaled)
print(f"\nBack transformed testing dataset Xtest:\n{Xt_back}")
Note that the same scaling can be done for the labels in the training
dataset, if they are not probability-distribution types of data. When performing
testing on the trained model, or making predictions with it, the labels
should be scaled back to the original data units.
Also, it is good practice to take a look at the distribution of the data-points.
This is usually done after scaling, so that the region of the data-points
is normalized. One may simply plot the so-called kernel density estimation
(KDE) using, for example, seaborn.kdeplot().
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-
Heinemann, 2013. London.
[2] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[3] K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical
Magazine, 2(1), 559–572, 1901.
[4] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., Tubenet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[5] Shuyong Duan, Zhiping Hou, G.R. Liu et al., A novel inverse procedure via creating
tubenet with constraint autoencoder for feature-space dimension-reduction, Interna-
tional Journal of Applied Mechanics, 13(08), 2150091, 2021.
Chapter 4
Statistics and Probability-based Learning Model
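The following sketch is re-created to be consistent with the outputs below: two unseeded draws differ, while two draws with the same seed repeat.

import random

print(*[random.randint(0, 100) for _ in range(5)])   # unseeded draw
print(*[random.randint(0, 100) for _ in range(5)])   # another unseeded draw
random.seed(1)
print(*[random.randint(0, 100) for _ in range(5)])   # seeded draw
random.seed(1)
print(*[random.randint(0, 100) for _ in range(5)])   # same seed: same numbers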
86 61 35 81 40
4 92 95 31 70
18 73 98 9 33
18 73 98 9 33
We see now that the same set of numbers is generated: a kind of controlled
random sampling via a seed value. The use of random.seed() may confuse many
beginners, but the above example should eliminate the confusion. random.seed()
is used to ensure repeatability when one reruns the code, which is important
for reproducible code development. We will use it quite frequently.
Also, we see that random numbers generated by a computer
are not entirely random and are controllable to a certain degree. Naturally,
this should be the case, because any (classical) computer is deterministic in nature.
This pseudo-random feature is useful: when we study a probability event,
we make use of the randomness of random.randint() or random.random();
when we want our study and code to be repeatable, we make use of
random.seed().
Note that the seed value of 1 can be changed to any other number; with a
different seed value, a different set of random numbers is generated.
Let us now generate real numbers.
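A sketch using random.random(), which returns uniform floats in [0, 1):

for _ in range(5):
    print(random.random())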
0.11791870367106105
0.7609624449125756
0.47224524357611664
0.37961522332372777
0.20995480637147712
4.1.2 Probability
Probability is a numerical measure of the likelihood of the occurrence of an
event or a prediction. Assume, for example, that the probability of the failure of
a structure is 0.1. We can then denote it mathematically as
Pr(failure) = 0.1
In this case, there is only one random variable that takes two possible discrete
values: “yes” with probability of 0.1, and “no” with probability of 0.9. Such
a distribution of a random variable is known as the Bernoulli distribution.
For general events, there may be more possible discrete values, as well as
random variables with continuous distributions. Statistics studies the
techniques for sampling, interpreting, and analyzing the data about an event.
Machine learning is based on a dataset available for an event, and thus
statistical analysis helps us make sense of a dataset and, hopefully, produce
a prediction in terms of probability.
We use Python to perform statistical analysis on datasets. We first
import the necessary packages, including the MXNet packages (https://gluon.
mxnet.io/).
Let us consider a simple event: tossing a die that has six identical surfaces,
each of which is marked with a unique digit from 1 to 6. In this
case, the random variable can take 6 possible discrete values. Assume that
such markings do not introduce any bias (a fair die), and do not affect in any
way the outcome of a toss. We want to know the probability of getting
a particular number on the top surface after a number of tosses. One can
then perform “numerical” experiments: tossing the die a large number of
times virtually in a computer and counting the times each number shows
on the top surface. We use the following code to do this:
from mxnet import nd   # re-created import (the original import cell is not shown)
pr = nd.ones(6) / 6    # re-created: equal probabilities for a fair die
print(pr)
n_top_array = nd.sample_multinomial(pr, shape=(1))
# toss once using the sample_multinomial() function
print('The number on top surface =', n_top_array)
For this problem, we know (by assumption) that the theoretical or “true”
probability for a number showing on the top surface is 1/6 ≈ 0.1667.
The one-time toss above gives an nd-array with just one entry: the
number on the top surface of the die. To obtain a probability, we shall toss
many times for the statistics to work. This is done by simply specifying the
length of the nd-array in the handy nd.sample_multinomial() function.
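A sketch re-created to be consistent with the output below:

n_t = 18
tosses = nd.sample_multinomial(pr, shape=(n_t))
print('Tossed', n_t, 'times.')
print('Toss results', tosses)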
Tossed 18 times.
Toss results
[3 0 0 3 1 4 4 5 3 0 0 2 2 2 4 2 4 4]
<NDArray 18 @cpu(0)>
n_t = 20
print(nd.sample_multinomial(pr, shape=(n_t))) #toss n_t times
[2 3 5 4 0 0 1 1 0 2 3 0 0 0 2 4 5 2 0 4]
<NDArray 20 @cpu(0)>
We see the probability values for all the 6 digits getting closer to the
theoretical or true probability.
import numpy as np
np.set_printoptions(suppress=True)
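The array record below holds, for each of the 6 digits, the cumulative count after each toss. A sketch of how it might be built (the original cell is not shown):

n_tosses = 2000
tosses = nd.sample_multinomial(pr, shape=(n_tosses)).asnumpy().astype(int)
counts = np.zeros((6, n_tosses))
c = np.zeros(6)
for i, t in enumerate(tosses):   # accumulate the counts toss by toss
    c[t] += 1
    counts[:, i] = c
record = nd.array(counts)        # cumulative counts as an NDArray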
print(record) # print out the records
[[ 0. 0. 0. ... 333. 334. 335.]
[ 0. 0. 1. ... 373. 373. 373.]
[ 0. 1. 1. ... 341. 341. 341.]
[ 1. 1. 1. ... 327. 327. 327.]
[ 0. 0. 0. ... 300. 300. 300.]
[ 0. 0. 0. ... 324. 324. 324.]]
<NDArray 6x2000 @cpu(0)>
x = nd.arange(n_tosses).reshape((1, n_tosses)) + 1
# print(x)
observations = record / x     # probabilities of the 6 digits after each toss
print(observations[:, 0])     # observations after the 1st toss
print(observations[:, 10])    # after the first 11 tosses
print(observations[:, 999])   # after the first 1,000 tosses
[0. 0. 0. 1. 0. 0.]
<NDArray 6 @cpu(0)>
This simple experiment gives us 1,000 observations for six possible values
of a uniform distribution (any of the 6 digits has an equal chance to land on
top). When the probability of the appearance of each of the six surfaces
of the die is computed after 1,000 tosses, we get roughly 0.14 to
0.19. These probabilities will change a little each time we do the experiment,
because of the random nature. If we were to do 10,000 tosses for each
experiment, we would get all probabilities quite close to the theoretical value
of 1/6 ≈ 0.1667. Readers can try this very easily using the code given above.
Let us now plot the “numerical” experimental results. For this, we use the
matplotlib library, as sketched below.
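A sketch of the plotting code (re-created; styling details are assumptions):

import matplotlib.pyplot as plt
for i in range(6):
    plt.plot(observations[i, :].asnumpy(), label='digit ' + str(i))
plt.axhline(1/6, color='black', linestyle='--')   # theoretical value
plt.xlabel('Number of tosses')
plt.ylabel('Estimated probability')
plt.legend()
plt.show()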
Figure 4.1: Probabilities obtained via finite sampling from a uniform distribution of a
fair die.
It is clear that the more experiments we do, the closer the probability gets
to the theoretical value of 1/6.
The above discussion is for the very simple event of a die toss. It gives a
clear view of some of the basic issues and procedures related to statistical
analysis and probability computation for complicated events.
Let us plot out the density function defined in Eq. (4.2). The bell shape of
the function may already be familiar to you.
x = np.arange(-0.5, 0.5, 0.001) # define variable x
def gf(mu,sigma,x): # define the Gauss function
return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-.5*((x-mu)/sigma)**2)
mu, sigma = 0, 0.1 # mean 0, standard deviation 0.1
plt.figure(figsize=(6, 4))
plt.plot(x, gf(mu,sigma,x))
plt.show()
where pᵢ is the probability of the ith possible value of the variable, and
Σᵢ pᵢ = 1. The vector p holds these probabilities.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
p = np.arange(0.01, 1.0, .01) # generate variables
logp = -np.log(p)
# negative sign for positive value: log(p)<0 for 0<p<1
plt.plot(p,logp,color='blue')
plt.xlabel('Probability p')
plt.ylabel('Negative log(p)')
plt.title('Negative log Value of Probability')
plt.show()
Let us mention a few important features, which are the root reasons why
the logarithm is used so often in machine learning.
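The arrays v1 and H_qf plotted below are re-created here: for a two-value event with probabilities v1 and v2 = 1 − v1, the entropy is −(v1 log v1 + v2 log v2).

v1 = np.arange(0.01, 1.0, 0.01)            # probability of the first value
v2 = 1.0 - v1                              # probability of the second value
H_qf = -(v1*np.log(v1) + v2*np.log(v2))    # entropy of the two-value event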
plt.plot(v1,H_qf)
plt.xlabel('Probability, v1 (v2=1-v1)')
plt.ylabel('Entropy of events')
plt.title('Entropy of Events')
plt.show()
N = 0
max_v = 100                 # events with N variables, capped at max_v values
Ni = np.array([])           # for the number of variables
H_qf = np.array([])         # to hold the entropy
while N < max_v:
    N += 1
    Ni = np.append(Ni, N)
    qf = np.ones(N)
    qf = qf/np.sum(qf)      # uniform distribution generated
    H_qf = np.append(H_qf, -np.dot(qf, np.log(qf))/len(qf))
print('Probability distribution:', qf[0:max_v:10])
print('H_qf=', H_qf[0:max_v:10])
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.plot(Ni,H_qf)
plt.xlabel('Number of variables, all with same probability')
plt.ylabel('Entropy')
plt.title('Events with variables of uniform distribution')
plt.show()
• When these two distributions are the same, the cross-entropy becomes the
entropy studied in the previous section, and all three inequalities above
become equalities.
• Therefore, in machine learning models, even if the prediction is perfect,
the cross-entropy will still not be zero, because the true distribution itself
may have an entropy. If Hp is the entropy of the true distribution, the cross-
entropy Hpq is bounded from below by Hp. It can only be zero if the true
distribution is without any uncertainty (the probabilities of the variables are
all zero, except for one of them, which is 1).
We examine a simple event with a variable that can take two possible values.
Assuming we have a quality prediction, how is it measured in cross-entropy?
It is seen that the cross-entropy Hpq is low, indicating that the prediction
q is good. Notice that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
It is seen that the cross-entropy Hpq is high, indicating that the prediction
q is bad. Notice again that Hpq ≠ Hqp, Hpq ≥ Hp, and Hqp ≥ Hq.
We are now ready to discuss the KL-divergence.
4.5 KL-Divergence
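The KL-divergence of a prediction q from a true distribution p is Dpq = Σi pi log(pi/qi). The book's code for this example is not reproduced here; a minimal sketch, with assumed two-value distributions, is:

import numpy as np

p = np.array([0.1, 0.9])      # assumed true distribution
q = np.array([0.2, 0.8])      # assumed (good) predicted distribution
Dpq = np.sum(p*np.log(p/q))   # KL-divergence of q from p
Dqp = np.sum(q*np.log(q/p))   # KL-divergence of p from q
print('Dpq =', Dpq, ' Dqp =', Dqp)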
It is seen that the KL-divergences Dpq and Dqp are both positive. They
both have low values, indicating that the prediction q is good. Notice that
Dpq ≠ Dqp.
It is seen again that the KL-divergences Dpq and Dqp are both positive.
They both have high values, indicating that the prediction q is poor. Notice
also that Dpq ≠ Dqp.
Consider a simple event with a variable that can take four possible values.
Assume that we have a good prediction of a distribution in relation to the
true or reference distribution. We examine how it is measured in the binary
cross-entropy using the following code:
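The code itself is not reproduced here. A minimal sketch of such a computation (the distributions and the per-variable normalization are assumptions, so the numbers need not match the book's printed outputs) is:

import numpy as np

def ce(p, q):     # (averaged) cross-entropy
    return -np.sum(p*np.log(q))/len(p)

def bce(p, q):    # binary cross-entropy: adds the "converse" terms
    return -np.sum(p*np.log(q) + (1.-p)*np.log(1.-q))/len(p)

p = np.array([1.0, 0.0, 0.0, 0.0])     # assumed true distribution
q = np.array([0.9, 0.05, 0.03, 0.02])  # assumed good prediction
print('Cross-entropy cHpq:', ce(p, q))
print('Binary cross-entropy bcHpq:', bce(p, q))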
Consider an event with a variable that can take four possible values. Assume
that we have a poor prediction of a distribution in relation to the true
distribution. We examine again how it is measured in the binary cross-
entropy using the following code:
It is also found that the binary cross-entropy roughly doubles the cross-
entropy, leading to an enhanced discrepancy measure.
In the previous two examples, we studied two cases with the true distribution
at an extreme: one of its probabilities is 1.0 and the rest are zeros. For both examples, we
observed an enhanced entropy measure using the binary cross-entropy. In
this example, we consider a more even true distribution and examine the
behavior of the binary cross-entropy.
[0.3 0.3 0.2 0.2] [0.4 0.2 0.3 0.1] converse: [0.7 0.7 0.8 0.8] [0.6 0.8 0.7 0.9]
In this case, it is also found that the binary cross-entropy does not give
an enhancement.
Same as the previous example, but now consider a case with a poor prediction.
[0.1 0.3 0.2 0.2] [0.4 0.05 0.05 0.5] converse: [0.9 0.7 0.8 0.8] [0.6 0.95 0.95 0.5]
Cross-entropy cHpq: 0.43203116151910004
Binary cross-entropy bcHpq: 0.7048313483737685
Consider a statistical event with more than one random variable occurring
jointly. When we deal with such multiple random variables, we may want to
know the joint probability Pr(A, B): the probability of both A = a and B =
b occurring simultaneously, for given elements a and b.
It is clear that for any values a and b, Pr(A, B) ≤ Pr(A = a), because
Pr(A = a) is measured regardless of what happens to B. For A and B to
happen jointly, A has to happen and B also has to happen (and vice versa).
Thus, A, B cannot be more likely than A or B occurring individually.
The ratio Pr(A, B)/Pr(A) is called the conditional probability and is denoted by Pr(B|A),
which is the probability that B happens under the condition that A has
happened. This leads to the important Bayes' theorem: Pr(B|A) = Pr(A|B) Pr(B)/Pr(A).
4.8.1 Formulation
Based on Bayesian statistics, a popular algorithm has been developed,
known as the Naive Bayes Classifier. Consider an event with p variables
x = {x1, x2, . . . , xp} ∈ Xp. We assume that any variable xi is independent
of the others. For a given label y, the conditional probability for observing x is
expressed as

p(x|y) = ∏i p(xi|y),  i = 1, 2, . . . , p    (4.10)
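As a quick illustration of Eq. (4.10), the following minimal sketch (with assumed toy numbers, not from the book) computes the class posterior for a sample with three binary features and two classes:

import numpy as np

# assumed conditional probabilities p(x_i = 1 | y): one row per
# feature, one column per class
p_xi_1 = np.array([[0.9, 0.2],
                   [0.7, 0.3],
                   [0.1, 0.8]])
p_y = np.array([0.5, 0.5])      # assumed class priors

x = np.array([1, 1, 0])         # one sample with binary features

# Eq. (4.10): p(x|y) is the product over i of p(x_i|y)
likelihood = np.prod(np.where(x[:, None] == 1, p_xi_1, 1 - p_xi_1),
                     axis=0)
posterior = likelihood * p_y
posterior /= posterior.sum()    # normalize via Bayes' theorem
print('p(y|x) =', posterior)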
print('type:',type(mnist_train))
Figure 4.9: One sample image of handwritten digit from the MNIST dataset.
The model has been trained using the training dataset. We now plot the
“trained” model.
import matplotlib.pyplot as plt
%matplotlib inline
fig, figarr = plt.subplots(1, 10, figsize=(15, 15))
for i in range(10):
    figarr[i].imshow(xcount[:, i].reshape((28, 28)).asnumpy(),
                     cmap='hot')
    figarr[i].axes.get_xaxis().set_visible(False)
    figarr[i].axes.get_yaxis().set_visible(False)
plt.show()
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(py.asnumpy(),nd.sum(py).asnumpy())
For the given image x, a feature xi is binary and takes values of either 1
or 0. Because we are using the trained model to compute the probabilities, we
shall have

p(xi|y) = p(xi = 1|y)       for predicting xi is on (xi = 1)
p(xi|y) = 1 − p(xi = 1|y)   for predicting xi is off (xi = 0)    (4.14)
It is clear now that the testing is essentially measuring the binary cross-
entropy of the distribution of a given image (the true distribution) against that
of the average image of a labeled digit computed from the dataset (the
model distribution). Therefore, we can write out Eq. (4.16) directly using
the binary cross-entropy formula.
        figarr[0, ctr].axes.get_xaxis().set_visible(False)
        figarr[0, ctr].axes.get_yaxis().set_visible(False)
        ctr += 1
        if ctr == 10:
            break
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
plt.show()
print('True label: ', y)
xi = np.array(xi)
print('Predicted digits:', xi)
print('Correct?', np.equal(y, xi))
np.set_printoptions(formatter={'float': '{: 0.1f}'.format})
print('Maximum probability:', pxm)
Figure 4.11: Predicted digits (in probability) using images from the testing dataset of
MNIST.
4.8.5 Discussion
The test shows that this classifier made one wrong classification for the first
10 digits in the testing dataset. The 9th digit should be 5, but is classified
as 4. For this wrongly classified digit, the confidence level is very close to 1.
The wrong prediction may be due to the incorrect assumption that each pixel
is generated independently, depending only on the label. Clearly, a digit is
a very complicated function of images, and statistical information alone has
its limits. This type of Naive Bayes classifier was popular in the 1980s and
1990s for applications such as spam filtering. For image processing types of
problems, we now have more effective classifiers (such as CNNs; for example, see
Chapter 15).
where z(ŵ; x) reads as "z is a function of ŵ for given x", and all vectors are
defined as

x = [x1, x2, . . . , xp] ∈ Xp ⊂ Rp
x̄ = [1, x] = [x0, x1, x2, . . . , xp] ∈ X̄p ⊂ Rp+1
w = [W1, W2, . . . , Wp] ∈ Wp ⊂ Rp                        (5.2)
ŵ = [b, w] = [W0, W1, W2, . . . , Wp] ∈ Wp+1 ⊂ Rp+1
these variables can be made clear once and for all. Readers may take a moment
to digest this formulation, so that the later formulations can be understood
more easily.
When we write z = xw + b, we call it an xw+b formulation. When we write
z = x̄ŵ, in which the bias b is absorbed into ŵ, we call it an xw formulation.
Both formulations will be used in the book interchangeably, because they are
essentially the same. The xw+b formulation allows explicit viewing of the roles
of weights and biases separately during analysis. The xw formulation is more
concise in derivation processes, and also allows explicit expressions of affine
transformations, which are most essential for major machine learning models.
z = 0 · w = 0    (5.3)

This means that a constant c ∈ R can never be predicted by the hypothesis
without b. This also means that a pure linear transformation through w is
insufficient for proper prediction, because it cannot even predict constants.
On the other hand, when b is there, we simply choose b = c, and the
constant c is then produced by the hypothesis. This implies also that z must
live in an affine space X̄p, an augmented feature space that lives within Xp+1.
y(x) = xk + c (5.4)
The given linear function is predicted exactly, using such a particular choice
of w∗ and b∗ . This means that any given arbitrary linear function of x ∈ Xp
can be predicted using hypothesis Eq. (5.1).
y(x) = x k̂ (5.6)
It is clear that the loss function L(z) is a scalar function of the prediction
function z(ŵ), which is in turn a function of the vector of learning parameters
ŵ. Therefore, L(z) is in fact a functional. It takes a vector ŵ ∈ Wp+1 and
produces a positive number in R. It is also quadratic in ŵ.
In the 2nd line of Eq. (5.7), we first moved the transpose into the first
pair of parentheses, and then factored out x⊤ and x from these two pairs of
parentheses to form the matrix x⊤x, the outer-product of x, which is a p × p
symmetric matrix of rank 1. All these follow the matrix operation rules. Note
that (ŵ − k̂) is a vector. Therefore, Eq. (5.7) is a standard quadratic form.
If x⊤x were SPD, L would have a unique minimum at (ŵ − k̂) = 0, or ŵ∗ = k̂. This
would prove that the prediction function is capable of reproducing any linear
function uniquely in the feature space, and we would be done. However, because
x⊤x has only rank 1, we need to manipulate a little further for deeper
insight.

∂L(ŵ)/∂ŵ = 2 x⊤x (ŵ − k̂) = 0    (5.8)
On the other hand, Eq. (5.1) can be used to perform an affine transfor-
mation, where weights wi (i = 1, 2, . . . , p) are responsible for (pure) linear
transformation and bias b is responsible for translation. Both wi and b are
learning parameters in a machine learning model. To show how Eq. (5.1) is
explicitly used to perform an affine transformation, we perform the following
maneuver in matrix formulations:
First, using each ŵi (i = 1, 2, . . . , k) and Eq. (5.1), we obtain

zi = x̄ ŵi    (5.9)
z = xW
Let us look at some special cases, when Eq. (5.11) is used to perform affine
transformations.
z1 = x k + c (5.12)
which is Eq. (5.5). This means that the prediction of a linear function in the
feature space can be viewed as an affine transformation in the affine space.
Since k in W is the gradient of the function, it is responsible for rotation.
Figure 5.2: A p → 1 neural network with an input layer taking p features and an output
layer of just one single neuron that produces a single prediction function z. This forms
an affine transformation unit or ATU (or a linear function prediction unit). The net on
the left is for the xw+b formulation, and the one on the right is for the xw formulation with
p + 1 neurons in the input layer, in which the one at the top is fixed as 1. Both ATUs are
essentially identical.
study it in great detail using the following examples. Let us first discuss the
data structures. Different ML algorithms may use a different one, and the
following one is quite typical.
p → 1 nets:
Equation (5.1) can be written in the matrix form with dimensionality clearly
specified as follows:

z(ŵ; x) = x w + b = x̄ ŵ    (5.13)

where x is 1×p, w is p×1, b is 1×1, x̄ is 1×(p+1), and ŵ is (p+1)×1.
The prediction function z ∈ X̄p is now clearly specified as a function of w
and b corresponding to any x ∈ Xp. For the ith data-point xi, we have

z(ŵ; xi) = xi w + b = x̄i ŵ    (5.14)

Note that z(ŵ; x) is still a scalar for one data-point. Also, because no further
transformation follows, the augmented form of z is not needed for one-layer nets.
p → k nets:
In hyperspace cases, we would have many, say k, neurons in the output
of the current layer, each neuron performing (independently) an affine
transformation based on the same dataset (see Fig. 5.13). Therefore, the
output should be an array with k entries. The data may be structured in
matrix form:

z(W, b; xi) = [z1 z2 · · · zk]
                                  ⎡ W11 W12 . . . W1k ⎤
                                  ⎢ W21 W22 . . . W2k ⎥
            = [xi1 xi2 · · · xip] ⎢  ..   ..  . .  .. ⎥ + [b1 b2 · · · bk]    (5.15)
                                  ⎣ Wp1 Wp2 . . . Wpk ⎦

where z(W, b; xi) and b are 1 × k, xi is 1 × p, and W is p × k.
The above matrix equation can be written in a concise matrix form as follows, with
all the dimensionality specified clearly:

z(Ŵ; xi) = xi W + b = x̄i Ŵ    (5.16)

where z is 1×k, xi is 1×p, W is p×k, b is 1×k, x̄i is 1×(p+1), and Ŵ is (p+1)×k.
Note that one can stack up as many neurons in a layer as needed, because
the weights for each neuron are independent of those for any other
neuron in the stack. This stacking is powerful because it makes the well-
known universal approximation theory (see Chapter 7) workable.
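To make these matrix shapes concrete, here is a minimal numpy sketch (sizes and values assumed purely for illustration) of a p → k affine transformation in both formulations:

import numpy as np

p, k = 3, 4                              # assumed: 3 features, 4 neurons
x = np.array([[0.5, -1.0, 2.0]])         # one data-point, 1 x p
W = np.random.randn(p, k)                # weight matrix, p x k
b = np.random.randn(1, k)                # biases, 1 x k

z = x @ W + b                            # xw+b formulation, Eq. (5.16)

x_bar = np.hstack([np.ones((1, 1)), x])  # augmented input, 1 x (p+1)
W_hat = np.vstack([b, W])                # augmented weights, (p+1) x k
z_hat = x_bar @ W_hat                    # xw formulation
print(np.allclose(z, z_hat))             # True: the two are identical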
p → k nets with m data-points:
For a dataset with m points, the data may be structured as matrix Xm×p ,
by vertically stacking xi . In this case, m predictions can be correspondingly
made, and the formulation in matrix form becomes

⎡ z1(W, b; x1) ⎤   ⎡ x11 x12 . . . x1p ⎤ ⎡ W11 W12 . . . W1k ⎤   ⎡ b ⎤
⎢ z2(W, b; x2) ⎥   ⎢ x21 x22 . . . x2p ⎥ ⎢ W21 W22 . . . W2k ⎥   ⎢ b ⎥
⎢      ..      ⎥ = ⎢  ..   ..  . .  .. ⎥ ⎢  ..   ..  . .  .. ⎥ + ⎢ ..⎥    (5.17)
⎣ zm(W, b; xm) ⎦   ⎣ xm1 xm2 . . . xmp ⎦ ⎣ Wp1 Wp2 . . . Wpk ⎦   ⎣ b ⎦

where vector B has the same b for all entries. The above matrix can be
written in a concise matrix form as follows, with all the dimensionality
specified clearly:

Z(W, B) = X W + B = X̄ Ŵ    (5.18)

where Z is m×k, X is m×p, W is p×k, B is m×k, X̄ is m×(p+1), and Ŵ is (p+1)×k.
where ŵII = [bII , wII ] in which bII and wII are a changed set of learning
parameters in W3 .
This results in a transformed data-point Z̄ = [1, zI, zII] ∈ X̄2.
The above procedure, using affine transformations on the original dataset
X̄ ∈ X̄2 by varying ŵ two times, results in a transformed dataset Z̄ that is in
the same affine space X̄2: an automorphism.
We now write a code to demonstrate the affinity of the above transforma-
tion. Because X̄2 is also a 2D plane, we can conveniently plot both the original
and transformed patterns together in the space R2 using only zI and zII for
visualization and analysis. We first define some functions.
import numpy as np

def rectangle(MinX, MaxX, MinY, MaxY, dd, theta):
    # Define a rectangular pattern in 2D space (x1, x2), rotated by
    # theta. (The function header and first lines are reconstructed
    # here to match the call rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
    # used later.)
    x = np.arange(MinX, MaxX+dd, dd)
    y = np.arange(MinY, MaxY+dd, dd)
    xmin = np.full(y.shape, MinX)
    xmax = np.full(y.shape, MaxX)
    ymin = np.full(x.shape, MinY)
    ymax = np.full(x.shape, MaxY)
    x1 = np.append(np.append(np.append(x, xmax), np.flip(x)), xmin)
    x2 = np.append(np.append(np.append(ymin, y), ymax), np.flip(y))
    x1 = np.append(x1, (MaxX+MinX)/2)          # add the center
    x2 = np.append(x2, (MaxY+MinY)/2)
    X1 = x1*np.cos(theta) + x2*np.sin(theta)   # rotate the pattern
    X2 = x2*np.cos(theta) - x1*np.sin(theta)
    X = np.stack((X1, X2), axis=-1)            # X has two components
    return X1, X2, X
def spiral(alpha, c):
    # Define a spiral pattern in 2D space (x1, x2)
    xleft, xright, xdelta = 0.0, 40.01, 0.1
    x = np.arange(xleft, xright, xdelta)
    x1 = np.exp(alpha*x)*np.cos(x)/c  # logarithmic spiral function,
    # x1 with decay rate alpha & scaling factor c.
    x2 = np.exp(alpha*x)*np.sin(x)/c  # x2 value.
    x1 = np.append(x1, 0)             # add the center
    x2 = np.append(x2, 0)
    X = np.stack((x1, x2), axis=-1)   # X has two components.
    return x1, x2, X
Next, we define a function for plotting these patterns: the initial one, the
affine-transformed one with wII and bII, and the linearly transformed one with
wII but with b0II = 0.
%matplotlib inline
import matplotlib.pyplot as plt

def affineplot(x1, x2, X, w0_I, b0_I, w_I, b_I, w_II, b_II):
    plt.figure(figsize=(4., 4.), dpi=90)
    plt.scatter(net(X,w0_I,b0_I), net(X,w0_II,b0_II), label=\
        "Original: w0I=["+str(w0_I[0])+","+str(w0_I[1])+
        "], b0I="+str(b0_I)+"\n w0II=["+str(w0_II[0])+","+
        str(w0_II[1])+"], b0II="+str(b0_II), s=10, c='orange')
    # plot the initial pattern
    plt.scatter(net(X,w_I,b_I), net(X,w_II,b_II), label=\
        "Affine: wI=["+str(w_I[0])+","+str(w_I[1])+
        "], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
        str(w_II[1])+"], bII="+str(b_II), s=10, c='blue')
    # plot the affine-transformed pattern
    plt.scatter(net(X,w_I,b_I), net(X,w_II,b0_II), label=\
        "Linear: wI=["+str(w_I[0])+","+str(w_I[1])+
        "], bI="+str(b_I)+"\n wII=["+str(w_II[0])+","+
        str(w_II[1])+"], b0II="+str(b0_II), s=10, c='red')
    # plot the linearly transformed pattern
    plt.xlabel('$z_{I}$')
    plt.ylabel('$z_{II}$')
    plt.title('linear and affine transformation')
    plt.grid(color='r', linestyle=':', linewidth=0.3)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.axis('scaled')
    #plt.ylim(-5,9)
    plt.show()

x1, x2, X = edge(1.5, 0.2)
affineplot(x1, x2, X, w0_I, b0_I, w_I, b_I, w_II, b_II)
x1,x2,X = edge(1.5,0.2)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
From the above two figures, several observations can be made.
Let us now take a look at the affine (and linear) transformation to a circle
using the same code.
x1,x2,X = circle(1.0,0.1)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
This time, it is clearly seen from Fig. 5.5 that after the transformation, the
original (orange) circular pattern is rotated, scaled, sheared, and translated
to an ellipse (blue). The points that we have observed for the rectangles are
still valid.
x1,x2,X = spiral(0.1,10.)
affineplot(x1,x2,X,w0_I,b0_I,w_I,b_I,w_II,b_II)
In this case, the original (orange) spiral is rotated, scaled, sheared, and
translated to a new one. The observations made for the above two examples
still hold.
Figure 5.7: An image of a leaf of the fern-like fractal is an affine transformation of another.
transformation of the dark blue leaf, or any of the light blue leaves. The fern
seems to have this typical pattern coded as an ATU in its DNA. This implies
that ATU is as fundamental as DNA.
x1,x2,X = edge(5.0,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
x1,x2,X = rectangle(-2.,2.,-1.,1.,0.2,np.pi/4)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
2. The affinity is, however, destroyed: the ratios of distances between points
lying on a straight line are changed. Not all the parallel line segments
remain parallel after the sigmoid transformation. The use of the sigmoid
function clearly brings nonlinearity. This gives the net the following
capabilities:
• The output φ(z(ŵ; x)) is now nonlinearly dependent on the features x.
One can now use it for logistic regression for labels given as 0 or 1, by
training ŵ.
• φ(z(ŵ; x)) is linearly independent of the features x in the input. This
allows further affine transformations to be carried out in a chain to the
next layer if needed.
• φ(z(ŵ; x)) is also linearly independent of the features ŵ used in this
layer. When we need more layers in the net, fresh ŵs can now be used
for the next layers independently. This enables the creation of deepnets.
Let us now take a look at the wrapped affine transformation applied to a circle
pattern, using the same code.
x1,x2,X = circle(2.5,0.1)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
This case shows more severe shape distortion. The uniqueness of the point-
to-point transformation is still preserved. The following is for the wrapped
affine transformation of the spiral pattern.
x1,x2,X = spiral(0.1,10.)
sigmoidaffine(x1,x2,X,w0_I,b0_I,w0_II,b0_II,w_I,b_I,w_II,b_II)
Based on Eq. (5.1), we now see a situation where a dataset (x, z) can
be encoded in a set of learning parameters ŵ in the hypothesis space.
Figure 5.12 shows schematically such an encoded state.
The straight lines in Fig. 5.12 are encoded with a point in the hypothesis
space W2 . For example, the red line is encoded by a red dot at w0 = 1 and
w1 = 1. In other words, using w0 = 1 and w1 = 1, we can reproduce the red
line. The blue line is encoded by a blue dot at w0 = 2 and w1 = −0.5, with
which the blue line can be reproduced. The same applies to the black line. A
machine learning process is to produce an optimal set of dots using a dataset.
After that, one can then produce lines. In essence, a machine learning model
converts or encodes the data to wi . This implies that the size and quality of
Figure 5.12: Data (on relations of x-z or x-y for given labels) encoded in model
parameters ŵ in the hypothesis space. In essence, a ML model converts data to ŵ during
training.
the dataset are directly related to the dimension of the affine spaces used in
the model.
On the other hand, if one tunes wi, different prediction functions can be
produced in the label space. Therefore, it is possible to find a set of
wi that makes the prediction match the given labels in the dataset for the given
data-points. Finding such a set of wi is the process of learning. Real machine
learning models are a lot more complicated, but this gives the essential mechanism.
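A minimal numpy sketch of this encoding mechanism (an illustration, not the book's code): data sampled from the red line z = 1 + x are encoded into ŵ = [w0, w1] by least squares, and the line is then reproduced from ŵ alone.

import numpy as np

x = np.linspace(-2, 2, 21)
z = 1.0 + 1.0*x                             # data from the "red line"

X_bar = np.vstack([np.ones_like(x), x]).T   # augmented inputs
w_hat, *_ = np.linalg.lstsq(X_bar, z, rcond=None)
print('encoded parameters:', w_hat)         # approximately [1.0, 1.0]

z_reproduced = X_bar @ w_hat                # decode: reproduce the line
print(np.allclose(z, z_reproduced))         # True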
which holds for arbitrary x. This same line can also be expressed as

z = x̄ ŵ(2)    (5.22)

which holds also for arbitrary x. Using now Eqs. (5.21) and (5.22), we obtain

x̄ (ŵ(1) − ŵ(2)) = 0    (5.23)

Equation (5.23) must also hold for arbitrary x. Therefore, we shall have
ŵ(1) = ŵ(2). This means that these two lines are the same, which completes the proof.
In fact, the uniqueness can be clearly observed from Fig. 5.12, because a
line is uniquely determined by its slope and bias, and both are given by ŵ.
The uniqueness is one of the most fundamental reasons for a quality
dataset to be properly encoded with the learning parameters based on the
hypothesis of affine transformations, or for a machine learning model to be
capable of reliably learning from data.
∇ŵ z = x̄    (5.29)
We can now duplicate vertically the single neuron on the right in Fig. 5.2
for k times and let all neurons become densely connected, meaning that
each neuron in the output is connected with each of the neurons in the input
(also known as fully connected). This forms a mapping of p → k network.
The prediction functions z from an ATA can be expressed as
z(Ŵ; x) = x W + b = x̄ Ŵ    (5.30)
Figure 5.13: A p → k neural network with one input layer of p neurons and one output
layer of k neurons that produces k prediction functions zi (i = 1, 2, . . . , k). Each neuron
at the output connects to all the neurons in the input with its own weights. This stack
of ATU forms an affine transformation array or ATA. In other words, the p → k net
has the predictability of k functions in the feature space with p dimensions. Left: xw+b
formulation; Right: xw formulation.
which can also be regarded as the flattened Ŵ. The total number of the
learning parameters is
P = (p + 1) × k
It is clear that the hypothesis space grows fast in multiples for an ATA.
Equation (5.30) is the matrix form of a set of affine transformations.
It is important to note that each zj(wj, bj) is computed using Eq. (5.1)
with its own weights wj and bias bj. This enables all zj(wj, bj), j =
1, 2, . . . , k to be independent of each other. Therefore, the ATA given in
Fig. 5.13 creates the simplest mapping that can be used for k-dimensional
regression problems using a dataset with p features. Note also that when
k = p, it can perform the p → p affine transformation.
It becomes a set of new features xi(new), i = 1, 2, . . . , k, that are linearly
independent of the original features xi, i = 1, 2, . . . , p. These new features
can then be used as inputs to the next layer. This allows the use of a new
set of learning parameters for the next layer.
It is clear that a role of the activation function is to make the outputs from
an ATA linearly independent of those of the previous ATA, enabling further
affine transformations and leading to a chain of ATAs, a deepnet. To fulfill this
important role, the activation function must be nonlinear.
Such a chain of stacks of prediction functions (affine transformations)
wrapped with nonlinear activation functions gives a very complex deepnet,
resulting in a complex prediction function. Further, when affine transfor-
mation Eq. (5.1) is replaced with spatial filters, one can build a CNN
(see Chapter 15) for object detection, and when replaced with temporal
filters, we may have an RNN (see Chapter 16) for time sequential models,
and so on.
These new features given in Eq. (5.31) can now be used as the inputs for the
next layer to form a deepnet. To illustrate this more clearly, we consider a
simplified deepnet with 4 − 2 − 3 neurons shown in Fig. 5.14.
Figure 5.14: Schematic drawing of a chain of stacked affine transformations wrapped with
activation functions in a deepnet for approximation of high-order nonlinear functions of
high dimensions. This case is an xw+b formulation. A deepnet using xw formulation will
be given in Section 13.1.4.
Here, let us use the number in parentheses to indicate the layer number:

1. Based on 4 (independent input) features xi(1) (i = 1 ∼ 4) to the first layer,
a stack of 2 affine transformations zi(1) (i = 1 ∼ 2) takes place, using a
4 × 2 weight matrix W(1) and biases bi(1) (i = 1 ∼ 2). Affine transformation
z1(1) uses wi1(1) (i = 1 ∼ 4) and b1(1), and z2(1) uses wi2(1) (i = 1 ∼ 4) and b2(1).
Clearly, these are carried out independently using different sets of weights
and biases.
2. Next, z1(1) and z2(1) are, respectively, subjected to a nonlinear activation
function φ, producing 2 new features xi(2) (i = 1 ∼ 2). Because of the
nonlinearity of φ, xi(2) will no longer linearly depend on the original
features xi(1) (i = 1 ∼ 4).
3. Therefore, xi(2) (i = 1 ∼ 2) can now be used as independent inputs for
the 2nd layer of affine transformations, using a 2 × 3 weight matrix W(2)
and biases bi(2) (i = 1 ∼ 3), in the same manner. This results in a stack
of 3 affine transformations zi(2) (i = 1 ∼ 3), which can then be wrapped
again with nonlinear activation functions. This completes the 2nd layer
of 3 stacked affine transformations in a chain.
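A minimal numpy sketch of this 4 − 2 − 3 forward pass (weights and biases are randomly assumed, purely to show the shapes and the chaining) is:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

x1 = np.random.randn(1, 4)   # 4 input features, layer 1

W1 = np.random.randn(4, 2)   # 4 x 2 weight matrix W(1)
b1 = np.random.randn(1, 2)   # biases b(1)
z1 = x1 @ W1 + b1            # stack of 2 affine transformations
x2 = sigmoid(z1)             # 2 new features x(2)

W2 = np.random.randn(2, 3)   # 2 x 3 weight matrix W(2)
b2 = np.random.randn(1, 3)   # biases b(2)
z2 = x2 @ W2 + b2            # stack of 3 affine transformations
out = sigmoid(z2)

P = W1.size + b1.size + W2.size + b2.size
print('P =', P)              # (4*2+2) + (2*3+3) = 19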
The above process can continue as desired to increase the depth of the neural
network. Note also that the number of neurons in each layer can be arbitrary
in theory. Because of the stacking and chaining, the hypothesis space is
greatly increased. The stacking causes the increase in multiples, and the
chaining in additions. The prediction functions may live in an extremely high
dimensional space WP for deepnets. For this simple deepnet of 4 − 2 − 3, the
dimension of the hypothesis space becomes P = (4× 2+ 2)+ (2× 3+ 3) = 19.
In general, for a net of p − q − r − k, for example, the formulation should be

P = (p × q + q) + (q × r + r) + (r × k + k)    (5.32)

where the three terms are contributed by layers 1, 2, and 3, respectively.
where NL is the total number of hidden layers in the MLP. Note that we
may not really perform the foregoing flattening in actual ML models. It is
just for demonstrating the growth of the dimension of the hypothesis space.
In actual computations, we may simply group them in a Python list, and
use an important autograd algorithm to automatically perform the needed
Figure 5.15: 1 → 1 → 1 net with sigmoid activation at the hidden and last layers.
x(3) = c0 + c1 x + c2 x2 + c3 x3 + · · ·    (5.35)

where these constants are given, through a lengthy but simple derivation, by

c0 = −(1/16) b(1) b(2) w(1) (b(1) − b(2) w(1)) − (1/48) b(2) w(1) ((b(2) w(1))2 − 12)
     − (1/48) b(1) ((b(1))2 − 12)
c1 = −(1/16) w(1) w(2) ((b(1))2 + 2 b(1) b(2) w(1) + (b(2) w(1))2 − 4)    (5.36)
c2 = −(1/16) (w(1))2 w(2) (2 b(1) + b(2) w(1))
c3 = −(1/48) (w(1))3 (w(2))3
In the opinion of the author, these are the fundamental reasons for various
types of deepnets being capable of creating p → k mappings for extremely
complicated problems, from p input features to k labels (targeted features)
existing in the dataset. We now summarize our discussion into a Universal
Prediction Theory [6].
x̄ = [1, x]    (5.37)
x̄ = [1, x, x2]    (5.38)
If one knows the dataset well and believes that a particular function can be
used as a basis function, one may simply add it as an additional feature.
The use of nonlinear functions as bases for features is also related to the
so-called support vector machine (SVM) models that we will discuss in
Chapter 6, where we use kernel functions for linearly un-separable classes.
This kind of nonlinear feature basis or kernel is sometimes called a feature
function.
In our neural network models, higher-order and enrichment basis func-
tions can also be used in higher dimensions. For example, for two-dimensional
spaces, we may have features like
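The feature list itself is not reproduced here; as an illustration (an assumed example rather than the book's original list), second-order polynomial features in (x1, x2) can be generated as follows:

import numpy as np

def poly2_features(x1, x2):
    # second-order polynomial basis in two dimensions:
    # [1, x1, x2, x1^2, x1*x2, x2^2]
    return np.array([1.0, x1, x2, x1*x1, x1*x2, x2*x2])

print(poly2_features(2.0, 3.0))   # [1. 2. 3. 4. 6. 9.]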
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-
Heinemann, London, 2013.
[2] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods, Taylor and Francis
Group, New York, 2010.
[3] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor and
Francis Group, New York, 2010.
[4] G.R. Liu and Gui-Yong Zhang, Smoothed Point Interpolation Methods: G Space Theory
and Weakened Weak Forms, World Scientific, London, 2013.
[5] G.R. Liu, A Neural Element Method, International Journal of Computational Methods,
17(07), 2050021, 2020.
[6] G.R. Liu, A thorough study on affine transformations and a novel Universal Prediction
Theory, International Journal of Computational Methods, 19(10), in press, 2022.
Chapter 6
The Perceptron and SVM
the translational location of the red-dashed line along its normal. We then
have the unit normal vector w/‖w‖ with a length of 1. For any point (not
necessarily the data-point) in the 2D space (marked with a small cross in
Fig. 6.1), we can form a vector x starting at the origin. Now, the dot-product

x · w/‖w‖    (6.1)

becomes the length of the projection of x on the unit normal w/‖w‖. Therefore,
it is the measure that we need to determine how far point x is away from the
origin in the direction of w/‖w‖, which is a useful piece of information. Because
we do not yet know the translational location of the red line in relation to x,
we thus introduce a parameter b/‖w‖, where b ∈ W1 is an adjustable parameter
that allows the red line to move up and down along w.
Notice in Eq. (6.1) that we used the dot-product (the inner product). This
is the same as the matrix-product we used in the Python implementation,
because their shapes match: x is a (row) vector, and w is a column vector (a
matrix with a single column) of the same length. Therefore, we use both
interchangeably in this book.
The equation for an arbitrary line in relation to a given point x shall have
this form:

x · w + b    (6.2)
Note that Eqs. (6.3) and (6.4) are for ideal situations where these two sets of
points might be infinitely close. In practical applications, we often find that
these points are in two distinct classes, separated by a street with a finite
width w (that may be very small). The formulation can now be modified as
follows:
This type of equation is also known as the decision rule: when the condition
is satisfied by an arbitrary point x, it then belongs to a labeled class (y = 1
or y = −1), when the parameters w and b are known. We made excellent
progress.
It is obvious that Eqs. (6.5) and (6.6) can be magically written in a single
equation by putting these two conditions together with their corresponding
labels y.
y(x · w + b) > w/2 or y(x · ŵ) > w/2 or mg > w/2 (6.7)
yi (xi · w + b) > w/2 or yi (xi · ŵ) > w/2 or mg(i) > w/2 (6.8)
where mg(i) is the margin for the ith data-point. Because there are an
infinite number of lines (such as the red-dashed and red-dotted lines shown in
Fig. 6.1) for such a separation, there exist multiple solutions to our problem.
We just want to find one of them that satisfies Eq. (6.8) for all data-points
in the dataset. This process is called training. Because labels are used, it is
a supervised training. The trained model can be used to predict the class of
a given data-point (which may not be from the training dataset), known as
classification or prediction in general. The following is an algorithm to
perform all those: training as well as prediction.
import mxnet as mx
from mxnet import nd, autograd
import matplotlib.pyplot as plt
import numpy as np

# We now generate a synthetic dataset for this examination.
mx.random.seed(1)   # for repeatable output of this code

# define a function to generate the dataset that is
# separable with a margin strt_w
def getfake(samples, dimensions, domain_size, strt_w):
    wfake = nd.random_normal(shape=(dimensions))   # weights
    bfake = nd.random_normal(shape=(1))            # bias
    wfake = wfake / nd.norm(wfake)                 # normalization
    # generate linearly separable data, with labels
    X = nd.zeros(shape=(samples, dimensions))      # initialization
    Y = nd.zeros(shape=(samples))
    i = 0
    while (i < samples):
        tmp = nd.random_normal(shape=(1, dimensions))
        margin = nd.dot(tmp, wfake) + bfake
        if (nd.norm(tmp).asscalar() < domain_size) & \
           (abs(margin.asscalar()) > strt_w):
            X[i, :] = tmp[0]
            Y[i] = 1 if margin.asscalar() > 0 else -1
            i += 1
    return X, Y, wfake, bfake
# Plot the data with colors according to the labels
def plotdata(X, Y):
    for (x, y) in zip(X, Y):
        if (y.asscalar() == 1):
            plt.scatter(x[0].asscalar(), x[1].asscalar(), color='r')
        else:
            plt.scatter(x[0].asscalar(), x[1].asscalar(), color='b')

street_w = 0.1
ndim = 2
X, Y, wfake, bfake = getfake(50, ndim, 3, street_w)
# generates 50 points in 2D space with a margin of street_w
plotdata(X, Y)
plt.show()
Figure 6.2: Computer-generated data-points that are separable with a straight line.
plt.plot(Xa.asnumpy()*cs, Xa.asnumpy()*si, zorder=1)   # results
for (x, y) in zip(Xa, Y):
    if (y.asscalar() == 1):
        plt.scatter(x.asscalar()*cs, x.asscalar()*si, color='r')
    else:
        plt.scatter(x.asscalar()*cs, x.asscalar()*si, color='b')

cs = (wfake[0]/nd.norm(wfake)).asnumpy()  # projection on true norm
si = (wfake[1]/nd.norm(wfake)).asnumpy()
Xa = nd.dot(X, wfake) + bfake             # with true bias
plt.plot(Xa.asnumpy()*cs, Xa.asnumpy()*si, zorder=1)
for (x, y) in zip(Xa, Y):
    if (y.asscalar() == 1):
        plt.scatter(x.asscalar()*cs, x.asscalar()*si, color='r')
    else:
        plt.scatter(x.asscalar()*cs, x.asscalar()*si, color='b')
plt.show()
Figure 6.4: Data-points projected on a straight line that is perpendicular to a line that
separates these data-points.
It is seen that
• When these points are projected on a vector that is not the true normal
(along the blue line) direction, the blue and red points are mixed along
the blue line.
• When the true normal direction is used, all these points are distinctly
separated into two classes, blue and red, along the orange line. This dataset
is linearly separable.
Let us see how the Perceptron algorithm finds a direction and the bias,
and hence the red line that separates these two classes. We use again the algo-
rithm available at mxnet-the-straight-dope (https://github.com/zackchase/
mxnet-the-straight-dope). The algorithm is based on the following encour-
agement rule: positive events should be encouraged and negative ones should
be discouraged. This rule is used with the decision rule discussed earlier for
each data-point in a given dataset.
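The update function itself is not reproduced above. A minimal sketch (an assumed form matching the call Perceptron(w, b, x, y, wi) used in the convergence test below, not the book's original code) is:

w = nd.zeros(shape=(ndim))    # current weights, updated in place
b = nd.zeros(shape=(1))       # current bias

def Perceptron(w, b, x, y, strt_w):
    # encouragement rule: if the point is inside the street or on
    # the wrong side of it, push the line toward the correct side
    if (y * (nd.dot(x, w) + b)).asscalar() <= strt_w/2:
        w += y * x            # encourage/discourage via the label sign
        b += y
        return 1              # an update was made
    return 0                  # the point is classified correctly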
It is seen that all the red dots are on the positive side of the straight line
of x · w + b = 0 with the learned parameters of the weight vector w∗ and
bias b∗ . All the data marked with blue dots are on the negative side of the
line. In the entire process, all these points stay still, and the updates are
done only on the weight vector w and bias b. We shall now examine the
fundamental reasons for this simple algorithm to work.
• If the data-points are linearly separable, meaning that there exists at least
one pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and (b∗)2 ≤ 1, such that
yi(xi · w∗ + b∗) ≥ w/2 > 0 for all data pairs, where w is a given scalar of
the street width,
• then the Perceptron algorithm converges after at most t = 2(R2 + 1)/w2 ∝
(R/w)2 iterations, with a pair of parameters (wt, bt) forming a line x · wt + bt = 0
that separates the data-points into two classes.
We now prove this theorem largely following the procedure with codes
at mxnet-the-straight-dope (https://github.com/zackchase/mxnet-the-
straight-dope/blob/master/chapter01_crashcourse/probability.ipynb), under
the Apache-2.0 License. We first check the convergence behavior numerically
(this may take minutes to run).
ws = np.arange(0.025, 0.45, 0.025)  # generate a set of street widths
number_iterations = np.zeros(shape=(ws.size))
number_tests = 10
for j in range(number_tests):       # set number of tests to do
    for (i, wi) in enumerate(ws):
        X, Y, _, _ = getfake(1000, 2, 3, wi)   # generate dataset
        for (x, y) in zip(X, Y):
            number_iterations[i] += Perceptron(w, b, x, y, wi)
        # for each test, record the number of updates
number_iterations = number_iterations / 10.0
plt.plot(ws, number_iterations, label='Average number of iterations')
plt.legend()
plt.show()
The test results are plotted in Fig. 6.9. It shows that the number of
iterations increases as the street width w decreases, and
the rate is roughly quadratic (inversely). This test supports the convergence
theorem. Let us now prove this in a more rigorous mathematical manner.
The proof assumes that the data are linearly separable. Therefore, there
exists a pair of parameters (w∗, b∗) with ‖w∗‖ ≤ 1 and (b∗)2 ≤ 1. Let us
examine the inner product of the current set of parameters ŵ with the
assumed existing ŵ∗ at each iteration. What we would like the iteration
to do is to update the current ŵ to approach ŵ∗ iteration by iteration, so
that their inner product gets bigger and bigger. Eventually, they can be
parallel with each other. Let us see whether this is really what is happening
in the Perceptron algorithm given above. Our examination is also iteration
by iteration, but considers only the iterations when an update is made by the
algorithm, because the algorithm does nothing otherwise. This means that
we perform an update only when yt(xt · ŵt) ≤ w/2 at the tth step.
At the initial setting in the algorithm, t = 0, we have no idea what ŵ
should be, and thus set ŵ0 = 0. Here, for a neat formulation using the dot-product,
we assume that the column vectors ŵ0 and ŵ∗ are flattened to (row) vectors, so
that ŵ can have a dot-product directly with any other (flattened) ŵ resulting
from the iteration process. This can be done easily in numpy using the flatten()
function. Thus we have at the initial setting,
ŵ0 · ŵ∗ = 0
street defined by the line with the current ŵ1. Next, we perform the following
updates:

ŵ1 = ŵ0 + y1 x̂1

We thus have,

ŵ2 = ŵ1 + y2 x̂2

We now have,

‖ŵt+1‖2 ≤ ‖ŵt‖2 + R2 + 1 + w

so that at the first update,

‖ŵ1‖2 ≤ ‖ŵ0‖2 + R2 + 1 + w = R2 + 1 + w

Using the Cauchy-Schwarz inequality, i.e., ‖a‖ ‖b‖ ≥ a · b, and then Eq. (6.9),
we obtain,
x = [x1 , x2 , . . . , xp ] (6.15)
This middle line in a 2D feature space X2 is shown in Fig. 6.10 with a red dash-
dot line, and is approximated using the weights w and bias b in hypothesis
space W3.
Figure 6.10: Linearly separable data-points in 2D space, the width of the street is to be
maximized using SVM.
x1 · w + b = +1 (6.17)
x2 · w + b = −1 (6.18)
Consider now the projections of these two support vectors, x1 and x2, on the
unit normal of the middle line for the street, w/‖w‖. The projection of x1 gives
the (Euclidean) distance of the upper-right gutter to the origin along the
normalized w. Similarly, the projection of x2 gives the distance of the lower-
left gutter to the origin along the normalized w. Therefore, their difference
gives the width of the street:

w = x1 · w/‖w‖ − x2 · w/‖w‖ = (x1 · w − x2 · w)/‖w‖    (6.19)
Figure 6.11: Change of width of the street when the street is turned with respect to w.
work. To examine and view what is really happening here, let us look at a
simpler setting where both x1 and x2 (that are on the gutters) are sitting
on the x2-axis, as shown in Fig. 6.11.
When the gutters are at the horizontal direction, the equation for the
upper gutter is
0 · x1 + 1 · x2 = +1 + b (6.21)
The street width is w0 . The normal vector w and its norm are given as
follows:
w = 0 1 , w = 02 + 12 = 1 (6.22)
k · x1 + 1 · x2 = +1 + b (6.23)
The new normal vector wk and its norm are given as follows:
wk = k 1 , wk = k 2 + 12 (6.24)
It is obvious that the street width after the rotation, wk, is clearly smaller
than the original street width before the rotation, w0, while the norm of w
has increased. This is true for any nonzero value of k. The street width is at
its maximum when the street is along the horizontal direction, which is perpendicular
to the vector x1 − x2. We write the following code to plot this relationship:
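The plotting code itself is not reproduced here; a minimal sketch (assuming the street width is 2/‖wk‖, consistent with Eq. (6.20)) that produces plots like those in Fig. 6.12 is:

import numpy as np
import matplotlib.pyplot as plt

k = np.arange(0.0, 3.0, 0.05)     # slope of the gutter
norm_wk = np.sqrt(k**2 + 1.0)     # norm of wk, Eq. (6.24)
width = 2.0/norm_wk               # street width after the rotation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(k, width)
ax1.set_xlabel('slope k of the gutter')
ax1.set_ylabel('street width')
ax2.plot(norm_wk, width)
ax2.set_xlabel(r'$\|w_k\|$')
ax2.set_ylabel('street width')
plt.tight_layout()
plt.show()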
Figure 6.12: Variation of the street width with the slope of the street gutter (left) and
with the norm of the weight vector (right).
It is clear from Fig. 6.12 that the street width is inversely related to the
norm of w.
Most importantly, this analysis shows that if the street width is maxi-
mized, w must be perpendicular to the gutters (decision boundaries). This
conclusion is true for the arbitrary pair of data-points x1 and x2 on these
two gutters.
Now, Eq. (6.19) can be rewritten as

w = [x1 − x2] · w/‖w‖    (6.25)

where [x1 − x2] is the data pair in linear polynomial bases. This means that
when w is maximized, the inner product of [x1 − x2] and w/‖w‖ is maximized
(these two vectors are parallel), where [x1 − x2] is a vector of pairs of data-points
in the linear polynomial bases x1, x2, . . ., used to approximate a line in the
feature space, and w/‖w‖ is the vector of the normalized weights or
tuning/optimization parameters. The use of linear polynomial bases here
is because we assume the data-points are linearly separable by a hyperplane.
Remember our original goal is to find the maximum street width. Based on
Eq. (6.20), this is equivalent to minimizing the norm of w, which in turn
is the same as minimizing ½‖w‖2. The benefit of such simple conversions
will soon be evidenced. We now have our objective function:

L = ½‖w‖2 = ½ w⊤w    (6.26)
The above function needs to be minimized. We see a nice property of the
above formulation: the objective function is quadratic, and its Hessian matrix
is a unit matrix that is clearly SPD. Therefore, it has one and only one
minimum: the local minimum is the global one. This is the fundamental reason
why local minima are not a concern when an SVM model is used.
Notice that our minimization has to be done under the following condition
or constraint:
for all the data-points. Assume that there are a total of m data-points for
our p dimensional problem. In matrix form, this set of constraint equations
has the following form:
The modified loss function with only equality constraints can now be
written as

L = ½ w⊤w − Σi λi ([yi(xi · w + b) − 1] − si2),  i = 1, . . . , m

in which the si are slack variables.
λi [yi (xi · w + b) − 1] = 0
Therefore, for all the data-points that satisfy [yi(xi · w + b) − 1] > 0 (data-
points away from the street), the corresponding λi must be zero, meaning
that it is not active. This proves that Eq. (6.29) can be used, because it
adds only zeros. We will use Eq. (6.29) to derive the equations for our dual
problem.
When [yi(xi · w + b) − 1] = 0, the data-points are on the gutters, and
the added cost is zero. Thus, λi can be positive nonzero. For it to be
bounded, some additional condition on λi is needed, which will be obtained
naturally through the minimization conditions.
∂L(w, b, λ)/∂w = w − Xy⊤λ = 0    (6.31)

Noticing that L is a scalar, we seek its partial derivative with respect to
a vector w. Therefore, the outcome is a vector, as expected. Equation (6.31)
leads to

w = Xy⊤λ = Σi λi yi xi,  i = 1, . . . , n    (6.32)

Similarly, setting the partial derivative of L with respect to b to zero leads to

y⊤λ = Σi λi yi = 0,  i = 1, . . . , n    (6.34)
Note that b is now out of the picture, which echoes Eq. (6.20): the width
of the street should not depend on b. This is expected because b affects
only the translational location of the street. In addition, the data-points (in
the feature space) used in the SVM formulation are in the form of inner
product: (xi · xj ). This means that what matters is the interrelationship of
these data-points. This special formulation of SVM allows the effective use
of the so-called nonlinear kernel functions that evaluate the interrelationship
of these data-points for problems with nonlinear decision boundaries. We will
discuss this further in Section 6.4.10.
In practical computation, we may compute first Xy = yX, whose ith row
is simply yi xi , and can be computed efficiently. Equation (6.36) can now be
further simplified as
L = 1⊤λ − ½ λ⊤Pλ    (6.37)

where

P = [Pij] = Xy Xy⊤, with Pij = yi yj (xi · xj)    (6.38)
problem. In addition, our objective function given in Eq. (6.29) must have a
saddle shape: it has a positive curvature along w, and a negative curvature
along λ.
In the theory of Lagrangian multipliers, when the objective function is
altered with λ, the partial differentiations of the modified function can in
general only give stationary points where the extrema reside. These points
are saddle points, which means that along some parameter axes these points
may be minima, and along other parameter axes they may be maxima.
Therefore, whether an extremum is a maximum or a minimum needs to be
examined carefully. To see the issue more clearly, let us take a look at a
simplified problem with a loss function of only two parameters of w and λ.
The simplified function still has the same essential behavior, so that we can
write a Python code and plot a 3D figure to view what is really going on.
import numpy as np
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 220
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d import axes3d
%matplotlib inline
#%matplotlib tk
#plt.legend()
fig_s.tight_layout() # otherwise y-label will be clipped
plt.show()
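The surface-plotting code is only partially reproduced above. A self-contained sketch, using the assumed toy function L(w, λ) = w² − λ² (which has the same saddle behavior), is:

import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-2, 2, 61)
lam = np.linspace(-2, 2, 61)
W, LAM = np.meshgrid(w, lam)
L = W**2 - LAM**2            # minima along w, maxima along lambda

fig_s = plt.figure(figsize=(5, 4))
ax = fig_s.add_subplot(projection='3d')
ax.plot_surface(W, LAM, L, cmap='coolwarm', alpha=0.9)
ax.set_xlabel('w')
ax.set_ylabel(r'$\lambda$')
ax.set_zlabel('L')
fig_s.tight_layout()
plt.show()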
We see a saddle surface and a saddle point at the middle point on the
surface. It is clear now that the function has minima along w (for any fixed
λ) and maxima along λ (for any fixed w).
Back to our general objective functions L given in Eq. (6.37), let us look
at the 2nd differentiation with respect to λ:
∂2L(λ)/∂λ2 = −P    (6.39)

which is the Hessian matrix (−P, a measure of curvature). It is negative
(semi-)definite because P is non-negative definite; we can clearly see again
that our problem has now become an optimization (maximization) problem with
respect to our newly introduced Lagrangian multipliers λ.
Finally, noticing that our objective function is with constraints given in
Eqs. (6.31) and (6.34), our SVM problem can be written as
max over λ:  −½ λ⊤Pλ + 1⊤λ
subject to   λ ≥ 0    (6.40)
             y⊤λ = 0
Note that y⊤ is simply a row vector that collects all the labels of the data-
points, which can be made use of in the algorithm to avoid unnecessary
computation.
Of course, we can easily convert our optimization problem into a
minimization problem as follows, by simply revising the sign of the objective
function:
min over λ:  ½ λ⊤Pλ − 1⊤λ
subject to   λ ≥ 0    (6.41)
             y⊤λ = 0
Such a problem has the general standard form

min  ½ x⊤Px + q⊤x
subject to  Gx ≤ h
            Ax = b
Compared with Eq. (6.41), it is seen clearly that our problem is finally
formulated as a typical and standard problem with a quadratic (and hence
convex) objective function and a set of linear inequality constraints, known as
the quadratic programming (QP) problem. Well-established QP techniques
can readily be used as standard routines to solve this problem, including
the CVXOPT (https://cvxopt.org/), which is a free software package for
convex optimization based on Python. Because of the excellent property of
convexity, one does not need to worry about local optima. It has one
and only one optimum (or minimum), and it is the global optimum.
import cvxopt

class SVM:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Gram matrix: K[i,j] = x_j . x_i
        K = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                K[i, j] = np.dot(X[j], X[i])
        P = cvxopt.matrix(np.outer(y, y) * K)   # use cvxopt
        print('K.shape', K.shape, ' y.shape', y.shape)
        # q = -1 (Nx1)
        q = cvxopt.matrix(-1*np.ones(n_samples))
        # A = y^T
        A = cvxopt.matrix(y, (1, n_samples))
        # b = 0
        b = cvxopt.matrix(0.0)
        # G = -I (NxN)
        G = cvxopt.matrix(-1*np.diag(np.ones(n_samples)))
        # h = 0 (Nx1)
        h = cvxopt.matrix(np.zeros(n_samples))
        solution = cvxopt.solvers.qp(P, q, G, h, A, b)   # solve the QP
        # Lagrange multipliers
        a = np.ravel(solution['x'])   # flatten to a 1D array
        # support vectors have nonzero Lagrange multipliers
        sv = a > 1e-5
        ind = np.arange(len(a))[sv]
        self.a = a[sv]
        self.sv = X[sv]
        self.sv_y = y[sv]
        # Intercept
        self.b = 0
        for n in range(len(self.a)):
            self.b += self.sv_y[n]
            self.b -= np.sum(self.a * self.sv_y * K[ind[n], sv])
        self.b /= len(self.a)
        # Weights
        self.w = np.zeros(n_features)
        for n in range(len(self.a)):
            self.w += self.a[n] * self.sv_y[n] * self.sv[n]
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=255, centers=2,
                  random_state=0, cluster_std=0.65)
y[y == 0] = -1
tmp = np.ones(len(X))
y = tmp * y          # make the labels floats of +1/-1
# Let us take a look at the data.
print('X.shape', X.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='winter');
X.shape (255, 2)
Figure 6.14: Computer-generated data-points that can be separated via a straight line.
Figure 6.15: SVM results of a widest street that separates these data-points.
X_test.shape (64, 2)
y_test.shape (64,)
Projected: [ 8.4316 -2.7710 -5.9351 7.9451 5.31072511]
Predicted: [ 1. -1. -1. 1. 1.]
array([[31, 0],
[ 0, 33]], dtype=int64)
X.shape (255, 2)
(191, 2) (64, 2) (191,) (64,)
Figure 6.16: Computer-generated data-points that can be separated via a straight line
for the scikit-learn SVM classifier.
Figure 6.17: Scikit-learn SVM results of the widest street that separates these data-points.
Notice that the width of the street computed here is a little narrower than
that obtained using the previous code. This is because LinearSVC allows one
to compute a so-called soft decision boundary, by setting the C argument
when creating the svc instance. One may try to change the C value and run
the code again to see what happens.
We now make the prediction again using these test samples, and then
evaluate the prediction quality in the form of a confusion matrix.
y_pred = svc.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[31, 0],
[ 0, 33]], dtype=int64)
For this example, we also had all 31 negative counts and 33 positive
counts. There was no false prediction for either class.
x = {x1 , x2 , x3 } ∈ X3 (6.44)
We can use the Pascal pyramid to include more nonlinear (NL) bases to form
features of

xNL = [1, √2 x1, √2 x2, √2 x3, x1x1, x1x2, x1x3,
       x2x1, x2x2, x2x3, x3x1, x3x2, x3x3] ∈ X13    (6.45)
Notice that the order of computation for the left is p, but that for the right
would be p2 . The saving by using a kernel is significant. The generalized
nonlinear polynomial kernels can be given as follows:
One can also use kernels without needing to explicitly
find xNL, including the radial basis functions (RBFs). For example, when the
Gaussian basis function (a typical RBF) is used, we have the Gaussian kernel:

k(x, xj) = exp(−‖x − xj‖2/(2σ2))    (6.49)
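To see the saving concretely, the following minimal sketch (toy values assumed, not from the book) verifies that the polynomial kernel k(x, xj) = (x · xj + 1)², computed with O(p) operations, equals the inner product of the explicit feature vectors of Eq. (6.45), which costs O(p²):

import numpy as np

def phi(x):
    # explicit nonlinear features of Eq. (6.45) for p = 3
    x1, x2, x3 = x
    s2 = np.sqrt(2)
    return np.array([1., s2*x1, s2*x2, s2*x3,
                     x1*x1, x1*x2, x1*x3,
                     x2*x1, x2*x2, x2*x3,
                     x3*x1, x3*x2, x3*x3])

x  = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
lhs = np.dot(phi(x), phi(xj))    # explicit feature space, O(p^2)
rhs = (np.dot(x, xj) + 1.0)**2   # kernel evaluation, O(p)
print(lhs, rhs)                  # the two values agree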
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Figure 6.18: SVM classification of data-points with curves via an Sklearn classifier.
It is seen that the Sklearn classifier SVC has also quite correctly classified
all these samples in this test dataset.
• Sepal length.
• Sepal width.
Decision surfaces are produced using SVM classifiers with four different
kernels. The linear models LinearSVC() and SVC(kernel='linear') give
slightly different decision boundaries. This is because
• LinearSVC minimizes the squared hinge loss, while SVC minimizes the
regular hinge loss.
• LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass
reduction, while SVC uses the One-vs-One multiclass reduction.
%matplotlib inline
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
The results show that both linear models have linear decision boundaries
of intersecting hyperplanes. The nonlinear kernel models (polynomial or
Gaussian RBF) produce more flexible curved decision boundaries. The
shapes of the boundaries depend on the kernel type and parameters used.
Readers are referred to Sklearn's case studies for more details (https://scikit-
learn.org/stable/auto_examples/svm/plot_rbf_parameters.html).
Chapter 7
Activation Functions and Universal Approximation Theory
φ(z) = σ(z) = 1/(1 + e−z)    (7.1)
It is clear that for any given argument of a real number z in (−∞, ∞), it
returns a positive number within (0, 1). In machine learning, the argument
is often an array, and we use element-wise operations. Hence, σ(z) is also
an array, given an array input. The sigmoid activation function σ maps,
or practically squashes, any argument value into (0, 1). Hence, it is called
a squashing function. It is also called the logistic function, because it is often used
in logistic regression, giving a result that can be read as a kind of probability.
The derivative of the sigmoid function has a simple form:

σ′(z) = σ(z)(1 − σ(z))    (7.2)
We now write the following code to define the activation function and to demon-
strate the behavior of the sigmoid function using mxnet and matplotlib.
%matplotlib inline
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc
def logistic0(z):   # also known as the sigmoid function
    return 1. / (1. + nd.exp(-z))
plt.figure(figsize=(4.0, 2.5),dpi=100)
z = nd.arange(-10, 10, .1)
y = logistic0(z)
plt.plot(z.asnumpy(),y.asnumpy())
plt.xlabel('z')
plt.ylabel(r'$\sigma(z) $')
plt.title('Sigmoid Function')
plt.grid(color='r',which='both',linestyle=':',linewidth=0.5)
plt.show()
Figure 7.1: Variation of the sigmoid function: positive, monotonic, smooth, and
differentiable.
It is clear that the sigmoid outputs are between 0 and 1, and hence
qualified for probability outputs. It has a value of 0.5 at z = 0. Therefore,
if we set its output as positive when the probability is greater than 0.5, and
negative whenever the output is less than 0.5, we have an adequate predictor
for samples with ±1 labels.
The derivative of the sigmoid function can be plotted using the following
simple code:
yg = logistic0(z)*(1.-logistic0(z))
plt.figure(figsize=(4.0, 2.5),dpi=100)
plt.plot(z.asnumpy(),yg.asnumpy())
plt.xlabel('z')
plt.ylabel(r"$\sigma \ '(z) $")
plt.title('Derivative of the sigmoid Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
Figure 7.2: Variation of the derivative of the sigmoid function: positive, smooth, and
further differentiable.
In neural network-based machine learning, the inputs to the first layer take
the direct inputs of samples or data-points. Therefore, no activation function
is needed for the input layer. For all the subsequent layers, the inputs to a
layer of neurons are an affine transformation of the output of the previous
layer, as discussed in Chapter 5. Let us consider here one neuron at an output
layer. We shall have
z =x·w+b (7.3)
where x is the input from the previous layer, w is the weight matrix that
connects the previous layer to the current layer, and b is the bias for the
neurons of the current layer. Both w and b are to be updated through
training, and hence are often referred to as training (or learning) parameters. In
this chapter, we do not study how to update these parameters, but focus on
their effects on the output of the neuron. When the sigmoid function
is applied to produce the output of the current layer, we shall have the
following general expression:

xout = 1/(1 + e−(x·w+b))    (7.4)
Let us write a Python code to examine how the outputs of the activation
function are influenced by the weights and biases. We again use mxnet
to do the number crunching and plot out the results in graphics.
We first let w vary in a range, with b fixed at 0.

def logistic(w, x, b):
    return 1. / (1. + nd.exp(-(x*w + b)))
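The sweep-and-plot code is not reproduced above; a minimal sketch (with an assumed set of weight values) that produces a figure like Fig. 7.3 is:

plt.figure(figsize=(4.5, 3.0), dpi=100)
x = nd.arange(-10, 10, .1)
b0 = 0.                             # bias fixed at 0
for w1 in [0.5, 1., 2., 5., 20.]:   # assumed sweep of weights
    y = logistic(x, w1, b0)         # x*w is symmetric, so the
    plt.plot(x.asnumpy(), y.asnumpy(),  # argument order is harmless
             label='w='+str(w1))
plt.xlabel('x')
plt.ylabel(r'$\sigma(x*w+b)$')
plt.legend()
plt.show()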
Figure 7.3 shows clearly that a larger weight makes the sigmoid curve
steeper near x = 0. We can practically make a function as close as
we want to the step function by using a very large weight.
Let us now examine the influence of the bias on the shape of the activation
function with a fixed weight.
Figure 7.4 shows clearly that the bias makes the sigmoid curve shift: A
positive bias value shifts the sigmoid curve leftwards and a negative value
shifts the curve rightwards. The bias practically controls where in space the output
should appear. If we make use of these effects of both weight and bias, we
can produce a close-to-rectangular pulse function using a pair of activation
functions. We call this pair a Neural-Pulse-Unit (NPU). It is schematically drawn in
Fig. 7.5.
The width (in the x coordinate) at the middle point of the pulse, pw, is
controlled by the difference of the biases used in the two paired neurons and
by the weight. The formula can be derived as follows.
First, the z value at the middle point of the sigmoid function should
always be zero; we shall then have for the neuron (on the top)
z = x+ · w + b = 0 (7.5)
where b is the base bias of the neuron pair, as shown in Fig. 7.5 above. For
the neuron below,
z = x− · w + b − bw = 0 (7.6)
where bw is the negative offset of the bias at the neuron below. The width
at the middle point of the pulse, pw , becomes
pw = x+ − x− = bw /w (7.7)
It depends on bw and w.
By altering the weight and bias values, one can also create such a pulse
at any x. The center location of the pulse, xc , can be calculated using
xc = (x+ + x−)/2 = −(2b − bw)/(2w)    (7.8)
It depends on b, bw , and w.
It is seen that the width and the location of the pulse depend only on
the weights and biases, which are trainable parameters. This implies that once
the neurons are trained with a set of fixed parameters, they will be capable of
producing the function value for any given input x.
w1, b0, bw = 40., 0., 20.   # example values (assumed): shared weight, base bias, bias offset
bc = 2*b0 - bw              # center value of b, see Eq. (7.8)
xc = -0.5*bc/w1             # x at the center of the pulse
pw = bw/w1                  # pulse width, see Eq. (7.7)
x = nd.arange(-8, 8, .1)
y = logistic(x,w1,b0)-logistic(x,w1,b0-bw)
plt.figure(figsize=(4.5, 3.0),dpi=100.)
plt.plot(x.asnumpy(),y.asnumpy())
plt.xlabel('x')
plt.ylabel(r"$\sigma(x*w+b0)-\sigma(x*w+b0-bw)$")
plt.title('Neural Pulse, Sigmoid, w='+str(w1)+' b='+str(b0)+' bw='+str(bw))
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
print('b0=',b0,'bc=',bc,'xc=',xc,'pw=',pw)
To construct an NPU, we set the weights of the two hidden neurons to the same
large value and let the biases shift. The outputs of these two neurons will both
be close to step functions, but with opposite orientations and at different
stepping-up locations. The final output neuron sums these two step functions
and delivers a pulse. It functions as an NPU.
Using multiple such NPUs, one can generate as many pulses located
at different xc as one desires. Below is a code for producing a line-up of
rectangular unit pulses.
x = nd.arange(-8, 8, .1)
b = nd.arange(-120, 190, 40)
w1 = 40.    # one large weight shared by all NPUs
db = 20.0   # = bw; defines the width of the pulses
plt.figure(figsize=(6.5, 3.0),dpi=100)
for bi in b:
    y = logistic(x,w1,bi)-logistic(x,w1,bi-db)
    plt.plot(x.asnumpy(),y.asnumpy(),label="b="+str(bi.asscalar()))
plt.xlabel('x')
plt.ylabel(r"$\sigma(x*w+b)-\sigma(x*w+b-bw)$")
plt.title('Subtraction of two sigmoid functions with w='+str(w1)+' bw='+str(db))
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.show()
Figure 7.7 shows eight lined-up pulses created using eight NPUs (eight
pairs of neurons in the hidden layer). Each NPU is responsible for a pulse
centered at its xc value. As long as we have a sufficient number of neuron pairs
in the hidden layer, we can create as many pulses as needed to cover the
domain of interest of x.
A given function can then be approximated as a linear combination of these pulses:

f̂(x) = Σ_{i=1}^{Nx} f(xi) · P(xi)    (7.9)

where f(x) is the given function, f̂(x) denotes the approximated function,
f(xi) is the given function value at xi, P(xi) is the unit rectangular pulse
created by the ith neuron pair for xi, and Nx is the number of points xi sampled
in the domain of x.
Let us write another code for this single-hidden-layer neural network to
get this done.
plt.figure(figsize=(6.5, 3.0),dpi=100)
x = nd.arange(-10, 10, .1)
w = nd.arange(0.0, 50, 5.0)
b = nd.arange(-400, 470, 40)
w1 = w[-1].asscalar()   # use one large weight for all NPUs
bw = 40.0               # negative shift (width) of the bias
for bi in b:
    bc = 2*bi - bw      # center value of b
    xc = -0.5*bc/w1     # x at the center of the pulse
    fcos = nd.cos(xc/2) # value of the cosine function at xc
    pulse = (logistic(x,w1,bi)-logistic(x,w1,bi-bw))
    y = fcos * pulse    # pulse scaled by the nodal value
    plt.plot(x.asnumpy(),y.asnumpy())
xi = -0.5*(2*b - bw)/w1
plt.plot(xi.asnumpy(),np.cos(xi.asnumpy()/2),c='b',linewidth=2.5)
plt.xlabel('x')
plt.ylabel("cos(x/2)")
plt.title('Approximating cos(x/2) using neuron pulses, w='+str(w1)+' bw='+str(bw))
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
The approximation can also be written in the nodal form

f̂(x) = Σ_{i=1}^{Nx} f(xi) · P_i(x)    (7.10)

where f(xi) is the nodal value of the function at xi, and P_i(x) is now a
continuous, pulse-like neuron basis function. In this case, the approximated
function f̂(x) becomes a continuous function, and each neuron pair acts as
a basis function.
Using a series of NPUs, this universal approximation theorem (UAT) can
be illustrated schematically as in Fig. 7.9. A UAT-Net can be constructed with
one input layer, one hidden layer, and one output layer. If the hidden layer
has a sufficient number of neuron pairs, each pair can form one NPU,
which is responsible for producing one pulse function at a location. When all
these weights at the input and output layers and all these biases at the hidden-layer
neurons are set (or trained) properly, the UAT-Net is capable of
approximating any continuous function for any input value of x.

Figure 7.9: Schematic description of the Universal Approximation Theorem. This UAT-Net
consists of one hidden layer with a sufficient number of neuron pairs. Each pair forms
one NPU responsible for producing one pulse at a location. When all the weights at the
input and output layers and all the biases at the hidden-layer neurons are set (trained)
properly, any continuous function can be approximated for any input value of x.
Let us plot this function to see what it looks like.
plt.figure(figsize=(6.5, 3.0),dpi=200)
x = nd.arange(-10, 10, .05)   # we refined the points
w = nd.arange(0.0, 50., 5.0)
b = nd.arange(-400, 470, 40)
# yfem, w1, and bw are assumed computed by the preceding NPU loop
plt.plot(x.asnumpy(),yfem,c='r')   # plot the approximated function
xi = -0.5*(2*b - bw)/w1
plt.plot(xi.asnumpy(),np.cos(xi.asnumpy()/2),c='b',linewidth=0.5)
plt.xlabel('x')
plt.ylabel("cos(x/2)")
plt.title('Approximating cos(x/2) using neuron pulses, w='+str(w1)+' bw='+str(bw))
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
Figure 7.10: A cosine function is approximated using a series of neuron basis functions
that are pulse-like.
plt.figure(figsize=(6.5, 3.0),dpi=100)
x = nd.arange(-10, 10, .05)   # refined points
w = nd.arange(0.0, 10., 5.0)
b = nd.arange(-400, 470, 5)   # densely spaced biases: narrow pulses
w1 = w[-1].asscalar()         # one moderate weight shared by all NPUs
bw = b[1]-b[0]                # bias offset; defines the pulse width
sx = int(x.size)
yfem = np.zeros(sx)           # accumulated approximation
pu = np.zeros(sx)             # accumulated pulses, to check the partition of unity (PU)
for bi in b:
    bc = 2*bi - bw            # center value of b
    xc = -0.5*bc/w1           # x at the center of the pulse
    fcos = nd.cos(xc/2)       # nodal value of the cosine function
    pulse = (logistic(x,w1,bi)-logistic(x,w1,bi-bw))
    y = fcos * pulse          # pulse scaled by the nodal value
    plt.plot(x.asnumpy(),y.asnumpy(),label="b="+str(bi.asscalar()))
    yfem += y.asnumpy()
    pu += pulse.asnumpy()     # accumulate pulses for the PU check
plt.plot(x.asnumpy(),yfem,c='r')
plt.plot(x.asnumpy(),pu,c='b')
print(pu[:60:10])
plt.xlabel('x')
plt.ylabel("cos(x/2)")
plt.title('Approximating cos(x/2) using neuron pulses, w='+str(w1)+' bw='+str(bw.asscalar()))
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
Figure 7.11: A cosine function is approximated using a series of neuron basis functions
as “nodal shape functions”.
Figure 7.11 shows that the approximated function (the red line) is
smoother and hence of better quality in approximating the cosine function.
This is because the neural pulse is used as a nodal shape function that is
smooth, rather than a sharp rectangular pulse.
Note that the partition of unity (PU) property/condition is one of the
most important properties of nodal shape functions used in a numerical
method for a mechanics problem [3]. The PU condition states that

Σ_{i=1}^{Nx} P_i(x) = 1    (7.11)

This needs to be satisfied at any x. In the code above, we added one more
computation to check whether this property is satisfied if we were to use
the neuron pair as a nodal shape function. As shown in Fig. 7.11, the PU
condition is indeed naturally satisfied by these neuron basis functions.
If we further reduce the pulse width, we shall obtain a very accurate
approximation of the function, as shown in Fig. 7.12.
Figure 7.12: Neuron Basis: A pair of neurons can be used as nodal basis functions for
function approximation. In this case, one does not have to use very large weight. The
resolution can be controlled by the width of the bias used in the neuron pairs.
In the study above, we used the sigmoid activation function. Many other
activation functions to be discussed in the following sections may also be
used. In Ref. [2], triangular pulses are constructed and used in addition to
the sigmoid pulses to approximate functions, and they worked better in some
ways.
7.4.3 Remarks
Note that Eq. (7.9) uses NPUs to support the Universal Approximation Theory,
which provides a nice intuition for understanding. However, Eq. (7.10)
is more general, because the neuron basis function P_i(x) used in it does not
have to be a sharp rectangular pulse.

We now examine the hyperbolic tangent (tanh) activation function, defined via nd.tanh:
def tanhf(z):
return nd.tanh(z)
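The commands plotting Fig. 7.13 are not included in the excerpt; a minimal sketch (figure parameters assumed) is:

x = nd.arange(-10, 10, .1)
plt.figure(figsize=(4.5, 2.8), dpi=100)
plt.plot(x.asnumpy(), tanhf(x).asnumpy(), label='tanh(z)')
plt.plot(x.asnumpy(), (1. - tanhf(x)**2).asnumpy(), label="tanh'(z)")   # analytical derivative
plt.xlabel('z')
plt.legend()
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()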
Figure 7.13: Hyperbolic tangent (tanh) activation function (monotonic, smooth, and
differentiable) and its derivative (positive, smooth, and further differentiable).
l0 = int(len(x)/2)
print('x=', x[l0].asscalar(), 'tanh(0)=', tanhf(x[l0]).asscalar())
Relu stands for "rectified linear unit" [5]. It is widely used in deep neural
nets in applications such as computer vision and speech recognition. It is the
most popular activation function for training deep neural networks, due to
its effectiveness in mitigating the so-called vanishing gradient issue. It has
the simple form of
φ(z) = z⁺ = max(0, z) = { 0, z < 0;  z, z ≥ 0 }    (7.14)
The Relu function is also known as a ramp function, analogous to half-wave
rectification in electrical engineering. It simply wipes out the effects
of the inputs that are negative, but keeps the non-negative inputs as they
are, resulting in nonlinearity. Such a drastic discount of the inputs might be
one of the reasons why it works well for densely connected deepnets.
Relu function is piecewise differentiable. The derivative of the Relu
function also has a very simple form of
φ′(z) = { 0, z < 0;  1, z ≥ 0 }    (7.15)
The unit gradient is helpful in mitigating the gradient vanishing in a deepnet,
because it remains unchanged with the depth of the net.
Relu has a number of variations, including leaky Relu and parametric
Relu. A general formulation can be given as follows:
φ(z) = { kn·z, z < 0;  kp·z, z ≥ 0 }    (7.16)
where kn and kp are positive constants: kn is the slope of the negative
portion and kp is the slope of the positive portion. For the leaky Relu, one
often uses kn = 0.01 and kp = 1.0. In the parametric Relu, these parameters
are made trainable.
The derivative of the generalized Relu function has the form of
φ′(z) = { kn, z < 0;  kp, z ≥ 0 }    (7.17)
It is seen that for a deep net, the effect of the negative part can vanish quickly
with depth because of the small value of kn. In such cases, the leaky Relu may
behave like the Relu.
We now write a simple code to demonstrate the generalized Relu function
and its derivatives, with tunable parameters kn and kp .
def reluf(z,kn,kp):
f = [kn*zi if zi < 0.0 else kp*zi for zi in z]
return f
def relug(z,kn,kp):
g = [kn if zi < 0.0 else kp for zi in z]
return g
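The plotting code behind Fig. 7.14 is not shown; a sketch using the two functions above (parameter values assumed):

z = np.arange(-10., 10., .1)
plt.figure(figsize=(4.5, 2.8), dpi=100)
plt.plot(z, reluf(z, 0.01, 1.0), label='leaky Relu, kn=0.01, kp=1.0')
plt.plot(z, relug(z, 0.01, 1.0), label='its derivative')
plt.xlabel('z')
plt.legend()
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()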
Figure 7.14: Relu activation function (monotonic, continuous, piecewise smooth, and
differentiable), and its derivative (positive, discontinuous at the origin).
Note that the Relu and leaky Relu functions are not smooth: they are only
piecewise differentiable, and not differentiable at the origin. The gradient
jumps at the origin, as shown in Fig. 7.14. One would need to use the so-called
sub-gradient in the optimization process (to be discussed in Chapter 9). The
sub-gradient can simply be the average of the gradients on both sides of the origin.
The non-smoothness of the Relu function at the origin is similar to what occurs in
domain discretization via elements in the FEM [4], where the approximated
function is continuous but not smooth on the elemental interfaces. The
treatment of the non-smoothness in the FEM uses the so-called weak form
formulation.
The softplus function has a similar shape to the Relu, but is differentiable in
the entire 1D space. It is defined as

φ(z) = log(1 + e^z)    (7.18)

It is clear that for any given argument of a real number z in (−∞, ∞), it
returns a positive number within (0, ∞).
The derivative of the softplus function is
φ′(z) = σ(z) = 1/(1 + e^{−z})    (7.19)
It is the sigmoid function (and hence it is also smooth and differentiable).
We now plot the function and its derivative using matplotlib.
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc
%matplotlib inline
def softplus(z):
    return np.log(1. + np.exp(z))
def softplusg(z):   # derivative of softplus: the sigmoid function
    return 1./(1. + np.exp(-z))
x = np.arange(-10., 10., .1)
plt.plot(x,softplus(x))
plt.xlabel('x')
plt.ylabel('Softplus')
plt.title('Softplus Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
plt.plot(x,softplusg(x))
plt.xlabel('x')
plt.ylabel('Derivative of Softplus function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.title('Derivative')
plt.show()
Figure 7.15: The softplus activation function (monotonic, continuous, smooth, and
differentiable), and its derivative (positive, smooth, and further differentiable).
Based on the conditions given in the previous section, we can now devise
some new activation functions. We present these functions below. Note that
these functions have not yet been tested in machine learning models. Interested
readers may give them a try.
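The definitions of raf and d_raf used below are not included in the excerpt. A plausible rational activation confined to (−1, 1) is the generalized softsign sketched here; this is an assumption, not necessarily the author's exact form:

import numpy as np
import matplotlib.pyplot as plt

def raf(z, alfa):
    # assumed rational activation: generalized softsign, confined to (-1, 1)
    return z / (1. + np.abs(z)**alfa)**(1./alfa)

def d_raf(z, alfa):
    # derivative of the assumed form; positive everywhere
    return (1. + np.abs(z)**alfa)**(-(1. + 1./alfa))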
alfa = 2.0
x = np.arange(-10., 10., .1) #x -> z
y = raf(x,alfa)
plt.plot(x,y)
plt.xlabel('z')
plt.ylabel('Function value')
plt.title('Rational activation function, in (-1, 1)')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.savefig('raf.png',dpi=500,bbox_inches='tight')
plt.show()
plt.plot(x,d_raf(x,alfa))
plt.xlabel('z')
plt.ylabel('Derivative of the function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.show()
Figure 7.16: The rational activation function (monotonic, continuous, confined in (−1, 1),
smooth, and differentiable), and its derivative (positive, piecewise smooth, and further
differentiable).
alfa = 2.0
Figure 7.17: The rational activation function (monotonic, continuous, confined in (0, 1),
smooth, and differentiable), and its derivative (positive, piecewise smooth, and further
differentiable).
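The definition of powerf is not included in the excerpt; judging from the derivative in Eq. (7.27) and the powerf1 code below, it is presumably the sign-preserving power function (a reconstruction):

def powerf(z, alpha):
    # sign-preserving power: -(-z)^alpha for z < 0, z^alpha for z >= 0
    f = [-(-zi)**alpha if zi < 0.0 else zi**alpha for zi in z]
    return f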
x0 = np.array([0.0,2.0])
x = np.arange(-2, 2, 0.11)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
nx = 1
for ai in alfa:
y = powerf(x,ai)
plt.plot(x,y,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi(z) $")
plt.title('Power Activation Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.savefig('paf.png',dpi=500,bbox_inches='tight')
plt.show()
The above curves clearly show that when α → 0, the function behaves
similarly to the tanh function. Therefore, we may choose a small α for the first
hidden layer, and then let it grow as the net approaches the last layer. At the
last layer we use α = 1, a linear function, which is exactly what we want.
The derivative of the power function is as follows:

φ′(z) = { α(−z)^{α−1}, z < 0;  αz^{α−1}, z ≥ 0 }    (7.27)
We write the following code to examine the derivative of the power
function.
def powerg(z,alpha):
a1 = alpha-1.
g = [alpha*(-zi)**a1 if zi < 0.0 else alpha*zi**a1 for zi in z]
return g
x0 = np.array([0.0,2.0])
x = np.arange(-2, 2, 0.11)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
for ai in alfa:
yg = powerg(x,ai)
plt.plot(x,yg,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi \ '(z) $")
plt.title('Derivative of the Power Activation Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.show()
Figure 7.19: The derivative of the power activation function. It is positive, continuous,
piecewise smooth, and further differentiable.
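For α < 1, the derivative of the power function grows without bound as z approaches the origin. A simple remedy, implemented in the code below, is to replace the power segments near the origin with a linear piece over [−xm, xm], which gives the power-linear function.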
def powerf1(z,alpha,xm):
fm = xm**alpha
km = fm / xm # xm^(alpha-1)
f = [-(-zi)**alpha if (zi<-xm) else km*zi if (-xm<=zi<=xm)\
else zi**alpha for zi in z]
return f
x0 = np.array([0.0,2.0])
x = np.arange(-1, 1, 0.01)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
xm = 0.1
for ai in alfa:
y = powerf1(x,ai,xm)
plt.plot(x,y,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi(z) $")
plt.title('Power-Linear Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.savefig('plaf.png',dpi=500,bbox_inches='tight')
plt.show()
def powerg1(z,alpha,xm):
fm = xm**alpha
km = fm / xm
a1 = alpha-1.
g = [alpha*(-zi)**a1 if zi<-xm else km if (-xm<=zi<=xm)\
else alpha*zi**a1 for zi in z]
return g
x0 = np.array([0.0,2.0])
x = np.arange(-1, 1, 0.01)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
xm = 0.1
for ai in alfa:
yg = powerg1(x,ai,xm)
plt.plot(x,yg,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi \' (z) $")
plt.title('Derivative of the Power-Linear Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.savefig('d_plaf.png',dpi=500,bbox_inches='tight')
plt.show()
Yes! The derivative is capped. We also observe that the smaller the α, the
larger the derivative value. This may be useful in mitigating the vanishing
derivative problem in training deep nets. The vanishing derivative occurs
when back-propagation is used to update the weights and biases using a
gradient descent technique (to be discussed in great detail in Chapter 9):
the derivatives get smaller and smaller due to the multiplication of the
derivatives from the chain rule of differentiation (as shown in Chapter 8).
With the power-linear function, one may get some help in reducing the
vanishing rate, if it is placed properly in these layers.
We also see that the derivative has a plateau in the linear range. This
can be improved further by the power-quadratic function.
Following the notation used for the power-linear function, the slopes at the
matching points ±zm are

km = zm^{α−1}    (7.31)

kp = α·zm^{α−1}    (7.32)
We now assume that the derivative varies linearly in [−zm, zm], which gives us

φ′(z) = { −((kp − km)/zm)·z + km, −zm ≤ z < 0;  ((kp − km)/zm)·z + km, 0 ≤ z < zm }    (7.33)

We finally integrate the above function and, using the condition that it has
to be zero at z = 0, we obtain

φ(z) = { −((kp − km)/(2zm))·z² + km·z, −zm ≤ z < 0;  ((kp − km)/(2zm))·z² + km·z, 0 ≤ z < zm }    (7.34)
Adding the other two pieces on both sides, we now have the full definition
of the power-quadratic function:

φ(z) = { −(−z)^α,                          z < −zm;
         −((kp − km)/(2zm))·z² + km·z,     −zm ≤ z < 0;
         ((kp − km)/(2zm))·z² + km·z,      0 ≤ z < zm;
         z^α,                              z ≥ zm }    (7.35)
#Power-quadratic(2) function
def apowerf2(z,alpha,xm):
a1 = alpha-1.
fm = xm**alpha
km = fm/xm
kp= alpha*xm**a1
kg = (kp-km)/xm
f = [-(-zi)**alpha if (zi<-xm) else km*zi-0.5*kg*zi**2\
if (-xm<=zi<0.) else km*zi+0.5*kg*zi**2 if (0.<=zi<=xm)\
else zi**alpha for zi in z]
return f
x0 = np.array([0.0,2.0])
x = np.arange(-2, 2, 0.02)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
xm = 0.1
for ai in alfa:
y = apowerf2(x,ai,xm)
plt.plot(x,y,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi(z) $")
plt.title('Power-Quadratic Activation Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.savefig('pqaf.png',dpi=500,bbox_inches='tight')
plt.show()
def apowerg2(z,alpha,xm):
a1 = alpha-1.
fm = xm**alpha
km = fm/xm
kp= alpha*xm**a1
kg = (kp-km)/xm
g = [alpha*(-zi)**a1 if zi<-xm else -kg*zi+km\
if (-xm<=zi < 0.) else kg*zi+km if (0.<=zi <xm)\
else alpha*zi**a1 for zi in z]
return g
#x0 = np.array([0.0,2.0])
x = np.arange(-1, 1, 0.002)
#alfa = np.arange(0.2, 1.2, .2)
alfa = 1.0/np.arange(1, 9, 1.0)
xm = 0.1
for ai in alfa:
yg = apowerg2(x,ai,xm)
plt.plot(x,yg,label = r"$\alpha$="+"{:.2f}".format(ai))
plt.xlabel('z')
plt.ylabel("$\phi \ '(z) $")
plt.title('Derivative Power-Quadratic Activation Function')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.savefig('d_pqaf.png',dpi=500,bbox_inches='tight')
plt.show()
The plateau in the linear range is now removed, replaced by a nicely linearly
varying derivative. We still observe that the smaller the α, the larger the
derivative value. This may be useful in mitigating the vanishing derivative
problem in training deepnets. Thorough tests are required for this proposed
set of power functions in their application to practical problems.
Chapter 8
Automatic Differentiation and Autograd
The analytical method explicitly gives the formulas for the derivatives of
functions. Table 8.1 lists some scalar functions often used in ML, f(x), with
the variable x also being a scalar, together with their derivatives df(x)/dx.
These formulas can be coded directly for computing the derivatives, and symbolic
computational tools such as SymPy can be used to obtain the expressions of
the derivatives. However, such formulas may not be available for all functions.
Also, not all functions can be explicitly written in such a simple closed form.
Table 8.1: Analytical formulas for derivatives of simple functions (c, n are constants).

f(x):    c  |  cx^n       |  e^(cx)   |  log(cx)  |  1/(1 + e^(−cx))            |  (e^(cx) − e^(−cx))/(e^(cx) + e^(−cx))
df/dx:   0  |  ncx^(n−1)  |  ce^(cx)  |  1/x      |  ce^(−cx)/(1 + e^(−cx))^2   |  c[1 − ((e^(cx) − e^(−cx))/(e^(cx) + e^(−cx)))^2]
The loss function for machine learning is usually in the form of a scalar function.
Even if it is not, one can always make it a scalar function by, for example,
measuring its norm in some way. However, its variables can often be vectors
or matrices. Consider a scalar function f(x) ∈ R¹ defined in a P-dimensional
space; the variable vector x will be in R^P. The gradient of the function, ∇f(x),
is a vector and will also be in R^P:
∇f(x) = ∂f(x)/∂x = [∂f(x)/∂x₁, ∂f(x)/∂x₂, …, ∂f(x)/∂x_P]⊤    (8.1)
Table 8.2: Analytical formulas for the gradients of some useful scalar functions.

Constant:                  f(x) = c                          ∇f(x) = 0
Linear:                    f(x) = x⊤c = c⊤x                  ∇f(x) = c
General exponential:       f(x) = e^(x⊤c+b)                  ∇f(x) = c e^(x⊤c+b)
General logarithm:         f(x) = log(x⊤c + b)               ∇f(x) = c/(x⊤c + b)
Sigmoid:                   f(x) = 1/(1 + e^(−(x⊤c+b)))       ∇f(x) = c e^(−(x⊤c+b))/(1 + e^(−(x⊤c+b)))^2
tanh:                      f(x) = tanh(x⊤c + b)              ∇f(x) = c[1 − tanh^2(x⊤c + b)]
Summation:                 f(x) = sum(x)                     ∇f(x) = 1 (a vector of ones)
Norm (quadratic in x):     f(x) = ‖x‖^2 = x⊤x                ∇f(x) = 2x
C-scalar (quadratic in x): f(x) = x⊤Cx                       ∇f(x) = 2Cx (for symmetric C)
X-scalar (linear in X):    f(X) = c⊤Xc                       ∇_X f(X) = cc⊤
Table 8.1 is obvious, and writing it down needs only college calculus.
However, writing Table 8.2 requires careful derivation. The best way may be
to use tensor calculus: first convert the matrix notation to (Ricci)
index notation, perform the differentiations, and then convert back
to matrix form. Using tensor calculus, one can further derive formulas
for vector or even matrix functions with tensor variables. Here, we will not
discuss this in detail.
#import numpy as np
import autograd.numpy as np # This is a thinly-wrapped numpy
from autograd import grad # for computing grad(f)
import matplotlib.pyplot as plt
%matplotlib inline
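The cell defining the functions used below is not shown; a sketch consistent with the calls (names reconstructed, an assumption) is:

def f_sigma(z):                  # the sigmoid function
    return 1. / (1. + np.exp(-z))
def dfsa(z):                     # analytical first derivative
    return f_sigma(z) * (1. - f_sigma(z))
df1 = grad(f_sigma)              # first derivative via autograd
df2 = grad(df1)                  # second derivative
df3 = grad(df2)                  # third derivative
df4 = grad(df3)                  # fourth derivative
z = np.arange(-10., 10., 0.1)
fi = list(map(f_sigma, z))       # function values
df1i = list(map(df1, z))         # first-derivative values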
df1ia = grad(f_sigma)(z[0])
df2i = list(map(df2,z)) # compute df2
plt.figure(figsize=(4.5, 3.0),dpi=100)
plt.scatter(z, df1i,c='green',label='Autograd',alpha=0.2)
plt.plot(z, dfsa(z),c='red',label='Analytical')
plt.xlabel('z')
plt.ylabel(r"$\sigma'(z)$")
plt.title('Derivative of Sigmoid: analytic vs. autograd')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.show()
plt.figure(figsize=(4.5, 3.0),dpi=100)
plt.plot(z, fi,label='Function')
plt.plot(z, df1i,label='1st derivative')
plt.plot(z, df2i,label='2nd derivative')
plt.plot(z, list(map(df3,z)),label='3rd derivative')
plt.plot(z, list(map(df4,z)),label='4th derivative')
plt.xlabel('z')
plt.ylabel(r'$\sigma(z) $'+' and its derivatives')
plt.title('Derivatives of sigmoid function via autograd')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.show()
In training neural networks, back-propagation propagates backwards the gradients
of the loss function with respect to the variables. The following discussion and
codes are written with reference to those at mxnet-the-straight-dope
(https://github.com/zackchase/mxnet-the-straight-dope), under the Apache-2.0 License:
import numpy as np
import mxnet as mx
from mxnet import nd, autograd
mx.random.seed(1)
df(x)/dx = 16x  for x ∈ R¹    (8.4)
x=nd.array([[1,5,10],[20,18,28]]) # initialization of x
# in a matrix (tensor)
At all these values of x in the above matrix, we can easily compute the
gradient using the analytical formula given in Eq. (8.4).
print('x=',x)
grad_at_x = 16.0*x # analytical gradient
print('Analytical grad_at_x=', grad_at_x)
x=
[[ 1. 5. 10.]
[20. 18. 28.]]
<NDArray 2x3 @cpu(0)>
Analytical grad at x=
[[ 16. 80. 160.]
[320. 288. 448.]]
<NDArray 2x3 @cpu(0)>
This defines a path for carrying out the differentiation and the related
evaluations. This path of differentiation is also called a graph; the definition
of the chain or path builds a computation graph. The differentiation is then
done using the well-known chain rule. MXNet builds such a graph on the fly,
using a kind of recording "device" that records the path.
Third, with the path defined, we can finally invoke the backward computation:
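The recording and backward calls were presumably of the following form (a sketch, assuming f(x) = 8 Σ x², which is consistent with Eq. (8.4)):

x.attach_grad()              # attach gradient storage to x
with autograd.record():      # record the computation graph
    f = (8 * x * x).sum()    # assumed form of f; its gradient is 16x
f.backward()                 # back-propagate to obtain x.grad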
print('At x=',x)
print('The gradient values of f at x=', x.grad) # autograd
print('Analytical grad_at_x=', grad_at_x) # analytical
At x=
[[ 1. 5. 10.]
[20. 18. 28.]]
<NDArray 2x3 @cpu(0)>
The gradient values of f at x=
[[ 16. 80. 160.]
[320. 288. 448.]]
<NDArray 2x3 @cpu(0)>
Analytical grad at x=
[[ 16. 80. 160.]
[320. 288. 448.]]
<NDArray 2x3 @cpu(0)>
We note again that the shapes of x and x.grad are the same: x stores all these
x values, and x.grad stores the corresponding gradients at these x values. We
can see that the gradient values obtained at x using autograd are exactly the
same as those we got using the analytical formula.
Let us now consider the gradients of scalar functions with multiple variables
(multi-dimensional problems). We compute the gradients of some scalar
functions listed in Table 8.2. The first is the linear function defined in
p-dimensional space.
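The code producing the output below is not shown; a minimal sketch consistent with it (assuming the linear function f(x) = x⊤c) is:

c = nd.array([0.1, 0.8, 0.18])
x = nd.array([1., 2., 3.])
x.attach_grad()
with autograd.record():
    f = nd.dot(x, c)        # linear function; gradient is c
f.backward()
print('At x=', x, 'Constants c =', c)
print('The gradient values of f at x=', x.grad)
print('The analytical gradient values of f at x=', c)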
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)> Constants c =
[0.1 0.8 0.18]
<NDArray 3 @cpu(0)>
The gradient values of f at x=
[0.1 0.8 0.18]
<NDArray 3 @cpu(0)>
The analytical gradient values of f at x=
[0.1 0.8 0.18]
<NDArray 3 @cpu(0)>
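Similarly, for the sigmoid function (code not shown; a sketch assuming f = σ(x⊤c), with the same x and c as above):

def sigmoid_nd(z):
    return 1. / (1. + nd.exp(-z))
x.attach_grad()
with autograd.record():
    f = sigmoid_nd(nd.dot(x, c))
f.backward()
s = sigmoid_nd(nd.dot(x, c))
print('The gradient values of f at x=', x.grad)
print('The analytical gradient values of f at x=', c * s * (1. - s))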
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)> Constants c =
[0.1 0.8 0.18]
<NDArray 3 @cpu(0)>
The gradient values of f at x=
[0.00869581 0.06956648 0.01565246]
<NDArray 3 @cpu(0)>
The analytical gradient values of f at x=
[0.00869581 0.0695665 0.01565246]
<NDArray 3 @cpu(0)>
Finally, consider the general form of the tanh function given in Table 8.2.

def tanh_nd(z):   # the tanh function, applied to x·c below
    return nd.tanh(z)
def dtanh_nd(z):
    return (1.-tanh_nd(z)**2) # analytical derivative
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)> Constants c =
[0.1 0.8 0.18]
<NDArray 3 @cpu(0)>
The gradient values of f at x=
[0.00443234 0.03545868 0.0079782 ]
<NDArray 3 @cpu(0)>
The analytical gradient values of f at x=
[0.00443233 0.03545866 0.0079782 ]
<NDArray 3 @cpu(0)>
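Here x1d is presumably a 1-D NDArray with gradient attached, e.g. (an assumption based on the printed values):

x1d = nd.arange(4)      # [0. 1. 2. 3.]
x1d.attach_grad()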
with autograd.record():
f = x1d.sum() # summation function.
[1. 1. 1. 1.]
<NDArray 4 @cpu(0)>
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
with autograd.record(): # define computation path
f = 8 * nd.dot(x1d, x1d)
#f = (x1d.norm())**2 # alternative formula
All correct.
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)> Constants c =
[[1.1 0.8 0.18]
[0.8 2.8 0.18]
[0.18 0.18 3.8 ]]
<NDArray 3x3 @cpu(0)>
The gradient values of f at x=
[ 6.48 13.88 23.88]
<NDArray 3 @cpu(0)>
The analytical gradient values of f at x=
[ 6.48 13.88 23.88]
<NDArray 3 @cpu(0)>
At x=
[[1.1 0.8 0.18]
[0.8 2.8 0.18]
[0.18 0.18 3.8 ]]
<NDArray 3x3 @cpu(0)> Constants c =
[1. 2. 3.]
<NDArray 3 @cpu(0)>
Gradient values of f at x=
[[1. 2. 3.]
[2. 4. 6.]
[3. 6. 9.]]
<NDArray 3x3 @cpu(0)>
(d/dx) f(g(x)) = (df(g)/dg) · (dg(x)/dx)    (8.10)
The function can be defined in more than one layer, which forms a chain of
functions, as follows:
g =x∗2 (8.11)
f =g∗x (8.12)
h=f (8.13)
By the chain rule, dh/dx = (dh/df) · (df/dx). The first part, dh/df, is called the
head gradient, a term often used in machine learning. For the above simple
example, we have

dh/df = 1    (8.15)
If we know the head gradient of the function, the gradient of the function
can be computed using f.backward(), by feeding the head gradient as an
input argument:
with autograd.record():
g = x * 2
f = g * x
h = f
head_gradient=nd.ones_like(g)
# default. hence is not necessary.
h.backward(head_gradient)
# this allows adding in a constant matrix to the head.
print('At x=',x, '\n The gradient values of f at x=', x.grad)
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)>
The gradient values of f at x=
[ 4. 8. 12.]
<NDArray 3 @cpu(0)>
with autograd.record():
g = 2* x ** 2
h = g
head_gradient=nd.ones_like(g)
h.backward(head_gradient)
print('At x=',x, '\n The gradient values of f at x=', x.grad)
At x=
[1. 2. 3.]
<NDArray 3 @cpu(0)>
The gradient values of f at x=
[ 4. 8. 12.]
<NDArray 3 @cpu(0)>
The above example is simple, but it demonstrates a very powerful tool for
computing the exact gradients of the complicated loss/objective functions one
may encounter in real-life applications when training a neural network. As long as
we can structure the loss function as a chain of sub-functions, autograd
can be used. All one needs to do is tell the code which parameters
(variables) need gradients "attached" to them, and then "record"
the chain relationship of the sub-functions. The code will compute the
gradients in a back-propagation process upon request, regardless of how
many layers there may be.
a = nd.random_normal(shape=6)
a
[ 0.03629 -0.49024 -0.9501 0.03751 -0.7298 -2.0401056 ]
<NDArray 6 @cpu(0)>
a.attach_grad()
with autograd.record():
    b = a * 88 + 5.0
b.backward()   # back-propagate; each entry of a.grad becomes 88
print(a.grad)
Note that there are operations and functions that may not be supported
by autograd, such as in-place operations, depending on the module used.
Readers are referred to the online document (https://github.com/HIPS/
autograd/blob/master/docs/tutorial.md) and the links therein for more
details when encountering strange behavior.
Let us consider the simplest network with only a single neuron, as shown in
Fig. 8.2.
Figure 8.2: A 1-1-1 neuron network: One input layer, one hidden layer, and one output
layer, each with a single neuron.
We first use hand calculation to get all the values at these local nodes
and stages in the feed-forward process, together with all the gradients at
the corresponding locations in the back-propagation process. The results are
shown in Fig. 8.3.
Figure 8.3: Feed-forward and Back-propagation for the simplest net with only a single
neuron. The net is split into five local nodes (red elliptic circles), as shown in the middle
of the figure. Feed-forward process: The input value of x, current values of the learning
parameters w and b, as well as the label y value are all arbitrarily set for the calculation.
All these intermediate values calculated using these values in the feed-forward process are
given above the arrows leading to these nodes. Back-propagation: All the derivatives at
the local nodes are computed backwards and the values are given in the bottom of the
figure. The final gradient of the L2 function with respect to w is calculated. The same can
be done with respect to b, and the result should be the same for this simple case.
In this case, we use the least squared error as the loss function (known
as the L2 loss function), with φ(z) being the activation function:

L = (ŷ − y)²,  ŷ = φ(z),  z = x·w + b    (8.16)

Its gradients with respect to the learning parameters are

∂L/∂w = 2(ŷ − y) · φ′(z) · x    (8.17)

and

∂L/∂b = 2(ŷ − y) · φ′(z)    (8.18)
Let us write a code to compute all of these, using the above analytic formulae
and autograd.
import mxnet as mx
from mxnet import nd, autograd, gluon
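The cell defining these values and the analytic helpers is not shown; the sketch below is consistent with the printed values (the label y = 2 is inferred from the analytic lossdb printed further down):

def logistic(z):
    return 1. / (1. + nd.exp(-z))
def loss2(x, w, b, y):                     # L2 loss
    return (logistic(nd.dot(x, w) + b) - y)**2
def lossdw(x, w, b, y):                    # analytic dL/dw, Eq. (8.17)
    s = logistic(nd.dot(x, w) + b)
    return 2.*(s - y) * s*(1. - s) * x
def lossdb(x, w, b, y):                    # analytic dL/db, Eq. (8.18)
    s = logistic(nd.dot(x, w) + b)
    return 2.*(s - y) * s*(1. - s)
w1, b = nd.array([2.]), nd.array([3.])
x, y = nd.array([1.]), nd.array([2.])      # y assumed
w1.attach_grad()
b.attach_grad()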
w1=
[2.]
<NDArray 1 @cpu(0)>
b=
[3.]
<NDArray 1 @cpu(0)>
x=
[1.]
<NDArray 1 @cpu(0)>
with autograd.record():
M = nd.dot(x, w1)
x2 = M + b
y_hat = logistic(x2)
s = y_hat - y
l2 = s*s
head_gradient=nd.ones_like(l2)
l2.backward(head_gradient)
# print feed-forward results of the net at all local nodes
print('x=',x[0].asscalar(),'w=',w1[0].asscalar(),'M=',M[0].asscalar(),
      'x2=',x2[0].asscalar(),'y_hat=',y_hat[0].asscalar(),
      's=',s[0].asscalar(),'l2=',l2[0].asscalar())
# print the gradients from the back-propagation
print('Autograd, w1.grad',w1.grad,' shape',np.shape(w1.grad))
print('Autograd, b.grad',b.grad)
#print the analytic results.
print('Analytic loss2=',loss2(x,w1,b,y))
print('Analytic lossdw=',lossdw(x,w1,b,y))
print('Analytic lossdb=',lossdb(x,w1,b,y))
Analytic lossdb=
[-0.01338505]
<NDArray 1 @cpu(0)>
We now check the differences between analytical, autograd, and numerical
differentiation. We use the sigmoid function to do so, because it is one
of the most often used functions in machine learning.
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline
dgag=
[-3.6379788e-12 2.9103830e-11 0.0000000e+00 -1.8626451e-09
0.0000000e+00 0.0000000e+00 5.2154064e-08 2.7939677e-08
-4.3539330e-08 -1.4406396e-08]
<NDArray 10 @cpu(0)>
dgng=
[-3.2600899e-05 -5.8590609e-05 -4.2439066e-04 -2.7038064e-03
-5.7642013e-03 1.8941417e-02 -5.7641417e-03 -2.7037859e-03
-4.2444887e-04 -5.8584614e-05]
<NDArray 10 @cpu(0)>
Figure 8.4: Derivative of the sigmoid function computed using different methods, and the
error in the results.
x=
[0.]
<NDArray 1 @cpu(0)> sigmoid(0)=
[0.5]
<NDArray 1 @cpu(0)>
The results plotted in Fig. 8.4 show clearly that the autograd gives the
exact solution (the green line above is zero). The numerical gradient gives
only an approximated solution (the orange line is not zero).
8.10 Discussion
References
[1] G.R. Liu and T.T. Nguyen, Smoothed Finite Element Methods. Taylor and Francis
Group, New York, 2010.
[2] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method. Taylor and
Francis Group, New York, 2010.
Chapter 9
Solution Existence Theory and Optimization Techniques
9.1 Introduction
For optimization problems with very simple objective functions, one may
be able to find analytical solutions. For example, for an objective function
f (W ) = (W − a)2 with a given fixed real number a, one can immediately find
the global and unique minimum at W ∗ = a. This function is positive and
quadratic in W , and hence has one and only one solution in R1 . This simple
problem implies that finding the minimum is straightforward, if the function
is quadratic.
and

ŵ = [W₀, W₁, W₂, …, W_p]⊤ ∈ W^{p+1}    (9.6)
Note that Eq. (9.3) requires ŵ to be a column vector. This allows us to
obtain a conventional form of the normal equation later. Our task is to find
a vector ŵ that gives a prediction ŷ closest to the labels in the given m data
pairs,

y = [y₁, y₂, …, y_m]    (9.7)

with

m ≥ (p + 1)    (9.8)
Notice that Eq. (9.9) mimics the ideal function discussed at the beginning of this
section, and is essentially quadratic. Clearly, the loss function L is zero if the
prediction reproduces the dataset exactly. Otherwise, it is a positive number.
∇ŵ L(y, ŷ(ŵ)) = ∂L(y, ŷ(ŵ))/∂ŵ = −2X⊤(y − Xŵ*) = 0    (9.11)
Xŵ∗ = y (9.14)
Figure 9.1: Vectors of data-points in the feature space X² (blue) and in the augmented
feature space X̄² (red). Vectors x₁, x₂, and x₃ in space X² are linearly dependent, but
x̄₁, x̄₂, and x̄₃ in space X̄² in R³ are not, because of the elevation.
It is clear now that X is exactly the same as the "moment matrix" in the
formulation of the triangular elements given in Eq. (7.10) in the FEM [1]. The
condition for X having full rank is that the area formed by these 3 data-points
in the feature space X² (2-dimensional) is nonzero; such an area is half of the
determinant of the moment matrix. Figure 9.1 plots 3 data-points x₁, x₂, and
x₃ in space X², for which the area of the triangle is clearly nonzero. If we replace
x₃ by x′₃ so that x₁, x₂, and x′₃ become collinear, the triangle degenerates,
the area becomes zero, X becomes singular, and its rank reduces to 2.
Notice that while x₁, x₂, and x₃ are linearly dependent in space X² (vector x₃
can be represented by a linear combination of x₁ and x₂, because these 3 vectors,
in blue, lie in a plane), when these 3 data-points are projected onto the affine
plane X̄² (see Section 1.4.2), the corresponding 3 vectors (in red) in R³ are no
longer in a plane and become linearly independent. Therefore, an alternative
statement of the condition for X having full rank is that the vectors of these 3
data-points on the plane x₀ = 1 in the space X̄² in R³ are linearly independent.
If we replace x₃ by x′₃, then x̄₁, x̄₂, and x̄′₃ lie in a plane, and they become
linearly dependent. In this case, X will become singular and
its rank will become 2. Therefore, the rank of X depends on the maximum
number of linearly independent vectors of the data-points in the X̄² space.
Consider now cases of m > p + 1 with p = 2. This means that we still
have 2 features, but more than 3 data-points. In these cases, the rank of X
depends on the maximum number of linearly independent vectors in R³
among all the data-points in the X̄² space; hence, the chance of full rank should
be higher.

For general high-dimensional feature spaces X^p, the same argument holds
for X being full rank. If we examine the data-points in the X^p space, we need
the "volume" of the largest polyhedron with any set of p + 1 data-points as
its vertexes to be non-zero. If we examine the data-points in the X̄^p space, we
need the maximum number of linearly independent vectors in R^{p+1} among all
the data-points in X̄^p to be p + 1. Once X has full rank, we shall have
a least-square solution given by Eq. (9.13). A general analysis for arbitrary
dimension can be found in [2].
We now summarize our discussion in the following solution existence
theory.
still holds, and the solution existence depends on whether the resulting
moment matrix X has full rank. Alternatively, one can use a more powerful
hypothesis, such as an MLP with a large number of learning parameters, discussed in
Section 13.2. A precise analysis of solution existence for an MLP becomes more
complicated, but we can assert that it depends on whether or not the number
and the distribution of the data-points support the MLP hypothesis with
the learning parameters. This assertion is important in ML model creation,
because if one fails to train a model, checking and enriching the dataset, or
reducing the complexity of the MLP, could be effective solutions.
particular case, such a linear function gives the least error defined in Eq. (9.9)
in fitting the dataset.
To predict higher-order and nonlinear latent features relating to the labels
in a dataset, one has to add in higher-order bases, other types of
nonlinear bases, or more powerful nonlinear models like the MLP, as mentioned
earlier. The predictions at the output layer of neurons will still be in Y^k.
This is because of the continuous coverage over the affine spaces by each
neuron in the hidden layers ensured by both affine transformations wrapped
with activation functions. The prediction functions at the output layer are
continuous and differentiable with respect to the learning parameters, as
discussed in detail in Chapter 5 for affine transformations and in Chapter 7
for activation functions. In addition, the gradients can be found via autograd
as demonstrated in Chapter 8. Thus, a smooth and continuous coverage
of the prediction function can be realized by the learning parameters in
hypothesis space W^P. As a result, the predictability of the solution against
the label is ensured in theory. Regardless of the actual values of the labels
in the label space, the prediction shall be able to match those values to
the desired accuracy, by tuning the learning parameters in the hypothesis
space. The Universal Prediction Theory (see Chapter 5) has ensured the
capability of an MLP, as long as proper configuration of the MLP is used
with a sufficiently large number of learning parameters. Effective numerical
techniques are, however, needed to find the optimal learning parameters that
minimize the loss function.
For most machine learning models, the loss function is complicated.
It is often not possible to convert the minimization problem, in a way similar to
Eq. (9.10), to a single normal equation system. Therefore, we have to find
a way to solve the minimization problems directly by numerical means.
Iterative methods are often used, and finding a set of the minimizers requires
many steps.
For most real-life machine learning, the objective function f (x) with x being
the variable may have a number of local minimum values, at each of which
f (x) is smaller than at its nearby points. Only the x∗ , at which f (x∗ ) has the
smallest value over the entire domain, is regarded as the global minimizer,
and is usually the one we would like to find.
Consider the following function:
f(x) = (x/2) · cos(πx),  −1.0 ≤ x ≤ 4.0    (9.16)
def f(x):
return 0.5*x * np.cos(np.pi * x)
It is clear from Fig. 9.2 that this function has two local minima and one
global minimum in the domain plotted. The existence of local minima presents
a challenge, and unfortunately we do not have effective means to ensure
finding the global minimum for general ML models. In practice, what we
can do is perform the optimization multiple times, each starting from a
different initial point, and then take the minimum of these multiple minima.
from left towards 0 and seems to be a minimizer when viewed away from 0
to its right, as shown in Fig. 9.3. The gradients at both opposite sides of the
saddle point are the same.
Figure 9.4: An example of cost functions in 2D space with multiple saddle points.
For this 2D case plotted in Fig. 9.4, there are multiple saddle points, at
which the gradient of the function is zero, but the values of the function are
maxima along the x1 direction and minima along the x2 direction.
Figure 9.5: An ideal example of quadratic cost functions in 2D space with only one
minimum.
As shown in Fig. 9.5, the ellipse function behaves very well and has only
one minimum. Understanding this type of property is useful in dealing with
complicated loss functions because near a local minimum, many complicated
functions are locally elliptic, and hence can be taken advantage of. The only
difficulty that may arise is when the ratio of the minor and major radii is
too large (or too small). We shall discuss this again later.
Let us examine how this would work for the simplest one-dimensional
problem. Consider the first-order Taylor expansion of f about x,

f(x + (δx)) ≈ f(x) + f′(x)(δx)    (9.20)

where (δx) is a small real number. Because the function has the first
derivative, f′(x) must be a finite number (otherwise, it is not differentiable).
Therefore, we can choose a very small positive real number η as a scalar and
set (δx) = −ηf′(x), which implies that η should carry a unit of [(δx)f′⁻¹(x)].
Substituting into Eq. (9.20) gives f(x − ηf′(x)) ≈ f(x) − η(f′(x))², showing
that the update

x := x − ηf′(x)    (9.23)

would lead to a reduction of the value of f(x). Note the negative sign, which
implies that the update in x is in the negative (or downhill) direction of the
gradient.
Figure 9.6 shows that regardless of which side x is located on, the
gradient descent takes a step closer to the minimum of the function, provided
the function is convex.
9.4.2 Remarks
Based on the above analysis, we note the following important points:
1. Updating in x is in the negative direction of the gradient, and hence the
name of gradient descent.
2. The amount of reduction is proportional to the gradient of the function.
The larger the gradient, the larger the reduction, provided that ηf′(x) is
sufficiently small so that the Taylor expansion (and our analysis) still
holds without overshooting.
3. Scalar η is called the learning rate or learning step size in machine
learning, and choices of it can be tricky. It often requires some trial and
error.
4. On the one hand, we must take the direction at which the derivative is the
largest. On the other hand, we must not be too greedy to take too big an η
that could undermine the basis of our analysis. A larger learning rate may
lead to overshooting and hence oscillating behavior in the convergence
process, which is often observed in practicing machine learning.
5. When the derivative value f′(x) at x is zero, we make no progress. This
also means that we are already at a minimum (or at the saddle
point discussed in the previous section).
6. If the derivative value f′(x) at x is small, we make very little progress, one
more order smaller (note the (f′(x))² term). This implies that to speed
up the process, one would need access to the higher-order derivatives
of the cost function (as in higher-order methods such as Newton's
method).
where ∂f (x)/∂xi is the partial derivative of the function with respect to the
ith parameter. It is the change in f at x with respect only to a small change
in xi .
To measure the change in f at x with respect to a small change in
length (δl) in an arbitrary direction defined by a unit vector u, we
may write

D_u f(x) = lim_{(δl)→0} [f(x + (δl)u) − f(x)]/(δl)    (9.25)
where θ is the angle between ∇x f (x) and u, with 0 ≤ θ ≤ 2π. The minimum
value of cos(θ) is −1 at θ = π. Therefore, Du f (x) is minimized when u is at
the opposite direction of the gradient ∇x f (x).
We can now reduce the value of f by updating x in the following manner:

x := x − η ∇ₓf(x)

where η is a positive scalar, the learning rate, which shall carry a unit of
[(δx)‖∇ₓf(x)‖⁻¹].
The analysis for hyper-dimensional problems is similar to what we have
done for the one-dimensional problems, but needs an additional analysis on
the direction. We also note that the gradient of a scalar function with respect
to a vector variable is a vector of the same shape of the variable vector. In
other words, taking the gradient to a scalar function is essentially the same
as taking the derivative. All one needs to do is to change the scalar variable
to a vector variable. By the same argument, all the remarks made in the
previous subsection also hold for hyper-dimensional problems.
f((x_a + x_b)/2) ≤ (f(x_a) + f(x_b))/2    (9.31)
This condition can be given in a more general statement as follows:
These properties are useful to prove convergence theory for gradient descent
methods. They are shown pictorially in Fig. 9.7.
‖x_a − x*‖ ≤ R    (9.36)
The first equation ensures convergence, meaning that f(x_t) will eventually
approach f(x*), and the second equation ensures that the convergence can be
achieved in finite steps. Note that in practice, tuning the learning rate can often
be a challenging task. Equation (9.41) provides a theoretical guideline to design a
schedule for learning rates. The schedule given below satisfies both of the foregoing
equations:

η_t = η₀ / t^β    (9.42)

in which η₀ is a pre-specified initial learning rate, and 1/2 < β < 1, in order
to satisfy the conditions in Eq. (9.41). The following is a simple code for this
(inverse) power scheduling.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
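# lr_schedulerP is sketched here; it implements the (inverse) power
# schedule of Eq. (9.42) (the exact original definition was not shown):
def lr_schedulerP(epoch, lr0, beta):
    return lr0 / (epoch + 1.0)**beta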
ep_list = np.array(range(2000))
lr0, beta, lr_list= 0.01, 0.51, []
for epoch in ep_list:
lr_list.append(lr_schedulerP(epoch, lr0, beta))
fig = plt.figure()
ax = plt.axes()
ax.plot(ep_list, lr_list);
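The helper lr_scheduler1 is not shown in the excerpt; a sketch matching the description below Fig. 9.9 (constant start, linear decay, minimum floor; the argument meanings are assumptions):

def lr_scheduler1(epoch, lr0, warm_frac, lr_min, n_epochs):
    n_warm = int(warm_frac * n_epochs)   # constant portion at the beginning
    if epoch < n_warm:
        return lr0
    if epoch >= n_epochs:
        return lr_min                    # minimum acceptable rate at the end
    frac = (epoch - n_warm) / float(n_epochs - n_warm)
    return lr0 + frac * (lr_min - lr0)   # linear in between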
ep_list = np.array(range(1500))
lr, lr_list= 0.01, []
for epoch in ep_list:
lr_list.append(lr_scheduler1(epoch,lr,0.02,1e-5,1000))
fig = plt.figure()
ax = plt.axes()
ax.plot(ep_list, lr_list)
This schedule, shown in Fig. 9.9, has a constant portion at the beginning,
which can be useful for fast convergence at the initial stage. The
learning rate used at the final stage is set at a minimum rate that is
acceptable to the analyst. The learning rate in between varies linearly.
whole cost function. For a dataset with one million data-points, for example,
the computation of the gradient at each iteration is reduced by one million
times! Thus, this can be very cost-effective.
When the SGD is used in this kind of extreme manner, the convergence
of the cost function may become less stable and behave in an oscillatory
manner. This is when the mini-batch concept comes in handy: instead of
randomly sampling only one data-point, we can randomly sample, uniformly,
a small batch of data-points at each iteration (epoch). This can significantly
improve the convergence behavior with a small increase in cost, and hence
the mini-batch technique can be very effective. It is one of the most efficient
algorithms in machine learning when dealing with a big dataset.
The SGD is even more appealing when the training data-points have
high redundancy: the stochastically sampled gradients from a small batch can
be a very good estimation of the true gradient of the whole cost function.
When the training dataset is large enough with high redundancy, a
smaller batch size can be used, and the saving in cost can be very significant.
In addition, using stochastically sampled gradients can be considered
a type of regularization, providing some effect in alleviating overfitting.
for multiple epochs, until all the samples in the entire dataset are more
or less exhausted and the loss function has converged to an acceptably low
level.
Because of the random sample selection for the mini-batches, one can
hope that any mini-batch is statistically a good representation of the entire
dataset. The following function is defined for the mini-batch SGD:
import mxnet as mx
import numpy as np
from mxnet import autograd,gluon
from mxnet import ndarray as nd
import random
%matplotlib inline
import matplotlib.pyplot as plt
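The cells generating the synthetic dataset and defining the helper functions (cost_function, train, and the parameter update) are not shown; a sketch of the data generation and the mini-batch SGD update, with assumed values, is:

true_w, true_b = nd.array([2.0]), 4.2     # assumed ground truth
num_samples = 100
X = nd.random_normal(shape=(num_samples, 1))
y = true_w[0] * X[:, 0] + true_b + 0.1 * nd.random_normal(shape=num_samples)

def sgd(params, lr, batch_size):
    # mini-batch SGD update: move each parameter down its averaged gradient
    for param in params:
        param[:] = param - lr * param.grad / batch_size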
plt.scatter(X[:, 0].asnumpy(),y.asnumpy(),color='r')
plt.xlabel('$x$')
plt.ylabel("$\hat{y}$")
plt.title('data-points for Linear Regression')
plt.show()
With all the necessary functions defined for use, we now write the
algorithm to train the model.
%matplotlib notebook
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d import axes3d
import numpy as np
# ... body of the train() function (defined in omitted text);
# it returns wi, bi, total_loss, epochs
Wi,Bi,T_cost,Epochs=train(batch_size=10,lr=.2,epochs=5,
period=10)
tw,tb = [true_w],[true_b]
true_z = cost_function(X,tw,tb,y)
print('length of Wi = ',len(Wi))
#%matplotlib notebook
%matplotlib inline
Figure 9.11: The loss function in 2D space varying quadratically with weight and bias
for the linear regression problem.
fig, ax = plt.subplots(figsize=(8,6))
plt.rcParams["font.size"] = "15"
CS = ax.contour(W11, B1, Z,20)
ax.clabel(CS, inline=1, fontsize=9)
ax.set_title('A cost function of least square error')
plt.scatter(Wi, Bi, s=20, c='r',marker='^', alpha=0.5)
plt.scatter(Wi[-1], Bi[-1], s=80, c='r',marker='X',alpha=0.9)
plt.scatter(Wi[0], Bi[0], s=80, c='r', marker='o', alpha=0.9)
plt.scatter(true_w[0],true_b,s=80,c='g',marker='o',alpha=0.5)
ax.set_xlabel('$w11$', fontsize=12, rotation=0)
ax.set_ylabel('$b1$', fontsize=12)
fig.tight_layout()
plt.show()
Figure 9.12: Contours of the loss function in 2D space varying quadratically with weight
and bias for the linear regression problem. A minimization algorithm (SGD) finds the
minimum (green dot) starting from the initial values of weight and bias (red dot).
Figure 9.13: Convergence of the loss function, the weight, and bias for the linear regression
problem, during the SGD searching for the minimizer.
%matplotlib inline
import matplotlib.pyplot as plt
print('true_w:',true_w,' true_b:',true_b)
x = np.arange(-4, 4, .1)
yh = Wi[-1] * x + Bi[-1]
print('Predicted_w:', Wi[-1],'Predicted_b:',Bi[-1])
fig = plt.figure(figsize=(8,6))
plt.rcParams["font.size"] = "12"
plt.scatter(X[:, 0].asnumpy(),y.asnumpy(),color='r')
plt.plot(x,yh,label = "$\hat{y} = x w + b$")
plt.xlabel('$x$')
plt.ylabel("$\hat{y}$")
plt.title('Linear Regression by Minimizing the Distances')
plt.legend()
plt.show()
Figure 9.14: The results of the linear regression. It is found that a linear function fits
best to the data-points in the sense of the least squares of errors.
The gradient descent (GD) and its variation, the stochastic gradient
descent (SGD), move in each iteration based solely on the current gradient
information of the cost function. They do not use any information
from the past search history. Therefore, techniques that make use of the
history may perform better. The method of gradient descent with momentum
(GDM) is one of these techniques.
Figure 9.15: Pictorial description of the zigzag behavior of the gradient descent algorithm
for searching the minimizer of an elliptic loss function in an extreme case. The long axis
is in the horizontal direction, and the short axis is in the vertical direction. “Productive”
gradient components are in the horizontal direction, and they are all aligned in the positive
direction towards the minimizer, but they are small, and hence progressing slowly. The
“unproductive” gradient components are in the vertical direction, because they reverse
directions each iteration resulting in a waste. The search traces on the right-lower part
(starting from the red dot) are the actual case that we have seen in the previous example.
We managed to reach the minimizer with rather a smoother path (at least at the earlier
iterations). This was at a cost using a very small learning rate.
marching step. This time, the negative gradient point is yet again on the
opposite side. This cycle continues, forming a zigzag path moving slowly
toward the minimum point. It is clear that the GD method wasted quite a
lot of effort in the process.
The root of the problem is the drastic change in direction of the
consecutive (negative) gradient vectors from one step to the next. For this
type of cost function with heavily elliptical contours, the two components
of the gradient vector are very different (this corresponds to bad
conditioning of the Hessian matrix, which contains the 2nd derivatives of the cost
function). All these marching arrows have two components pointing to the x1
function). All these marching arrows have two components pointing to the x1
(along the long axis of the ellipse) and x2 (along the short axis of the ellipse)
directions. They are basically two components of the (negative) gradient
vector. It is easy to see that the x1 components for all the marching arrows
are largely aligned and pointing more or less to the target. However, the x2
component of a marching arrow is largely opposite that of the next arrow.
With this insightful observation, we can now design an algorithm that
uses some kind of weighted average of the gradient vectors in consecutive
steps. If this can be done, all the x1 components are still encouraged to point
to the target, and all the x2 components will cancel each other, leading to a
significant reduction of the zigzag. This is essentially the idea of the method
of gradient descent with momentum.
9.6.2 Formulation
The gradient descent with momentum can be found in Ref. [3]. It is given
as follows:
vi := γvi−1 + η∇f (xi ) (9.43)
xi := xi−1 − vi (9.44)
where v is the velocity at the current step, and γ is the momentum parameter
that is between 0 and 1, and is usually set to 0.9. The learning rate η and
the gradient ∇f (x) are defined in the previous section.
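A minimal sketch of the corresponding parameter update, in the same style as the sgd() helper sketched earlier (vs holds one velocity NDArray per parameter):

def sgd_momentum(params, vs, lr, gamma, batch_size):
    for param, v in zip(params, vs):
        v[:] = gamma * v + lr * param.grad / batch_size   # Eq. (9.43)
        param[:] = param - v                              # Eq. (9.44)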
To analyze the working behavior of the GDM, let us consider two
simplified cases. The first case assumes that all the ∇f(x) at consecutive
steps point in the same direction and have the same value, a
constant vector g. This situation is quite similar to the x1-component of the
gradient for the case shown in Fig. 9.15. We now start
the iteration as follows:
v_0 := 0,    (9.45)
v_1 := η g,    (9.46)
v_2 := γ v_1 + η g = η g (γ + 1),    (9.47)
v_3 := γ v_2 + η g = η g (γ² + γ + 1),    (9.48)
...    (9.49)
v_t := γ v_{t−1} + η g = η g (γ^{t−1} + γ^{t−2} + ··· + γ² + γ + 1),    (9.50)
...    (9.51)
v_∞ := η g / (1 − γ).    (9.52)
1. For a large γ close to 1, such as γ = 0.9, the final velocity is 10 times
the step of the original gradient descent using the constant g. Iterations
with momentum can therefore speed up the advancement in a direction along
which all the gradients are aligned, such as the x1-component of the gradient
shown in Fig. 9.15.
2. For a small γ close to 0, such as γ = 0.1, the acceleration effect is very
small, and the GDM does not differ much from the original GD algorithm.
3. For a mid-range γ, such as γ = 0.5, Eq. (9.52) gives η g/(1 − 0.5) = 2 η g,
so the steps in the alignment direction are eventually doubled.
This simple example shows that momentum is built up to speed up the
marching, provided the gradients are well aligned and a larger γ is used.
In the second example, let us assume that the ∇f(x) at consecutive steps
alternate between opposite directions but still have the same magnitude, given
by a constant vector g. This is similar to the x2 component; under this
assumption, the standard GD method makes no net advancement in that direction.
Let us examine what happens when the GDM is used.
We now start the iteration as follows:
v_0 := 0,    (9.53)
v_1 := η g,    (9.54)
v_2 := γ v_1 − η g = −η g (1 − γ),    (9.55)
v_3 := γ v_2 + η g = η g (1 − γ + γ²),    (9.56)
v_4 := γ v_3 − η g = −η g (1 − γ + γ² − γ³),    (9.57)
...    (9.58)
v_{2i−1} := η g (1 − γ + γ² − ··· − γ^{2i−3} + γ^{2i−2}),    (9.59)
v_{2i} := −η g (1 − γ + γ² − ··· + γ^{2i−2} − γ^{2i−1}),    (9.60)
v_{2i−1} ≤ η g,    (9.61)
v_{2i} ≥ −η g.    (9.62)
Based on the above analysis, we can conclude that a large γ close to 1,
say γ = 0.9, is preferred. In summary, gradient descent with momentum
converges much faster than the original GD, because updates along
the long axis are much faster, while along the short axis there is some level of
cancellation. The zigzag behavior can be significantly suppressed.
From the algorithm, it is seen that the gradients obtained in all the previous
iterations are involved in the current iteration. The next question one
may ask is to what extent they are involved. Let us analyze this further.
We start the iteration again, this time allowing the gradient to change
at each iteration:
v_0 := 0,    (9.63)
v_1 := η ∇f(x_1),    (9.64)
v_2 := γ v_1 + η ∇f(x_2) = γ η ∇f(x_1) + η ∇f(x_2),    (9.65)
v_3 := γ v_2 + η ∇f(x_3) = γ² η ∇f(x_1) + γ η ∇f(x_2) + η ∇f(x_3),    (9.66)
...    (9.67)
Because γ < 1, it is clear that the influence of the gradients from earlier
iterations decays geometrically at the rate of γ: the gradient from j iterations
ago carries a weight of γ^j (for γ = 0.9, a gradient from 20 steps back is
weighted by 0.9^{20} ≈ 0.12).
Finally, one may be inspired by this analysis to suspect that a properly varying
γ could be even better! We also note that to remove this kind of zigzag behavior
entirely, the conjugate gradient method should be used, but it requires the
evaluation of the 2nd derivatives (the Hessian matrix), which can be a lot more
expensive, and its scalability can also be a problem.
import mxnet as mx
from mxnet import autograd, gluon
from mxnet import ndarray as nd
import random
import numpy as np
mx.random.seed(1); random.seed(1)
# Generate data.
n_v = 2 # number of variables (features)
n_s = 1000 # number of samples
true_w, true_b = [2, -3.4], 4.2
X = nd.random_normal(scale=1, shape=(n_s, n_v))
y = true_w[0]*X[:, 0] + true_w[1]*X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)

# Initialize parameters and one velocity buffer per parameter.
# Note: the shapes below are assumed to match the identical later
# examples in this section.
def init_params():
    w = nd.random_normal(scale=1, shape=(n_v, 1))
    b = nd.zeros(shape=(1,))
    params = [w, b]
    vs = []
    for param in params:
        param.attach_grad()            # allocate gradient storage
        vs.append(param.zeros_like())  # velocity, same shape as param
    return params, vs
# Linear regression.
def net(X, w, b):
    return nd.dot(X, w) + b

# Loss function.
def square_loss(yhat, y):
    return (yhat - y.reshape(yhat.shape)) ** 2 / 2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
Figure 9.16: Convergence of the loss function for the same linear regression problem
during the SGD-with-momentum search for the minimizer.
9.7.1 Formulation
9.8.1 Formulation
Adagrad scales the base learning rate η and updates the parameters
using
x_i := x_{i−1} − (η / √(G_i + ε)) ∇f(x_i)    (9.71)
where ε is a small positive number just for preventing a zero denominator during
the computation. Note that the above operations on these vectors are all
element-wise, including the square-root operation, and G_i is also a vector
given by
G_i = Σ_{q=1}^{i} (∇f(x_q))²    (9.72)
which is the sum of the element-wise squares of the gradients of all the
previous iterations (accumulated during the iteration).
Note that this is done for each parameter. For parameters that received
excessive updates in the previous iterations, the current update is suppressed,
while parameters that get small updates receive larger learning rates.
For the strongly elliptical objective functions discussed earlier, Adagrad
tends to promote advancement along the long axis (x1-direction) and to
discourage moves along the short axis (x2-direction). This practically suppresses
the zigzag behavior shown in Fig. 9.15.
Let us examine the Adagrad algorithm using the same linear regression
problem.
# Adagrad.
def adagrad(params, sqrs, lr, batch_size):
    eps_stable = 1e-7
    for param, sqr in zip(params, sqrs):
        g = param.grad / batch_size
        sqr[:] += nd.square(g)                    # accumulate squared gradient
        div = lr * g / nd.sqrt(sqr + eps_stable)  # element-wise
        param[:] -= div
import mxnet as mx
from mxnet import autograd, gluon
from mxnet import ndarray as nd
import random
mx.random.seed(1); random.seed(1)
# Generate data.
n_v = 2 # number of variables (features)
n_s = 1000 # number of samples
true_w, true_b = [2, -3.4], 4.2
X=nd.random_normal(scale=1,shape=(n_s,n_v))
y=true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y+=.01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
def net(X, w, b):
    return nd.dot(X, w) + b

# Loss function.
def square_loss(yhat, y):
    return (yhat - y.reshape(yhat.shape)) ** 2 / 2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
Figure 9.17: Convergence of the loss function for the same linear regression problem
during the AdaGrad search for the minimizer.
RMSProp is short for Root Mean Square Propagation [6]. It scales the learning
rate for each of the parameters. It has been shown that RMSProp adapts the
learning rate well across different applications. More details can be found in
Ref. [6].
9.9.1 Formulation
RMSProp scales the base learning rate η and updates the parameters using
x_i := x_{i−1} − (η / √(v_i + ε)) ∇f(x_i)    (9.73)
where ε is a small positive number for preventing a zero denominator during
the computation. Notice again that the operations are element-wise. And
v_i at the ith iteration is given as follows:
v_i := γ v_{i−1} + (1 − γ) (∇f(x_i))²    (9.74)
# RMSProp.
def rmsprop(params, sqrs, lr, gamma, batch_size):
    eps_stable = 1e-8
    for param, sqr in zip(params, sqrs):
        g = param.grad / batch_size
        sqr[:] = gamma * sqr + (1. - gamma) * nd.square(g)
        # note the in-place computation here
        div = lr * g / nd.sqrt(sqr + eps_stable)
        param[:] -= div
import mxnet as mx
from mxnet import autograd, gluon
from mxnet import ndarray as nd
import random
mx.random.seed(1); random.seed(1)
# Generate data.
n_v = 2 # number of variables (features)
n_s = 1000 # number of samples
# Linear regression.
def net(X, w, b):
    return nd.dot(X, w) + b

# Loss function.
def square_loss(yhat, y):
    return (yhat - y.reshape(yhat.shape)) ** 2 / 2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
print('true_w:',true_w,' true_b:',true_b)
print('Predicted_w:', np.reshape(w.asnumpy(), (1, -1)),
'Predicted_b:', b.asnumpy()[0], '\n')
x_axis=np.linspace(0,epochs,len(total_loss),
endpoint=True)
plt.semilogy(x_axis, total_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
Figure 9.18: Convergence of the loss function for the same linear regression problem
during the RMSProp search for the minimizer.
# Adadelta.
def adadelta(params, sqrs, deltas, rho, batch_size):
    eps = 1e-5  # epsilon for stabilization
    for param, sqr, delta in zip(params, sqrs, deltas):
        g = param.grad / batch_size
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        cur_delta = nd.sqrt(delta + eps) / nd.sqrt(sqr + eps) * g
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # update weight
        param[:] -= cur_delta
import mxnet as mx
from mxnet import ndarray as nd
from mxnet import autograd, gluon
import random
mx.random.seed(1)
# Generate data.
n_v = 2 # number of variables (features)
n_s = 1000 # number of samples
true_w, true_b = [2, -3.4], 4.2
X = nd.random_normal(scale=1,shape=(n_s,n_v))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Construct data iterator.
def data_iter(batch_size):
    idx = list(range(n_s))
    random.shuffle(idx)
    for batch_i, i in enumerate(range(0, n_s, batch_size)):
        j = nd.array(idx[i: min(i + batch_size, n_s)])
        yield batch_i, X.take(j), y.take(j)
# Linear regression.
def net(X, w, b):
    return nd.dot(X, w) + b

# Loss function.
def square_loss(yhat, y):
    return (yhat - y.reshape(yhat.shape)) ** 2 / 2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
print('true_w:',true_w,' true_b:',true_b)
print('Predicted_w:', np.reshape(w.asnumpy(), (1, -1)),
'Predicted_b:', b.asnumpy()[0], '\n')
x_axis=np.linspace(0,epochs,len(total_loss), endpoint=True)
plt.semilogy(x_axis, total_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
Figure 9.19: Convergence of the loss function for the same linear regression problem
during the AdaDelta search for the minimizer.
9.11.1 Formulation
At the ith training iteration, for given parameters x_i and a loss function
f(x_i), the Adam algorithm scales the base learning rate η and updates the
parameters using
x_i := x_{i−1} − η m̂_i / (√(v̂_i) + ε)    (9.75)
where ε is a small positive number for preventing a zero denominator during
the computation, and m̂_i and v̂_i are bias-corrected estimates given as follows:
m̂_i := m_i / (1 − β₁^i)    (9.76)
v̂_i := v_i / (1 − β₂^i)    (9.77)
in which m_i and v_i are, respectively, updated using
m_i := β₁ m_{i−1} + (1 − β₁) ∇f(x_i)    (9.78)
v_i := β₂ v_{i−1} + (1 − β₂) (∇f(x_i))²    (9.79)
In Eqs. (9.78) and (9.79), β₁ and β₂ are positive constants acting as forgetting
factors, respectively, for the gradients and for the second moments of the
gradients.
# Adam.
def adam(params, vs, sqrs, lr, batch_size, t):
    beta1, beta2 = 0.9, 0.999
    eps = 1e-8  # epsilon for stabilization
    for param, v, sqr in zip(params, vs, sqrs):
        g = param.grad / batch_size
        v[:] = beta1 * v + (1. - beta1) * g                 # Eq. (9.78), momentum
        sqr[:] = beta2 * sqr + (1. - beta2) * nd.square(g)  # Eq. (9.79)
        v_bias_corr = v / (1. - beta1 ** t)                 # Eq. (9.76)
        sqr_bias_corr = sqr / (1. - beta2 ** t)             # Eq. (9.77)
        div = lr * v_bias_corr / (nd.sqrt(sqr_bias_corr) + eps)
        param[:] = param - div  # RMSProp-style scaled update, Eq. (9.75)
import mxnet as mx
from mxnet import autograd, gluon
from mxnet import ndarray as nd
import random
mx.random.seed(1); random.seed(1)
# Generate data.
n_v = 2 # number of variables (features)
n_s = 1000 # number of samples
true_w, true_b = [2, -3.4], 4.2
X = nd.random_normal(scale=1, shape=(n_s, n_v))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)
# Linear regression.
def net(X, w, b):
    return nd.dot(X, w) + b

# Loss function.
def square_loss(yhat, y):
    return (yhat - y.reshape(yhat.shape)) ** 2 / 2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt
import numpy as np
t = 0
# Epoch starts from 1.
for epoch in range(1, epochs + 1):
    for batch_i, data, label in data_iter(batch_size):
        with autograd.record():
            output = net(data, w, b)
            loss = square_loss(output, label)
        loss.backward()
        # Increment t before invoking adam.
        t += 1
        adam([w, b], vs, sqrs, lr, batch_size, t)
        if batch_i * batch_size % period == 0:
            total_loss.append(np.mean(square_loss(net(X, w, b),
                                                  y).asnumpy()))
    print("Batch size %d, Learning rate %f, Epoch %d, loss %.4e"
          % (batch_size, lr, epoch, total_loss[-1]))
print('true_w:',true_w,' true_b:',true_b)
print('Predicted_w:', np.reshape(w.asnumpy(), (1, -1)),
'Predicted_b:', b.asnumpy()[0], '\n')
x_axis = np.linspace(0,epochs,len(total_loss),
endpoint=True)
plt.semilogy(x_axis, total_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
Figure 9.20: Convergence of the loss function for the same linear regression problem
during the Adam search for the minimizer.
We now present a case study done by the scikit-learn team [9]. This study
examines the performance of different stochastic minimization techniques,
including SGD and Adam, and plots the evolution of the loss function. The code
is available at the Sklearn example site (https://scikit-learn.org/stable/auto_
examples/neural_networks/plot_mlp_training_curves.html). Figure 9.21 is
the result obtained using the code.
It is observed that SGD with momentum, Nesterov's SGD with momentum, and
Adam are among the best performers, and Adam seems to be the overall best, based
on this study on four datasets (iris, digits, circles, and moons). Note that
there may be some bias toward some techniques, and the outcome of the tests
can change for different datasets. The study shall still give a good general
indication of these different techniques. Note also that the results can depend
strongly on some parameters. A more detailed examination of this
study can be found at the Sklearn example site.
References
[1] G.R. Liu and S.S. Quek, The Finite Element Method: A Practical Course, Butterworth-
Heinemann, London, 2013.
[2] G.R. Liu, Solution existence theory for artificial neural networks, International Journal
of Computational Methods, 19(8), in press, 2022.
[3] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning representations by
back-propagating errors, Nature, 323(6088), 533–536, 1986. http://www.nature.com/
articles/323533a0.
[4] Y. Nesterov, A method for unconstrained convex minimization problem with the rate
of convergence O(1/k²), Doklady AN SSSR (translated as Soviet Math. Dokl.), 269,
543–547, 1983.
[5] J.C. Duchi, H. Elad and S. Yoram, Adaptive subgradient methods for online learning
and stochastic optimization, J. Mach. Learn. Res., 12, 2121–2159, 2011. http://dblp.
uni-trier.de/db/journals/jmlr/jmlr12.html#DuchiHS11.
[6] T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude, COURSERA, Neural Networks for Machine Learning,
4(1), 26–31, 2012.
[7] M. Zeiler, ADADELTA: An adaptive learning rate method, arXiv, 1212.5701, 2012.
[8] D.P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014. http://
arxiv.org/abs/1412.6980.
[9] P. Fabian, V. Gae, G. Alexandre et al., Scikit-learn: Machine Learning in Python,
Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/
papers/v12/pedregosa11a.html.
Chapter 10
Loss Functions for Regression
x = [x₁, x₂, . . . , x_p]    (10.1)
Figure 10.1: A simple p → 1 neural network with one input layer of multiple neurons,
each taking one feature input, and one output layer of a single neuron that produces a
prediction ŷ, using the xw+b formulation.
the label y. This is done through training of the network, using a dataset as
inputs together with their corresponding label value of y. The data structure
can be exactly the same as those discussed in Chapter 5, by simply replacing
z by ŷ. This replacement is usually done for the last layer of output neurons
where a comparison of the prediction is performed against the label y.
Figure 10.2: A simple p → 1 neural network with one input layer of multiple neurons and
one output layer of a single neuron, using the xw formulation.
A loss (cost, objective) function is needed to measure how far the model
predictions are from the true label data. It is used by the optimizer to
tune the training parameters. Here, we introduce a loss function widely used
in engineering and the sciences, known as the mean-squared-error (MSE) or L2
loss function, which was used in Chapter 9. Assume we have m labels or
measured/observed actual data. The loss function can be defined as follows:
L(ŷ, y) = (1/2m) Σ_{i=1}^{m} (ŷ_i − y_i)² = (1/2m) Σ_{i=1}^{m} r_i² = (1/2m) (rᵀ r) = (1/2m) ‖r‖₂²    (10.4)
where i is the dummy index for the ith training sample and m is the total
number of samples in the training dataset. r_i is the ith entry in the residual
vector r, i.e., the residual of the ith prediction against its label.
Equation (10.4) gives a loss functional that maps m sets of data to a positive
real number. The MSE loss function is called L2 loss function because it uses
the L2 norm, as seen in Eq. (10.4).
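For concreteness, Eq. (10.4) is essentially a one-liner in NumPy; a minimal
sketch (yhat and y are assumed to be 1-D arrays of m predictions and labels):

import numpy as np

def mse_loss(yhat, y):
    r = yhat - y                        # residual vector
    return np.dot(r, r) / (2 * len(y))  # (1/2m) * ||r||_2^2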
The MSE or L2 loss function has the following properties:
Figure 10.3: Schematic drawing for linear regression problems using loss function
measuring the average squared distances from the data-points to the linear prediction
function.
For our linear regression problems, the loss function measures the average
squared distances from the data-points to the straight line of prediction, as
shown in Fig. 10.3 for one-dimensional problems, where we have only one
design parameter or feature, one weight, and one bias.
The error is linear in ŷ, which is in turn linear in the weights and biases.
Therefore, the squared error loss function is quadratic in the weights and biases.
The minimization then leads to a set of linear system equations, and we
can have a closed-form solution for the minimizers, as shown in Chapter 9. In
this chapter, we compute the solution numerically and demonstrate how a
neural network can be trained to do this task.
The MAE or L1 loss function is also called Laplace loss. It has the
following properties:
1. It is still convex with respect to the residual by definition or by the fact
that all norms are convex.
2. The L1 loss is, however, not differentiable at the sample point where
the residual is zero. Special techniques (sub-gradient) are needed to use
gradient-based optimization algorithms.
3. It is distance based depending only on the residual. When the residual is
zero, it is zero. It is also translation invariant (distance-based).
4. Because L1 measures the absolute residuals, it is linearly related to
the residuals. It is thus more robust to outliers compared to its L2
counterpart.
1. It is approximately equal to r²/2 for a small residual, and hence it behaves
more like the L2 loss near zero.
2. It approaches abs(r)−log(2) when the residual is large, and hence is more
like the L1 loss.
3. It is convex, smooth, and differentiable everywhere.
4. It is less sensitive to outliers compared to the L2 loss.
Let us now write a Python code to examine these loss functions in more
detail in numbers and in graphs, using only one sample. We first check all
of them in numbers.
import numpy as np
huber = lambda r, dlt: np.where(np.abs(r) < dlt, 0.5*(r**2),
                                dlt*np.abs(r) - 0.5*(dlt**2))
for r in np.linspace(0.1, 1.0, 10):
    print(f'r={r:.3f} |r|= {r:.3f} r^2= {r*r:.3f} '
          f'Huber={huber(r,1.0):.3f} logcosh={np.log(np.cosh(r)):.3f}')
r=0.100 |r|= 0.100 r^2= 0.010 Huber=0.005 logcosh=0.005
r=0.200 |r|= 0.200 r^2= 0.040 Huber=0.020 logcosh=0.020
r=0.300 |r|= 0.300 r^2= 0.090 Huber=0.045 logcosh=0.044
r=0.400 |r|= 0.400 r^2= 0.160 Huber=0.080 logcosh=0.078
r=0.500 |r|= 0.500 r^2= 0.250 Huber=0.125 logcosh=0.120
r=0.600 |r|= 0.600 r^2= 0.360 Huber=0.180 logcosh=0.170
r=0.700 |r|= 0.700 r^2= 0.490 Huber=0.245 logcosh=0.227
r=0.800 |r|= 0.800 r^2= 0.640 Huber=0.320 logcosh=0.291
r=0.900 |r|= 0.900 r^2= 0.810 Huber=0.405 logcosh=0.360
r=1.000 |r|= 1.000 r^2= 1.000 Huber=0.500 logcosh=0.434
We notice that the L2, Huber, and log-cosh losses are very insensitive to
small residuals.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc
plt.figure(figsize=(4.5, 2.8),dpi=100)
plt.xlabel('Residual')
plt.ylabel('Loss (Residual)')
plt.title('Comparison of loss functions')
plt.grid(color='r', linestyle=':', linewidth=0.5)
plt.legend()
plt.show()
The key features mentioned earlier are evident in Fig. 10.4. Readers
may play with the code, change the hyperparameters, and obtain a
different view of these loss functions. For more detailed discussions, readers
are referred to the article entitled 5 Regression Loss Functions All Machine
Learners Should Know (heartbeat.fritz.ai/5-regression-loss-functions-all-
machine-learners-should-know-4fb140e9d4b0).
Figure 10.5: Regression result using linear polynomial and noisy data-points.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

def datagen1(x, noise=0.3):
    # Generate data with features of 1, x (linear)
    lx = len(x)
    One = np.ones(lx)
    np.random.seed(8)
    X = np.stack((One, x), axis=-1)   # for linear polynomial
    y = x + np.random.rand(lx)*noise  # y vector
    return X, y
if reg_type == 1:
    X, y = datagen1(x, noise)      # Data, linear polynomial
elif reg_type == 2:
    X, y = datagen2(x, noise)      # Data, 2nd order polynomial
else:
    X, y = datagen3(x, noise=0.1)  # Data with sin(x)
yout = y.copy()
yout[int(lx/2)], yout[int(lx-2)] = 1.0, 0.8  # 2 outliers
print('y=', y, y.shape)
plt.figure(figsize=(4.5, 2.9), dpi=100)
c = np.linalg.solve(XTX, XTy)
cout = np.linalg.solve(XTX, XTyout)
print(' c*=', c, c.shape)
plt.plot(x, y_hat(X, c), label="$\hat{y}$ without outliers")
plt.plot(x, y_hat(X, cout), c='r', alpha=0.5,
         label="$\hat{y}$ with outliers")
# compute the predicted lines and plot out.
Length of x= 32 [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.
1.1 1.2 1.3 1.4 1.5 1.6 1.7
1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3. 3.1]
y= [1.08734294 1.20668748 1.30558878 1.37860578 1.45269118
1.53056542
1.66768936 1.75445282 1.84962356 1.92116609 1.99700663
2.05554596
2.12812864 2.16479564 2.18741794 2.19010416 2.18848111
2.25905033
2.18722504 2.15818019 2.11587827 2.17149642 2.04128198
2.00791829
1.92255747 1.87094914 1.81486486 1.78699438 1.64953431
1.62772427
1.44399321 1.38674914] (32,)
XTy= [58.51029079 93.63832045 39.77958804] (3,)
XTX= [[ 32. 49.6 19.9954796 ]
[ 49.6 104.16 31.42797769]
[ 19.9954796 31.42797769 15.70789511]] (3, 3)
c*= [1.06733004 0.09226871 0.98918366] (3,)
Figure 10.7: Regression result using 2nd-order polynomial and noisy data-points.
Figure 10.8: Regression result using sine feature function and noisy data-points.
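The snippet below uses a fitted pipeline named poly7_model that is not defined
in this excerpt. A minimal sketch of how it might be constructed, assuming
scikit-learn's PolynomialFeatures and LinearRegression (following the
linear-regression example linked later in this section); the definition is
hypothetical:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical definition matching the name used below.
poly7_model = make_pipeline(PolynomialFeatures(7), LinearRegression())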
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.05 * rng.randn(50)
xfit = np.linspace(0, 10, 1000)
poly7_model.fit(x[:, np.newaxis], y)
yfit = poly7_model.predict(xfit[:, np.newaxis])
plt.figure(figsize=(3.5, 3.0), dpi=100)
plt.scatter(x, y, s=15., c='b', alpha=0.8)
plt.plot(xfit, yfit, c='b', label="Fitted with 6th order polynomial",
         alpha=0.9)
plt.plot(xfit, np.sin(xfit), c='r', label="Original true sine function",
         alpha=0.9)
plt.grid(color='r', linestyle=':', linewidth=0.3)
It is seen from Fig. 10.9 that, using 6th-order polynomial basis functions,
we can fit a set of nonlinear data created using a sine function and 5% white
noise. Note that such a high-order polynomial fitting can be very dangerous
when it is used for making predictions outside of the dataset. We can print
out the features of a 6th-order polynomial (the variable raised to all powers
up to the order):
We can see that these features grow very fast as the order of the
polynomial increases. Let us print out the prediction a little out of the fitted domain.
Figure 10.10: Linear regression of data-points with a higher-order polynomial. Note the
danger of extrapolation.
We will now use Scikit-Learn and the code for linear regression
(https://jakevdp.github.io/PythonDataScienceHandbook/05.06-
linear-regression.html), using Gaussian basis functions as the feature
functions.
from sklearn.base import BaseEstimator, TransformerMixin

class GaussianFeatures(BaseEstimator, TransformerMixin):
    # Uniformly spaced Gaussian basis features for 1-D input
    def __init__(self, N, width_factor=2.0):
        self.N, self.width_factor = N, width_factor
    @staticmethod
    def _gauss_basis(x, y, width, axis=None):
        arg = (x - y) / width
        return np.exp(-0.5 * np.sum(arg ** 2, axis))
    def fit(self, X, y=None):  # spread N centers over the data range
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor*(self.centers_[1]-self.centers_[0])
        return self
    def transform(self, X):
        return self._gauss_basis(X[:, :, np.newaxis],
                                 self.centers_, self.width_, axis=1)
xfit = np.linspace(0, 10, 1000)
gauss_model = make_pipeline(GaussianFeatures(20),
LinearRegression())
gauss_model.fit(x[:, np.newaxis], y)
yfit = gauss_model.predict(xfit[:, np.newaxis])
plt.figure(figsize=(4.5, 2.9),dpi=100)
plt.scatter(x, y, s=15., c='b', alpha=0.8)
plt.plot(xfit,yfit,c='b',label="Fitted with Gaussian\
basis")
plt.plot(xfit,np.sin(xfit),c='r',label="Original true\
sine function")
plt.legend(bbox_to_anchor=(1.72, 0.5),loc='center right');
#plt.xlim(0, 10)
This is indeed an excellent fit, as shown in Fig. 10.11. Let us print out
the prediction a little out of the fitted domain.
(-1.5, 3.0)
Figure 10.12: Linear regression of data-points with RBFs. Note the danger of
extrapolation.
• The labels are generated using the “true” feature and with artificially
added noise, synthesizing the measurement error. The noise generation
uses a random Gaussian distribution with zero mean and a small
percentage (say, 10%; readers may change it) of variance.
We first import necessary libraries for this task.
# The following codes are modified from these at https://
# github.com/zackchase/ mxnet-the-straight-dope/tree/master/
# chapter02_supervised- learning
# Under Apache-2.0 License.
print('y=',y[0:10])
print(y.shape)
print(X[:5])
print(X.shape)
y=
[ 6.0136046 2.2423 9.592255 3.569859 2.2450843
4.9734535 5.285636 6.4735103 10.942213 8.004087 ]
<NDArray 10 @cpu(0)>
(1000,)
[[ 0.03629482 -0.49024424]
[-0.9501793 0.03751943]
[-0.7298465 -2.0401056 ]
[ 1.4821309 1.040828 ]
[-0.45256865 0.3116043 ]]
<NDArray 5x2 @cpu(0)>
(1000, 2)
Next, let us compute the covariance matrix of X just to see the level of
independence of these two variables.
Covariance Matrix of X:
[[ 1.05065641 -0.01452139]
[-0.01452139 1.08663854]]
These two variables are quite independent, which is good news, and a
quality model can likely be built. Further, let us check how far the prediction
is from the label for the 1st sample:
y hat=
[5.9394197]
<NDArray 1 @cpu(0)> Label y 1=
[6.0136046]
<NDArray 1 @cpu(0)>
These values are quite close. We can hope for quite a fast training.
Finally, let us use matplotlib to plot all the data-points for easy viewing.
In order to use matplotlib, we need to convert data arrays in NDArray to
NumPy arrays using the .asnumpy() function.
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X[:, 0].asnumpy(),y.asnumpy(),color='r')
plt.scatter(X[:, 1].asnumpy(),y.asnumpy(),color='b')
plt.show()
Figure 10.13: Computer-generated data-points for linear regression study using NNs.
These data-points for two different features are quite distinct. Our model
shall be able to learn these features.
[[-0.7022906 0.6857335 ]
[-0.02577175 0.43850085]
[-0.38564655 0.6393674 ]
[-0.6317411 -1.3817437 ]]
<NDArray 4x2 @cpu(0)>
[0.22829296 2.823702 1.2765092 7.6593714 ]
<NDArray 4 @cpu(0)>
When we run the same code again, we shall get a different set of 4 samples,
because we have set shuffle=True.
[[-2.206704 2.3855615 ]
[ 0.7445208 -0.2434989 ]
[-1.0826081 -1.027927 ]
[ 0.09882921 -0.86426044]]
<NDArray 4x2 @cpu(0)>
[-8.413907 6.1686425 5.5676003 7.5225515]
<NDArray 4 @cpu(0)>
Let us see how many batches we have, and check that this makes sense.
250
First, initialize the training parameters for both the weights and bias.
Usually, this is done by assigning some random numbers.
w = nd.random_normal(shape=(num_inputs, num_outputs),
ctx=model_ctx)
b = nd.random_normal(shape=num_outputs, ctx=model_ctx)
params = [w, b]
print(params)
print('w=',w, '\n b=',b)
[
[[-2.3237102]
[-1.109485 ]]
<NDArray 2x1 @cpu(0)>,
[-0.48563406]
<NDArray 1 @cpu(0)>]
w=
[[-2.3237102]
[-1.109485 ]]
<NDArray 2x1 @cpu(0)>
b=
[-0.48563406]
<NDArray 1 @cpu(0)>
Allocate some memory for the gradient of loss functions with respect to
all learning parameters.
[0.]
<NDArray 1 @cpu(0)>
Following the formulas given in the formulation section, the net can be
defined simply as follows:
def net(X):
    return mx.nd.dot(X, w) + b
We use the L2 loss function that is the squared distance between the
prediction and the label value.
Let us use the stochastic gradient descent (SGD) to train the net. This
is done in mini batches. At each iteration, a batch of training samples is
randomly drawn from the dataset. We shall then compute the gradient of
the loss function with respect to all the training parameters: the weights
and biases. The gradient provides the direction for updating the training
parameters, and a learning rate lr, usually a small number, will be specified
to determine how fast the parameter should be updated at each step along
that direction.
The training is done in stages. Each stage is called an epoch, that is, one
complete pass of data batches over the entire dataset. In each pass, we iterate
through the training data, grabbing batches of examples and their
corresponding labels.
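The loop below also relies on two helpers, square_loss and SGD, whose
definitions are not shown in this excerpt. A minimal sketch, assuming the same
conventions as the earlier examples in this chapter (num_samples is likewise an
assumed name):

num_samples = X.shape[0]  # 1000 samples in this example

def square_loss(yhat, y):
    # mean squared distance between prediction and label
    return nd.mean((yhat - y.reshape(yhat.shape)) ** 2)

def SGD(params, lr):
    # plain stochastic gradient descent step on each parameter
    for param in params:
        param[:] = param - lr * param.grad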
epochs = 10
learning_rate = .01
num_batches = num_samples/batch_size
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx).reshape((-1, 1))
        with autograd.record():  # record the path for gradient
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()             # compute gradient via backward prop.
        SGD(params, learning_rate)  # perform optimization
        cumulative_loss += loss.asscalar()  # update the loss
    print(cumulative_loss / num_batches)
4.694164189037867
0.010493556420027745
0.010353110232390463
0.01032697731954977
0.010341777712805197
0.010346036787843332
0.010346328813116997
0.010331496173050254
0.01031649975432083
0.010330414514523
fg2.plot(X[:sample_size, 1].asnumpy(),
         net(X[:sample_size, :]).asnumpy(), 'or', label='Estimated')
fg2.plot(X[:sample_size, 1].asnumpy(),
         real_fn(X[:sample_size, :]).asnumpy(), '*g', label='Real')
fg2.legend()
plt.show()
learning_rate = .01
losses = []
plot(losses, X)
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx).reshape((-1, 1))
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += loss.asscalar()
plot(losses, X)
We now present a case study done by the scikit-learn team [3]. This study
uses a multi-layer perceptron network (the MLPRegressor class) to perform
nonlinear regression on the Boston housing price dataset, together with
decision tree regression [4, 5]. A code is provided to perform the regressions
and then plot the partial dependence curves. The plot_partial_dependence
function returns a PartialDependenceDisplay object that contains
the necessary attributes, which can be used for plotting without recalculating
the partial dependence.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import plot_partial_dependence
We train a decision tree (not discussed in this book) and a multi-layer
perceptron on the Boston housing price dataset. One may use help to figure
out these classes built into Sklearn; for example, use "help(MLPRegressor)"
to find out the settings of the neural network when using the MLPRegressor
class.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
tree = DecisionTreeRegressor()
mlp = make_pipeline(StandardScaler(),
MLPRegressor(hidden_layer_sizes=(100, 100), tol=1e-2,
max_iter=500, random_state=0)) # 2 hidden layers, Relu.
tree.fit(X, y)
mlp.fit(X, y); # you may remove # to view outputs.
We plot partial dependence curves for the two features "LSTAT" and "RM" for
the decision tree, using plot_partial_dependence(). The spacing of the grids
of these two plots is defined by ax.
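A sketch of the call described here (the display object name tree_disp is the
one reused by the comparison code further below; the exact arguments are our
assumption):

tree_disp = plot_partial_dependence(tree, X, ["LSTAT", "RM"])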
Figure 10.16: Results using the decision tree in Sklearn for the Boston housing price
dataset.
Next, the partial dependence curves are plotted for the MLP.
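Again a sketch of the assumed call, producing the display object mlp_disp used
below:

mlp_disp = plot_partial_dependence(mlp, X, ["LSTAT", "RM"],
                                   line_kw={"c": "red"})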
Figure 10.17: Results using the MLP in Sklearn for the Boston housing price dataset.
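The comparison below replots both display objects onto a shared pair of axes;
a sketch of the assumed axes setup (only this creation line is ours, the names
ax1 and ax2 come from the code below):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))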
tree_disp.plot(ax=[ax1, ax2], line_kw={"label": "Decision Tree"})
mlp_disp.plot(ax=[ax1, ax2], line_kw={"label": "Multi-layer Perceptron",
                                      "c": "red"})
ax1.legend()
ax2.legend();
Figure 10.18: Comparison of results obtained using the decision tree and MLP in Sklearn
for the Boston housing price dataset.
Note that the above-discussed regression models are in fact linear models,
even if we use nonlinear feature functions, such as higher-order polynomial,
many other special feature functions, and kernels in SVMs. The term feature
function used in regression is essentially the same as the basis function in
FEM. It is just a different name.
We may also note here that when a regression model uses a nonlinear
feature function, it is sometimes called nonlinear regression. This is different
from the FEM types of models, where an FEM model is still called linear
even if nonlinear basis functions are used. When we say a nonlinear FEM
model, it is truly nonlinear, because the system matrix depends on the
unknown field variables, and hence a solution to a nonlinear FEM model is
usually iterative.
In fact, one should not be confused: the regression model itself is a
linear one, even if a nonlinear feature function is used, because the system
matrix XᵀX does not depend on the regression parameters [ŵ], and [ŵ] can
be obtained in a single step with a linear equation solver. The prediction
is assumed to be a linear combination of the features weighted by the regression
parameters, as shown in the formulation above and as will also be seen explicitly
in the code written later. This seems to be just a minor terminology
problem, but it may have conceptual significance. Moreover, there are also
truly nonlinear regression models, in which the regression parameters and
some (or all) features cannot be written as a linear combination. For such truly
nonlinear regression models, the solution should be obtained via iterative or
successive fitting processes.
10.7 Conclusion
References
[1] G.R. Liu and G.Y. Zhang, Smoothed Point Interpolation Methods: G Space Theory
and Weakened Weak Forms, World Scientific, Singapore, 2013.
[2] G.R. Liu, Mesh Free Methods: Moving Beyond the Finite Element Method, Taylor
and Francis Group, New York, 2010.
[3] P. Fabian, V. Gae, G. Alexandre et al., Scikit-learn: Machine Learning in Python,
Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/
papers/v12/pedregosa11a.html.
[4] X. Wu, R.J. Kumar Vipin et al., Top 10 algorithms in data mining, Knowledge and
Information Systems, 14, 1–37, 2007.
[5] W. Pushkar and G.R. Liu, Real-time prediction of projectile penetration to laminates
by training machine learning models with finite element solver as the trainer, Defence
Technology, 17(1), 147–160, 2021. https://www.sciencedirect.com/science/article/
pii/S2214914720303275.
Chapter 11
Loss Functions and Models for Classification
Or by the xw formulation,
z = Σ_{i=0}^{p} w_i x_i = x ŵ    (11.2)
ŷ = z    (11.3)
The values of the linear prediction function are in Y1 , and are real
numbers in (−∞, ∞).
Because the value of a logistic prediction function falls in the region of (0, 1),
it is now comparable with the given labels.
A loss function is a measure of the correctness of the prediction against the true
label, which is discrete for binary classification problems. Constructing a loss
function requires careful consideration. We first define some terminology.
Case II: the label is given as y ∈ {0, 1}. In this case, the margin m_g(x) is
defined as
m_g(x) = y ŷ(x) + (1 − y)(1 − ŷ(x)) = { ŷ, when y = 1;  (1 − ŷ), when y = 0 }    (11.8)
where 1() is the indicator function. Figure 11.1 shows schematically the 0–1
loss.
The 0–1 loss function has the following properties:
1. It takes the value 1 when the margin is smaller than or equal to zero,
and is zero otherwise, where it gives the correct prediction. Minimizing
the loss thus maximizes the margin.
For a given margin function, the hinge loss function has the following form:
Figure 11.2 shows schematically the hinge loss together with the 0–1 loss.
Figure 11.2: Hinge loss and 0–1 loss functions for classification.
The hinge loss can be used to train an SVM model using a gradient-based
minimization algorithm.
For a given margin function, the logistic loss function has the following form:
Figure 11.3 shows schematically the logistic loss together with the hinge loss
and the 0–1 loss.
The logistic loss function has the following properties:
Figure 11.3: Logistic, Hinge, and 0–1 loss functions for classification.
Figure 11.4: Exponential, Logistic, Hinge, and 0–1 loss functions for classification.
Figure 11.4 shows schematically the exponential loss together with the hinge
loss and the 0–1 loss.
The exponential loss function has the following properties:
2. It is sensitive to outliers.
3. It may have higher complexity, and hence a greater need for samples,
compared to the hinge and logistic losses [1].
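The plotting code below also uses a margin grid mg and the losses zerone,
hinge, exploss, and squared, which are not defined in this excerpt. Minimal
sketches consistent with the formulas above (the names are chosen to match
the calls below):

import numpy as np

mg = np.linspace(-2., 4., 300)                # margin values
zerone = lambda m: np.where(m <= 0., 1., 0.)  # 0-1 loss
hinge = lambda m: np.maximum(0., 1. - m)      # hinge loss
exploss = lambda m: np.exp(-m)                # exponential loss
squared = lambda m: (1. - m) ** 2             # squared loss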
# the logloss
def logloss(mg):
    return np.log(1. + np.exp(-mg))
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc
plt.figure(figsize=(4.5, 2.8),dpi=100)
plt.plot(mg,zerone(mg),label = "0-1 loss") #+str(f"{L:.2f}"))
plt.plot(mg,hinge(mg),label = "Hinge loss")
plt.plot(mg,logloss(mg),label = "Logistic loss")
plt.plot(mg,exploss(mg),label = "exponential loss")
plt.plot(mg,squared(mg),label = "Squared loss")
dlt = 1.0
plt.xlabel('Margin, $m_g=y\hat{y}$')
plt.ylabel('Loss ($m_g$)')
plt.title('Loss functions for classification')
plt.grid(color='r', linestyle=':', linewidth=0.2)
plt.legend()
plt.ylim(-0.2, 5)
plt.show()
If each sample is independent of the rest, and each label is only for its own
corresponding sample, these m events are independent. Thus, the overall
probability is the product of all the probabilities for each sample. Equation
(11.14) becomes
P_ŵ(y₁|x₁) P_ŵ(y₂|x₂) ··· P_ŵ(y_m|x_m)    (11.15)
Now, our goal is to find predictions ŷ_i as close as possible to the labels
y_i, but expressed in terms of probability. We already know that ŷ_i
is a value between 0 and 1 when the sigmoid logistic function is used to
squash any given input. Moreover, because a label y_i is given as either 1 or 0
for a binary classification problem, the prediction ŷ_i shall have the following
correspondence to label y_i:
• When y_i belongs to the positive class (i.e., y_i = 1), the probability of
y_i being observed for the given ith sample should be set as ŷ_i. This is
because maximizing ŷ_i over the training parameters ŵ drives the prediction
toward 1, giving the correct correspondence.
• On the other hand, when y_i belongs to the negative class (i.e., y_i = 0), the
probability of y_i being observed for the given ith sample should be set as
1 − ŷ_i. This is because maximizing 1 − ŷ_i over the training parameters
ŵ drives ŷ_i toward 0. We again obtain the correct
correspondence.
The above setting nicely covers both possible situations for our binary
classification problem. Hence, for the ith sample, this setting can be
expressed in the following formula:
P_ŵ(y_i|x_i) = { ŷ_i, if y_i = 1;  1 − ŷ_i, if y_i = 0 }    (11.17)
Finally, taking the log of the probability above for all samples, our loss function
can be expressed as
L(y, ŷ) = − Σ_{i=1}^{m} (y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)),    (11.19)
where the minus sign is added because the loss function will be minimized,
rather than maximized, in the training process. Note also that the log function
turns the product of probabilities into a sum, which is much easier to deal with
in gradient computations. It is clear now that the above loss function is exactly
the measure of the binary cross-entropy of the distribution of the label with
respect to that of the prediction, as given in the last section of Chapter 4.
This loss function is called "log loss", "binary cross-entropy", or "negative
log likelihood". It is a special case of cross-entropy, which can be used for
multiclass classification problems.
11.2.8 Remarks
Note that the loss functions discussed above are basic ones. There are
also problem-specific loss functions. For example, when we deal with
k-classification problems, the cross-entropy loss may be a better choice (see the
next chapter). Also, when designing a facial recognition neural network, we
may use a special loss function called the triplet loss.
With all the concepts set properly, let us build a feedforward net to perform
binary classification. The neural network is shown schematically in Fig. 11.7.
This simple neural network has just one input and one output layer. It
uses a combination of affine mapping, a logistic prediction function, and the binary
cross-entropy loss. One may choose other combinations, depending
on the application problem. The training of the net requires a backward
process to update the learning parameters of weights and biases.
Figure 11.7: A simple neural network with just one input and one output layer for binary
classification: affine mapping, logistic prediction function, and binary cross-entropy loss.
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
data_ctx = mx.cpu()
# Change this to mx.gpu(0) to train on an NVIDIA GPU
model_ctx = mx.cpu()
with open("datasets/a1a.train") as f:
train_raw = f.read()
with open("datasets/a1a.test") as f:
test_raw = f.read()
print(len(test_raw))
print(test_raw[0:216])
114816
-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1
-1 3:1 6:1 17:1 27:1 35:1 40:1 57:1 63:1 69:1 73:1 74:1 76:1 81:1 103:1
-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1
The first three lines of data are printed above. The first entry in each
row is the label value. The following tokens are the indexes of the non-zero
features; the trailing "1" in each token is redundant. Let us use the following
code to process the dataset.
def process_data(raw_data):
    train_lines = raw_data.splitlines()
    num_examples = len(train_lines)
    num_features = 123
    X = nd.zeros((num_examples, num_features), ctx=data_ctx)
    Y = nd.zeros((num_examples, 1), ctx=data_ctx)
    for i, line in enumerate(train_lines):
        tokens = line.split()
        label = (int(tokens[0]) + 1) / 2  # change label {-1,1} to {0,1}
        Y[i] = label
        for token in tokens[1:]:
            index = int(token[:-2]) - 1   # strip the trailing ":1"
            X[i, index] = 1
    return X, Y
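The arrays used below are presumably obtained by applying this function to the
two raw strings (a sketch):

Xtrain, Ytrain = process_data(train_raw)
Xtest, Ytest = process_data(test_raw)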
print('Xtrain:',Xtrain.shape)
print('Ytrain:',Ytrain.shape)
num_sample = len(Xtrain[0:])
print('num_sample=',num_sample)
print('Xtest:',Xtest.shape)
print('Ytest:',Ytest.shape)
The data value of each sample is given in either 1.0 or 0.0, corresponding
to “yes” or “no”. The same is true also for these labels.
Check the ratio of positive samples in the training and test datasets. This
shall give us an indication of the overall distribution of the training and test
data.
print(nd.sum(Ytrain)/len(Ytrain))
print(nd.sum(Ytest)/len(Ytest))
[0.24053495]
<NDArray 1 @cpu(0)>
[0.24610592]
<NDArray 1 @cpu(0)>
batch_size = 64
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset
(Xtrain, Ytrain), batch_size=batch_size, shuffle=True)
test_data = gluon.data.DataLoader(gluon.data.ArrayDataset
(Xtest, Ytest), batch_size=batch_size, shuffle=True)
# An alternative squared loss on the margin that was also tested
# (kept here as a fragment from the original code):
# return nd.nansum((1-y*yhat-(1-y)*(1-yhat))**2)  # works best: 0.849
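The training loop below also needs the model, the loss, and a few settings that
this excerpt does not show. A minimal sketch, assuming a single-neuron gluon
Dense layer and a binary cross-entropy loss with a logistic squashing; the
initialization values are our assumptions, and epochs = 100 matches the
printed output further below:

net = gluon.nn.Dense(1)  # one output neuron: z = x.w + b
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)

def logistic(z):
    return 1. / (1. + nd.exp(-z))

def log_loss(output, y):
    yhat = logistic(output)  # squash logits into (0, 1)
    return -nd.nansum(y*nd.log(yhat) + (1.-y)*nd.log(1.-yhat))

epochs = 100
loss_sequence = []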
import numpy as np
# Instantiate an optimizer
# Choices on optimizer and the learning rate are made here.
trainer = gluon.Trainer(net.collect_params(),'sgd',
{'learning_rate':0.01})
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx)
        with autograd.record():
            output = net(data)  # output = z, net = x.w + b
            loss = log_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
        cumulative_loss += nd.sum(loss).asscalar()
    if e % 20 == 0 or e == epochs-1:
        print("Epoch %s, loss: %s" % (e, cumulative_loss))
    loss_sequence.append(cumulative_loss)
num sample= 30956 epochs= 100 max epochs= 484 batch size= 64
The initial loss in the order of 21457
Epoch 0, loss: 7092.165759086609
num_correct = 0.0
num_total = len(Xtrain)
for i, (data, label) in enumerate(train_data):
    data = data.as_in_context(model_ctx)
    label = label.as_in_context(model_ctx)
    output = net(data)
    prediction = (nd.sign(output) + 1) / 2
    num_correct += nd.sum(prediction == label)
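The accuracy itself can then be printed as (a sketch):

print("Training accuracy: %.4f" % (num_correct.asscalar() / num_total))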
The accuracy looks good. It should be good, because we have used the
data to train the net. Therefore, this is only an indication that the training
itself is well done. Let us finally conduct a more objective assessment, using
the test data, which gives the indication of the accuracy and usefulness of
the trained model.
num_correct = 0.0
num_total = len(Xtest)
for i, (data, label) in enumerate(test_data):
    data = data.as_in_context(model_ctx)
    label = label.as_in_context(model_ctx)
    output = net(data)
    prediction = (nd.sign(output) + 1) / 2
    num_correct += nd.sum(prediction == label)
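And likewise for the test set (a sketch):

print("Test accuracy: %.4f" % (num_correct.asscalar() / num_total))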
We now see the accuracy of the trained model after the training is done
for the specified epochs, using both the training and the test datasets. We
have also run the same code using 483 epochs (the max epochs), which allows
the model to see almost all the training samples, and the accuracy can reach
about 84% for the test dataset.
This model is simple, but it shall give one good experience in training
neural networks for similar types of problems.
We now introduce a case study done by the scikit-learn team [2]. It uses
a total of 10 classifiers in scikit-learn on three synthetic datasets, and plots
decision boundaries together for easy comparison.
%matplotlib inline
print(__doc__)
# Code source: Gaël Varoquaux
# Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
X, y = make_classification(n_features=2, n_redundant=0,
n_informative=2,random_state=1,n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
# (fragment from inside the per-dataset, per-classifier plotting loops)
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(xx.max()-.3, yy.min()+.3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1
plt.tight_layout()
plt.show()
Figure 11.8: Comparison of the classification results using various classifiers. A study by
Sklearn team.
Three randomly generated datasets are used in this study, and the
results are plotted in three rows in Fig. 11.8. The methods used are
Nearest Neighbors, Linear SVM, RBF SVM, Gaussian Process, Decision
Tree, Random Forest, MLP Neural Network, AdaBoost, Gaussian Naive
Bayes, and Quadratic Discriminant Analysis (QDA). The training data-
points are in solid colors, and the testing data-points are semi-transparent.
The number in the lower-right corner is the classification accuracy on
the test datasets. Although we did not discuss some of these classification
methods, the decision boundaries obtained by these different methods give
some intuitive indication of how they work. Readers may read the
documentation of Sklearn for more details.
References
[1] R. Lorenzo, De Vito Ernesto, C. Andrea et al., Are loss functions all the same?,
Neural Computation, 16(5), 1063–1076, May 2004. https://doi.org/10.1162/0899766
04773135104.
[2] P. Fabian, V. Gae, G. Alexandre et al., Scikit-learn: Machine learning in Python,
Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/
papers/v12/pedregosa11a.html.
[3] H. Drucker, Improving regressors using boosting techniques, Proc. 14th International
Conference on Machine Learning, 1997.
[4] W. Pushkar and G.R. Liu, Real-time prediction of projectile penetration to laminates
by training machine learning models with finite element solver as the trainer, Defence
Technology, 17(1), 147–160, 2021. https://www.sciencedirect.com/science/article/
pii/S2214914720303275.
Chapter 12
Multiclass Classification
For binary logistic regression problems, we often use the following activation
function for the neuron at the final layer to predict the output:
ŷ = σ(z) = 1/(1 + e^{−z}) = 1/(1 + e^{−(x·w+b)}) = 1/(1 + e^{−(x·ŵ)})    (12.1)
where σ stands for the sigmoid function. It squashes an arbitrarily real
number into a number between 0 and 1, which becomes a type of probability
output.
For k-classification problems, we create a neural network with k neurons
at the output layer. If there are p inputs x_i (i = 1, 2, . . . , p), the input layer
shall have p neurons. The structure of the simplest such neural network is
shown in Fig. 12.1.
softmax(z) = e^z / Σ_{j=1}^{k} e^{z_j}    (12.2)
z_{1×k} = x_{1×p} W_{p×k} + b_{1×k} = x_{1×(p+1)} Ŵ_{(p+1)×k}    (12.3)
z_j = x Ŵ[j]    (12.4)
ŷ = softmax(x Ŵ) = e^{(x Ŵ)} / Σ_{j=1}^{k} e^{(x Ŵ[j])}    (12.5)
L(y, ŷ) = −(1/m) Σ_{s=1}^{m} y_s · log ŷ_s    (12.6)
where s stands for the sth sample in the dataset. The above loss function
essentially cares only about the loss related to the correct label. Note that
this cross-entropy loss function is still a scalar function.
Compared to the loss function used in the binary classification problems,
we can expect that the binary cross-entropy loss function is a special case
of the k-classification loss function. To prove this, we simply let k = 2 in
Eq. (12.6), which gives
L(y, ŷ) = −(1/m) Σ_{s=1}^{m} ( Σ_{j=1}^{2} y_j log ŷ_j )_s
        = −(1/m) Σ_{s=1}^{m} (y_{j=1} log ŷ_{j=1} + y_{j=2} log ŷ_{j=2})_s    (12.7)
Notice that index j is for the class, and s stands for the sth sample. Now,
in the case of binary classifications, we must have
y_{j=2} = 1 − y_{j=1},   ŷ_{j=2} = 1 − ŷ_{j=1}    (12.8)
This is true for any sample. Substituting Eq. (12.8) into (12.7), dropping
j = 1 because it is the only one left, and then moving the index s inside, we
have
L(y, ŷ) = −(1/m) Σ_{s=1}^{m} (y_s log ŷ_s + (1 − y_s) log(1 − ŷ_s)),    (12.9)
data_ctx = mx.cpu()
model_ctx = mx.cpu() # model_ctx = mx.gpu() # alternative
Each item in the train (and test) dataset is a tuple of an image paired
with a label:
print(len(mnist_train))
image, label = mnist_train[100]  # unpack the 100th data pair
print(image.shape, label)        # check the shape and label
#print(image)                    # print the image in floats
60000
(28, 28, 1) 5.0
(28, 28, 3)
We shall now visualize the image to see whether the image tallies with
the label.
Figure 12.2: A sample image of handwritten digit from the MNIST dataset.
label= 5.0
batch_size = 64
train_data=mx.gluon.data.DataLoader(mnist_train,batch_size,\
shuffle=True)
test_data=mx.gluon.data.DataLoader(mnist_test,batch_size,\
shuffle=False)
def softmax(y_linear):
    exp = nd.exp(y_linear - nd.max(y_linear, axis=1).reshape((-1, 1)))
    norms = nd.sum(exp, axis=1).reshape((-1, 1))
    return exp / norms
sample_y_linear = nd.random_normal(shape=(2,10))
sample_yhat = softmax(sample_y_linear)
print(sample_yhat)
print(nd.sum(sample_yhat, axis=1))
[1. 1.]
<NDArray 2 @cpu(0)>
def net(X):
    y_linear = nd.dot(X, W) + b
    yhat = softmax(y_linear)
    return yhat
y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] (12.10)
We can compute the accuracy of a model with its current learning parame-
ters, by finding out the ratio of the number of correct answers with the total
number of tests. We use the following evaluation loop to do this:
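The loop itself is not shown in this excerpt; a minimal sketch, assuming the
argmax of the softmax output is compared against the label:

def evaluate_accuracy(data_iterator, net):
    numerator, denominator = 0., 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)  # most probable class
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()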
For our randomly initialized model, roughly one tenth of all examples
may belong to each of the 10 classes. We thus expect an accuracy of 0.1.
Now, we are ready to train the model.
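The training loop further below also relies on the weight matrix W, the bias b,
the loss, the update rule, and two hyperparameters that this excerpt does not
show. A minimal sketch under stated assumptions: shapes follow the 784-pixel
inputs and 10 classes, epochs = 5 matches the summary of this study given
below, and the learning rate is our guess:

import time

num_inputs, num_outputs = 784, 10
W = nd.random_normal(shape=(num_inputs, num_outputs))
b = nd.random_normal(shape=num_outputs)
for param in [W, b]:
    param.attach_grad()  # allocate gradient buffers

epochs, learning_rate = 5, .005  # learning rate is an assumed value

def cross_entropy(yhat, y):
    # y is one-hot; the small constant guards against log(0)
    return -nd.sum(y * nd.log(yhat + 1e-6), axis=1)

def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad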
params = [W, b]
print(W.shape,b.shape)
start_t = time.process_time()
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)
            loss = cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += nd.sum(loss).asscalar()
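The prediction helper model_predict used below is likewise not defined in this
excerpt; a minimal sketch:

def model_predict(net, data):
    output = net(data)
    return nd.argmax(output, axis=1)  # predicted digit per image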
plt.figure(figsize=(15, 60))
sample_data = mx.gluon.data.DataLoader(mnist_test, 10, shuffle=True)
start_t = time.process_time()
for i, (data, label) in enumerate(sample_data):
    data = data.as_in_context(model_ctx)
    print(data.shape)
    im = nd.transpose(data, (1, 0, 2, 3))
    im = nd.reshape(im, (28, 10*28, 1))
    imtiles = nd.tile(im, (1, 1, 3))  # tile to 3 channels for display
    plt.imshow(imtiles.asnumpy())
    plt.show()
    pred = model_predict(net, data.reshape((-1, 784)))
    print('model predictions are:', pred)
    break
Figure 12.3: Predicted digits from the trained one-layer NN, using the testing dataset
of MNIST.
Summary of this study: The accuracy is about 90% after five epochs of
training, and the time taken on the author’s laptop for the training is about
500 s. The time taken for the prediction of 10 digits is only about 0.55 s.
When using 10 epochs, the test accuracy increases to 91%, and the time
taken for the training is 1,100 s (about double). When using 50 epochs, the
test accuracy increases to 92%. When using 100 epochs, the test accuracy
increases to 92.3%. Further training using this model may not make much
difference. To improve further, one may need to increase the learning ability
of the net to capture more nonlinear features for the images of handwritten
digits. Based on the Universal Prediction Theory, this can be done using
more hidden layers known as the multilayer perceptron (MLP, see the next
chapter).
For comparison, let us do the same digit prediction using the gradient
boosting classifier, one of the powerful techniques for classification
problems. Readers may read an online article by Harshdeep Singh at
towardsdatascience.com. We will not discuss this technique in detail, but will
simply use the model and codes provided by scikit-learn (or Sklearn) [1] for
this comparison study. We slightly modified the code to allow readers to view
more detailed differences in the results for different parameter settings.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline
plt.show()
print('Predicted digits for the first 10 in the test set:',
      predicted_digits[0:10])
First 10 true digits in the test set: [0 6 5 6 2 8 9 5 0 9]
Figure 12.5: Predicted handwritten digits using the Sklearn random forest and the
testing dataset of MNIST.
• When n_estimators = 50 and min_samples_split = 2 are used, the accuracy is 96.11%.
• When n_estimators = 50 and min_samples_split = 2 are used, the accuracy is 96.81%.
• When n_estimators = 100 and min_samples_split = 2 are used (this is the default setting), the accuracy is 96.96%.
• When n_estimators = 100 and min_samples_split = 4 are used, the accuracy is 96.91%.
• When n_estimators = 200 and min_samples_split = 2 are used, the accuracy is 97.15%.
• When n_estimators = 400 and min_samples_split = 2 are used, the accuracy is 97.14%.
• When n_estimators = 200 and min_samples_split = 4 are used, the accuracy is 97.01%.
• When n_estimators = 400 and min_samples_split = 4 are used, the accuracy is 97.14%.
rf = RandomForestClassifier(n_estimators=n_e, min_samples_split=m_s)
start_t = time.process_time()
rf.fit(x_train, y_train)
t_elapsed = time.process_time() - start_t
print('RandomForestClassifier, training time =', f'{t_elapsed}', 's')

ef = ExtraTreesClassifier(n_estimators=n_e, min_samples_split=m_s)
start_t = time.process_time()
ef.fit(x_train, y_train)
t_elapsed = time.process_time() - start_t
print('ExtraTreesClassifier, training time =', f'{t_elapsed}', 's')

gf = ensemble.GradientBoostingClassifier(n_estimators=n_e,
                                         min_samples_split=m_s)
start_t = time.process_time()
gf.fit(x_train, y_train)
t_elapsed = time.process_time() - start_t
print('GradientBoostingClassifier, training time =', f'{t_elapsed}', 's')

start_t = time.process_time()
rf_predicted_digits = rf.predict(x_test)
t_elapsed = time.process_time() - start_t
print('RandomForestClassifier, prediction time =', f'{t_elapsed:.5f}', 's')

start_t = time.process_time()
ef_predicted_digits = ef.predict(x_test)
t_elapsed = time.process_time() - start_t
print('ExtraTreesClassifier, prediction time =', f'{t_elapsed:.5f}', 's')

start_t = time.process_time()
gf_predicted_digits = gf.predict(x_test)
t_elapsed = time.process_time() - start_t
print('GradientBoostingClassifier, prediction time =', f'{t_elapsed:.5f}', 's')
It is clear that the prediction time grows much less than linearly with n_estimators, and all three classifiers are very fast in general.
Due to the rapid development over the past decades, many Python packages have already been developed for k-classification problems. TensorFlow is one such excellent package. Readers may take a look at the TensorFlow tutorial problem of classifying images of clothing, at https://www.tensorflow.org/tutorials/keras/classification. After going through this chapter, readers should have little trouble understanding this TensorFlow tutorial.
12.7 Remarks
Reference
[1] F. Pedregosa, G. Varoquaux, A. Gramfort et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/papers/v12/pedregosa11a.html.
Chapter 13

Multilayer Perceptron (MLP) for Regression and Classification
Figure 13.1: An MLP with multiple hidden layers forming a chain of stacked affine transformations. All the neurons are connected with weights (dense, or fully connected), and a bias is also added (not shown) to each of the neurons in all the layers (except the input layer). Here, the cross-entropy loss is used for classification, but other loss functions discussed in Chapter 12 may also be used. When the MLP is used for regression, the L2 loss or any other loss function discussed in Chapter 10 may be used.
At the input layer, we have an input vector of training samples with p features:

where z^(N_L) is an affine-transformed vector with k components z_i^(N_L) (i = 1, 2, . . . , k), and the exponentiation is computed element-wise. The dimensions of all these matrices can be seen in the following form:
For an MLP network, x^(N_L−1) is the output from the last hidden layer and can be written as a vector, where the subscript n_h is the number of neurons in the last hidden layer of the MLP.
For an MLP, in the feed-forward process, the values that the neurons receive and produce are

x^(0) = x = [x1, x2, . . . , xp]                                  features at the input layer
x^(1) = φ^(1)(x^(0) W^(1) + b^(1)) = φ^(1)(z^(1))                 features at the 1st hidden layer
x^(2) = φ^(2)(x^(1) W^(2) + b^(2)) = φ^(2)(z^(2))                 features at the 2nd hidden layer
. . .
x^(i) = φ^(i)(x^(i−1) W^(i) + b^(i)) = φ^(i)(z^(i))               features at the ith hidden layer     (13.5)
. . .
ŷ = x^(N_L) = φ^(N_L)(x^(N_L−1) W^(N_L) + b^(N_L)) = φ^(N_L)(z^(N_L))   prediction at the output layer
Here, the number in parentheses in the superscript stands for the layer number. The arguments of the activation functions are the affine transformation functions z^(i) (i = 1, 2, . . . , N_L), which can change from layer to layer and bring in information from the previous layers. After being fed to the activation function, they become new independent "features" for the next layer. The dimension of the new feature space changes accordingly.
The above chain equations for the network flow can also be expressed as

ŷ = φ^(N_L)(· · · φ^(2)(φ^(1)(x W^(1) + b^(1)) W^(2) + b^(2)) · · · W^(N_L) + b^(N_L))    (13.6)
We now see explicitly that our MLP model is indeed a giant function with k components in the feature space X^p, controlled (parameterized) by the training parameters in W^P, as mentioned in Section 1.5.5. When these parameters W^(i) (i = 1, 2, . . . , N_L − 1) are tuned, one gets k giant functions over the feature space X^p. On the other hand, this set of k giant functions can also be viewed as differentiable functions of these parameters for any given data-point in the dataset, which can be used to form a differentiable loss function using the corresponding k labels given in the dataset. The training is to minimize such a loss function over all the data-points in the dataset, by updating the training parameters to become the minimizers.
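To see the chain in Eq. (13.6) at work, here is a minimal NumPy sketch of the feed-forward pass; the layer sizes and the tanh hidden activations are illustrative assumptions only:

import numpy as np

def forward(x, Ws, bs, phis):
    """Feed-forward chain of Eq. (13.6): repeated affine map plus activation."""
    for W, b, phi in zip(Ws, bs, phis):
        x = phi(x @ W + b)   # z = xW + b, then element-wise activation
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]   # p = 4 input features, two hidden layers, k = 3 outputs
Ws = [0.1*rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
phis = [np.tanh, np.tanh, lambda z: z]   # linear activation at the output layer
x = rng.standard_normal((2, 4))          # two sample points with p = 4 features
print(forward(x, Ws, bs, phis).shape)    # -> (2, 3)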
This feed-forward network creates a chain "reaction", and this flowing structure is exploited when computing the gradients in the so-called backward propagation (backprop, or BP) [10], using the chain rule of differentiation. The backprop process can be achieved via the autograd discussed in detail in Chapter 8. Each layer in an MLP has its own set of learning parameters (in the weight matrix and bias vector). As the layer number grows, the total number of learning parameters can grow fast. Notice also that the activation function used in each layer can in theory be different, although in many current practices the hidden layers often use the same type of activation function, typically ReLU or tanh.
For the last output layer, the activation function is often chosen differently
based on the application problem the network is built for. For many
engineering problems of the regression type, we may simply use a linear
activation function, meaning that we do nothing for the last layer, and output
whatever is received.
ŷ = x^(N_L) = z^(N_L) = x^(N_L−1) W^(N_L) + b^(N_L) = x̄^(N_L−1) Ŵ^(N_L)    (13.7)
In this case, the mean square error (or L2) loss function is often used and minimized.

For classification problems, we may use the sigmoid or tanh function for binary classification. In general, one can use the softmax as the prediction function, as given in Eq. (13.2). Note that we proved in the previous chapter that a binary classification problem can be treated as a special case of the k-classification problem.
x^(1) = φ^(1)(x^(0) W̄^(1)) ∈ X^(n1)      at the 1st hidden layer
x^(2) = φ^(2)(x^(1) W̄^(2)) ∈ X^(n2)      at the 2nd hidden layer
. . .                                                               (13.9)
x^(j) = φ^(j)(x^(j−1) W̄^(j)) ∈ X^(nj)    at the jth hidden layer
. . .
φ^(j)(x̄^(j−1) W̄^(j)) = φ^(j)( [1, x^(j−1)] [[1, b^(j)], [0, W^(j)]] )
                      = φ^(j)( [1, x^(j−1) W^(j) + b^(j)] )           (13.10)
                      = [1, φ^(j)(x^(j−1) W^(j) + b^(j))]
                      = [1, x^(j)] = x̄^(j)
In the third line of Eq. (13.10), φ^(j) is moved inside the array, because the constant 1 is assumed not to be subject to activation and the φ^(j) operation is element-wise. It is clear that Eq. (13.10) achieves the same result as Eq. (13.5), but with an extra constant component 1 in the array. The constant 1 is automatically regenerated when moving to the next layer. This is because an affine transformation stays in the affine space, as demonstrated in Chapter 5. The affine transformation weight matrix W̄^(j) performs the transformation from one affine space to another. Therefore, when our xw formulation is used, there is no need to inject the bias b into each neuron, which is one major difference from the xw+b formulation.
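A small NumPy check of Eq. (13.10) (a sketch with arbitrary sizes, not the book's code): building the affine transformation weight matrix with the constants in its first column reproduces the xw+b result and regenerates the constant 1 automatically.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))    # ordinary weight matrix of a layer
b = rng.standard_normal(2)         # bias vector
x = rng.standard_normal(3)         # features from the previous layer

# Affine transformation weight matrix: first column holds the constants
Wbar = np.zeros((4, 3))
Wbar[0, 0] = 1.0                   # regenerates the constant 1
Wbar[0, 1:] = b                    # the bias moves into the first row
Wbar[1:, 1:] = W                   # ordinary weights
xbar = np.concatenate(([1.0], x))  # augmented feature array [1, x]

z = xbar @ Wbar
print(np.isclose(z[0], 1.0))               # True: the constant 1 is regenerated
print(np.allclose(z[1:], x @ W + b))       # True: same result as xw+b
xbar_next = np.concatenate(([1.0], np.tanh(z[1:])))  # activation skips the constant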
Note also that at the output layer, the affine transformation ends and we use only Ŵ^(N_L), as discussed in Chapter 5, producing predictions naturally back in a vector label space Y^(k), where a loss function can now be constructed for "terminal" control, using the given labels.
Using the affine transformation matrices, Eq. (13.6) becomes, neatly,

ŷ = φ^(N_L)(· · · φ^(2)(φ^(1)(x̄ W̄^(1)) W̄^(2)) · · · Ŵ^(N_L))    (13.11)
Assuming the total number of neurons in the jth layer is n_j, the dimension of the affine transformation weight matrix W̄^(j) is n_(j−1) × n_j, which offers a natural match in matrix computations for chained ATAs.
The reason for subtracting 1 in the parentheses is that the first column of the affine transformation weight matrix W̄^(j) contains only constants (1 or 0) that are not trainable, as shown in Eq. (13.8). Equation (13.12) is an alternative to Eq. (5.32), which is based on the xw+b formulation. This number is usually reported in the summary when an MLP model is created in a module. The vector of trainable parameters ŵ is in the space W^P, as discussed in Section 1.5.5.
Figure 13.2: An MLP with multiple hidden layers forming a chain of stacked affine transformations for an n0 → k mapping. When the affine transformation weight matrices are used, the constant feature 1 in the top neuron of each layer is automatically generated. The network flows naturally without injection of biases, and the dimensions of all the weight matrices between layers match naturally. Here, the cross-entropy loss is used for classification, but other loss functions discussed in Chapter 12 may also be used. When the MLP is used for regression, the L2 loss or any other loss function discussed in Chapter 10 may be used.
A question often asked in the machine learning community concerns the relationship between the number of neurons/layers in an NN and the number of samples (or data-points) in the dataset needed to train the NN. However, no concrete answer has been available until now. The author searched for a convincing answer but failed. Over the past years, the author has studied various NNs, which has led to the Neurons-Samples Theory [13] presented in this section.
The P ∗ -number increases with the number of layers and the number of
neurons in the previous layers. In other words, the P ∗ -number is also the
total number of neurons in the MLP, including neurons in the input layer
but excluding the neurons in the output layer where affine transformation is
no longer needed.
Deriving Eq. (13.14) requires a careful analysis and a deep understanding of the relationship between the pseudo-dimension n_(j−1) of the affine space, the dataset X^(j−1) in the affine space, and its corresponding affine transformation weight matrix W̄^(j) in the jth layer.
Optimal learning parameters in a layer: Consider a given MLP, from which we isolate an n_(j−1) → 1 neural network (an ATU, discussed in Section 5.2) with an interlayer input of n_(j−1) neurons and a single output neuron at the jth layer, as shown in Fig. 5.2. The single output neuron is the ith neuron in the layer. Using Eq. (9.12), the optimal solution for the learning parameters ŵ*_i for that neuron can be written as follows.
[X^(j−1)]ᵀ X^(j−1) ŵ*_i = [X^(j−1)]ᵀ y_i    (13.15)
where X^(j−1) denotes the interlayer dataset with m data-points in the affine space, the original dataset at the input layer being X^(0), and y_i is an assumed label vector corresponding to the ith neuron in the layer; it is a vector over these m data-points in the dataset. We do not usually know the content of y_i for these hidden layers. Fortunately, we do not need to know, because its content is not relevant in this analysis.
Now, for the (i + 1)th neuron in the same jth layer, we shall have

[X^(j−1)]ᵀ X^(j−1) ŵ*_(i+1) = [X^(j−1)]ᵀ y_(i+1)    (13.16)
than the P*-number, which is the sum of the dimensions of all the affine spaces used in layers 1 ∼ N_L:

m ≥ P*    (13.17)
We mention that Eq. (13.14) is the most important and essential equation for estimating the minimum number of data-points for training an ML model based on affine transformations, including the standard MLP, RNN, and CNN.

Note that in the openly available ML modules, the learning parameters used in an MLP are the total number P given in Eq. (13.12). In all these MLP or deepnet algorithms, the P parameters are indeed allowed to vary independently. However, because of the special construction of the MLP, ŵ_i for all the neurons in the jth layer lives only in the same hypothesis space W^(n_(j−1)), which is only a small subspace of W^(n_(j−1)×(n_j−1)), as discussed in this section. Therefore, P* given by Eq. (13.14), and not Eq. (13.12), relates directly to the number of samples in the dataset. The P*-number is the lower bound of m.
For hidden layers in an MLP, one has to use nonlinear activation functions to take advantage of the multilayer effects. If the whole network used only linear activation functions, it would be equivalent to a simple one-layer network, because the output of a hidden layer would be linearly dependent on that of the previous layer. This can be seen more clearly in the derivation below, using the xw formulation:
ŷ = (· · · ((x̄ W̄^(1)) W̄^(2)) · · ·) Ŵ^(N_L)
  = x̄ W̄^(1) W̄^(2) · · · Ŵ^(N_L) = x̄ W_combined    (13.19)
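A quick numerical confirmation of this collapse (a sketch; the layer sizes are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 4))
W1, W2, W3 = (rng.standard_normal(s) for s in [(4, 6), (6, 6), (6, 3)])

y_stacked = ((x @ W1) @ W2) @ W3               # three "linear layers" in a chain
W_combined = W1 @ W2 @ W3                      # one equivalent single layer
print(np.allclose(y_stacked, x @ W_combined))  # True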
Let us now build an MLP with two hidden layers and one output layer for
the same task of identification of handwritten digits. We use the code (with
slight modifications) provided at mxnet-the-straight-dope under the Apache-
2.0 License, which allows us to go through the process step by step. We will
use the same MNIST dataset that is familiar to us.
train_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(
    train=True, transform=transform), batch_size, shuffle=True)
test_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(
    train=False, transform=transform), batch_size, shuffle=False)
ŷ_j = e^(z_j) / Σ_(i=1)^(k) e^(z_i)    (13.22)
Note that e^(z_i) is still there in the last term. We next compute the last term using

log( Σ_(i=1)^(k) e^(z_i) ) = z* + log( 1 + Σ_(i=1, i≠i*)^(k) e^(z_i − z*) )    (13.24)

where we simply factor out e^(z*) from the sum, with z* = max(z_1, z_2, . . . , z_k) and i* being the index of z*.
All the numbers involved in the computation are now bounded. In these computations, we need only z as input. This formula is known as the softmax cross-entropy, because the ideas of both softmax and log cross-entropy are used. Using mxnet, it is defined as follows:
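(A sketch consistent with the mxnet-the-straight-dope implementation this chapter follows; nd.log_softmax applies the stable trick of Eq. (13.24) internally.)

def softmax_cross_entropy(yhat_linear, y):
    # y is the one-hot label; log_softmax is numerically stable
    return -nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)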
def relu(z):
    return nd.maximum(z, nd.zeros_like(z))

def net(X):
    h1_linear = nd.dot(X, W1) + b1     # 1st hidden layer
    h1 = relu(h1_linear)               # Relu activation
    h2_linear = nd.dot(h1, W2) + b2    # 2nd hidden layer
    h2 = relu(h2_linear)
    yhat_linear = nd.dot(h2, W3) + b3  # output layer
    return yhat_linear
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx).reshape((-1, num_inputs))
        label = label.as_in_context(model_ctx)
        label_one_hot = nd.one_hot(label, num_outputs)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += nd.sum(loss).asscalar()
We use some randomly selected data-points from the test set to perform
predictions.
%matplotlib inline
import matplotlib.pyplot as plt

# Define the function to do prediction
def model_predict(net, data):
    output = net(data)
    return nd.argmax(output, axis=1)

samples = 10
mnist_test = mx.gluon.data.vision.MNIST(train=False, transform=transform)

# Sample randomly "samples" data-points from the test set
sample_data = mx.gluon.data.DataLoader(mnist_test, samples, shuffle=True)

plt.figure(figsize=(10, 60))
for i, (data, label) in enumerate(sample_data):
    data = data.as_in_context(model_ctx)
    im = nd.transpose(data, (1, 0, 2, 3))
    im = nd.reshape(im, (28, samples*28, 1))
    imtiles = nd.tile(im, (1, 1, 3))
    plt.imshow(imtiles.asnumpy())
    plt.show()
    pred = model_predict(net, data.reshape((-1, 784)))
    print('model predictions are:', pred)
    print('true labels :', label)
    break
Figure 13.3: Predicted digits using the 3-layer MLP on the testing dataset of MNIST.
One can also try predictions using the training dataset. It should reflect
how well the model is trained.
mnist_train = mx.gluon.data.vision.MNIST(train=True, transform=transform)

# Sample randomly "samples" data-points from the training set
sample_data = mx.gluon.data.DataLoader(mnist_train, samples, shuffle=True)

plt.figure(figsize=(10, 60))
for i, (data, label) in enumerate(sample_data):
    data = data.as_in_context(model_ctx)
    im = nd.transpose(data, (1, 0, 2, 3))
    im = nd.reshape(im, (28, samples*28, 1))
    imtiles = nd.tile(im, (1, 1, 3))
    plt.imshow(imtiles.asnumpy())
    plt.show()
    pred = model_predict(net, data.reshape((-1, 784)))
    print('model predictions are:', pred)
    print('true labels :', label)
    break
Figure 13.4: Predicted digits using the 3-layer MLP on the training dataset of MNIST,
just to see how good the training is.
13.6.9 Remarks
With two hidden layers (three layers with weight matrices), each hidden layer containing 256 neurons, the MLP model achieved over 97% accuracy, about a 6% improvement over the one-layer net used in the previous chapter.
We now present a case study done by the scikit-learn team [12] on visualization of MLP weights trained on the MNIST dataset. To have the model run faster, we use only one hidden layer and train it for only a small number of iterations.
%matplotlib inline
from sklearn.datasets import fetch_openml

# Load data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
X = X / 255.   # rescale the data

# use the traditional train/test split
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
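The classifier mlp can be constructed as in the scikit-learn example; the following is a sketch whose hyperparameter values are illustrative assumptions, apart from the 50-unit hidden layer visible in the weight shapes printed below:

from sklearn.neural_network import MLPClassifier

# One hidden layer with 50 neurons, trained for only a few iterations
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, random_state=1,
                    learning_rate_init=0.1)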
start_t = time.process_time()
mlp.fit(X_train, y_train)
t_elapsed = time.process_time() - start_t   # time the training only
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
print('One-layer neural network, training time=', f'{t_elapsed}', 's')
#plt.figure(figsize=(50, 100))
fig, axes = plt.subplots(5, 10, figsize=(20, 10))
# use global min/max to ensure all weights are shown on the same scale
print(mlp.coefs_[0].shape, mlp.coefs_[1].shape)
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray,
               vmin=.5 * vmin, vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()
(784, 50) (50, 10)
Figure 13.5: Images of the weight matrix in the layer of the one-layer MLP trained on
the MNIST dataset.
Partial dependence plots show the dependence between the target function and a set of "target" features, marginalizing over the values of all other features (the complement features). For possible visualization of the results, the number of target features should be one or two. We thus choose the important target features to plot.
The study uses the California housing dataset [3]. The dataset has a total of 8 features, and 4 of them are found to be relatively more influential on the house price. We will build MLP models to reveal these influences. For comparison purposes, we will also build a gradient boosting regressor model to do the same study.
There are a total of eight features, which means that the regression is over an eight-dimensional space, and hence it is not possible to view all of them in such a high-dimensional space. We thus plot the price-dependence curves one by one, or contour plots two by two. The target features for the plots are
the 4 relatively important features: median income (MedInc), average occupants per household (AveOccup), median house age (HouseAge), and average rooms per household (AveRooms). We have also added two less important features, "AveBedrms" and "Population", to the plots for comparison purposes.
Note that this tabular dataset has very different value ranges for these
features. Because neural networks can be sensitive to features with varying
scales, preprocessing these numeric features is critical.
print("Training MLPRegressor...")
tic = time ()
# a pipeline to scale the numerical input
# features and tuned the neural network size and learning rate.
mlp = make_pipeline(QuantileTransformer(),
MLPRegressor(hidden_layer_sizes=(50, 50),
learning_rate_init=0.01, early_stopping=True))
#configure the MLP
mlp.fit(X_train, y_train) #Train the MLP
print("Done in {:.3f}s".format(time() - tic))
print("Test R2 score: {:.2f}".format(mlp.score(X_test,
y_test))) # Check the performance
Training MLPRegressor...
Done in 20.779s
Test R2 score: 0.80
One can increase the complexity of an MLP for stronger learning ability. However, to avoid overfitting, the complexity of the MLP should be in accordance with the number of data-points in the training set.
tic = time()
plot_partial_dependence(mlp, X_train, features, n_jobs=3,
                        grid_resolution=20)
print("done in {:.3f}s".format(time() - tic))
fig = plt.gcf()
fig.suptitle('Partial dependence of house value on non-location features\n'
             'for the California housing dataset, with MLPRegressor')
fig.subplots_adjust(wspace=0.4, hspace=0.4)
Figure 13.6: Results on price-feature dependence obtained using the MLP regressor in
Sklearn for the California housing dataset.
The tick marks on the x-axis in the above plots represent the deciles of
the feature values in the dataset. We can observe the following findings:
The contour plots show the interactions of two target features on the
house price. For example, for average occupancy greater than two, the house
price is nearly independent of the house age. For average occupancy less than
two, there is some dependence.
print("Training GradientBoostingRegressor...")
tic = time()
gbr = HistGradientBoostingRegressor()
gbr.fit(X_train, y_train)
print("done in {:.3f}s".format(time() - tic))
print("Test R2 score: {:.2f}".format(gbr.score(X_test, y_test)))
Training GradientBoostingRegressor...
done in 1.600s
Test R2 score: 0.85
For this tabular dataset, the Gradient Boosting Regressor is found both
significantly faster to train and more accurate on the test dataset, compared
to the MLP. It is also relatively easier to tune the hyperparameters, and the
defaults usually work well.
Figure 13.7: Results on price-feature dependence obtained using the gradient boosting
regressor in Sklearn for the California housing dataset.
Figure 13.8: Results using the gradient boosting regressor in Sklearn for the California
housing dataset.
Figure 13.9: Results using the MLP in Sklearn for the California housing dataset.
print("Training DecisionTreeRegressor...")
tic = time()
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
print("done in {:.3f}s".format(time()- tic))
print("Test R2 score: {:.2f}".format(tree.score(X_test, y_test)))
Training DecisionTreeRegressor...
done in 0.243s
Test R2 score: 0.61
Figure 13.10: Results using the decision tree regressor in Sklearn for the California
housing dataset.
Figure 13.11: Comparison of results using three regressors in Sklearn for the California
housing dataset.
Figure 13.11 shows that the MLP prediction curves are much smoother than those of the gradient boosting and decision tree regressors.

Note that all the test results depend heavily on the settings of the hyperparameters used to set up the models, on the datasets, on the computer system, and even on when the tests are run. The settings used in the above tests may not be fair to some models. Thus, the test results should be treated as indicative rather than definitive.
References
[1] D.E. Rumelhart and J.L. McClelland, Learning internal representations by error
propagation, in Parallel Distributed Processing: Explorations in the Microstructure
of Cognition: Foundations, pp. 318–362, MIT Press, Cambridge, MA, 1987.
[2] J. Orbach, Principles of neurodynamics: Perceptrons and the theory of brain
mechanisms, Archives of General Psychiatry, 7(3), 218–219, 09 1962. https://doi.org/
10.1001/archpsyc.1962.01720030064010.
Chapter 14

Overfitting and Regularization
m ≥ P∗ (14.1)
Figure 14.1: Fitting to data-points. Left: with one data-point, an infinite number of lines can fit the data; Right: with two data-points, one straight line fits them perfectly, but an infinite number of higher-order curves can also fit these two data-points.
Figure 14.2: Fitting to find decision boundaries for classification problems. Orange line:
overly simplified boundary; Green line: overly fitted boundary; Black line: better fitted
boundary. Modified based on image from Wikimedia Commons by Chabacano (https://
en.wikipedia.org/wiki/Overfitting) under the CC BY-SA 4.0 license.
In practice, m can be much larger than the P*-number. This lets the data do the job. Use the regularization as a preventive tool or as a last resort.
ŷ = Xŵ (14.2)
For a given dataset X and label y, we built the loss function L(y, ŷ(ŵ)) = [y − Xŵ]ᵀ[y − Xŵ]. We then went through an analytical minimization process, and the solution is found in the following matrix form:

ŵ* = (XᵀX)⁻¹ Xᵀ y    (14.3)
The first term on the right-hand side is the data loss. The second term is added as a regularization loss. λ is the regularization parameter, a pre-specified positive real number. Vector ŵ0 is also pre-specified, as the preferred value of the parameter ŵ. ŵ0 is often set to zero, because we usually do not know what is preferred. If we have some knowledge of the preferred value, we can put it in, and we shall have a shifted regularization. Hence, ŵ0 is also a regularization parameter (more on this in the example studies).

Adding the regularization term means that if the parameter ŵ departs from the preferred vector, a penalty is added to the loss function. The farther it departs, the larger the penalty, scaled by the regularization parameter. Therefore, a compromise needs to be reached between fitting the data and incurring the least regularization penalty.
We now go through the same analytical minimization process to find a regularized solution, the minimizer of the above loss function:

min_ŵ L(y, ŷ(ŵ), λ) = min_ŵ ( [y − Xŵ]ᵀ[y − Xŵ] + λ[ŵ − ŵ0]ᵀ[ŵ − ŵ0] )    (14.5)

Setting the derivative to zero,

∂L(y, ŷ(ŵ), λ)/∂ŵ = −2Xᵀ(y − Xŵ*) + 2λ[ŵ* − ŵ0] = 0    (14.6)

Solving Eq. (14.6) gives the regularized solution

ŵ* = (XᵀX + λI)⁻¹ (Xᵀy + λŵ0)    (14.7)

which is what the code below computes: the regularized system matrix K = XᵀX + λI, and the regularized right-hand side Xᵀy + λŵ0.
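As a quick numerical check of Eq. (14.7), here is a minimal NumPy sketch for the single data-point case X = [1 2.], y = [2.] used in the case studies below; the λ value is illustrative:

import numpy as np

X = np.array([[1., 2.]])      # the single data-point in X-matrix form
y = np.array([2.])            # its label
lmda = 0.01                   # a small regularization parameter
w0 = np.zeros(2)              # preferred solution set to zero
K = X.T @ X + lmda*np.eye(2)  # regularized system matrix
w_star = np.linalg.solve(K, X.T @ y + lmda*w0)
print(w_star)   # ~[0.3992, 0.7984], cf. the values reported in the case studies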
Figure 14.4: Contours of both the data loss function and the regularization loss function.
In some practices, regularization is applied only to the weights, not the biases. This is equivalent to using a diagonal λ matrix, with zeros for the diagonal terms corresponding to the biases. We will demonstrate this in the following example too.

We are now ready to test out this method using a very simple example, so that we can see the whole process and observe how the regularization is at work.
import numpy as np
import random

xdata = np.array([2.])   # assumed here: the single data-point x1 = 2. used in Case 2
lxd = len(xdata)
One = np.ones(lxd)
print('Length of xdata=', lxd, ' Value of xdata=', xdata)
X = np.stack((One, xdata), axis=-1)   # form the X matrix
print('X=', X, X.shape)
np.random.seed(8)
y = xdata.T + np.random.randn(lxd)*0.   # label y, 1 data-point
#y = xdata.T + np.random.rand(lxd)*2.   # label y for >= 3 data-points; random
#   numbers are used so that the points [x_i, y_i] are not collinear
#y = 2.*xdata.T + np.random.randn(lxd)*2.   # form y for >= 3 data-points
print('y=', y, y.shape)
It is clear that when only one data-point is used, the condition number of the XᵀX matrix is very large, and the rank of the XᵀX matrix is only 1. It is not solvable for ŵ* = [W*_0, W*_1]ᵀ = [b*, W*]ᵀ. This means that there are an infinite number of [W_0, W_1] = [b, W] that can be the solution, all of which honor the data.
We now use Eq. (14.7) to compute regularized solutions for this linear
regression, using different settings for the regularization parameter λ.
We use Part II of our code given below to perform the computation and Part III to plot the results. We shall study three cases, set by different values of lb0 and lw0 in the vector L0 = np.array([lb0, lw0]) in the code below. For all these cases, we will get a unique solution because of the effects of the regularization. However, the solution behavior will differ, affected by the setting of the regularization parameters for b and/or for W. Let us first examine Case 2.
print('w0=',w0,w0.shape)
Note in the code above that the input XTy is a 1D numpy array, and the result is also a 1D array, which is not the convention of linear algebra, as discussed in Section 2.9.13. One can also purposely define a column vector (a 2D array with only one column) following the convention of linear algebra, and get the solution. The solution will be the same, but in a column vector. Use of a 1D numpy array is more efficient. We have also seen an example in Section 3.1.11.
Case 2: Using the same data as Case 1 (one data-point: x_0 = 1, x_1 = 2., X = [1 2.], y = [2.]), but set the regularization parameters as L0 = np.array([1.0, 0.]), which regularizes only the bias b.

We write the following code to compute and plot the results.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import rc

print('Length of x=', lx, x)
Xm = np.stack((One, x), axis=-1)   # build up X using points x
I = np.identity(2)
print('w0=', w0, w0.shape)
print('Lambda base for b, and w =', L0, L0.shape)
print('X=', X, X.shape)
print('y=', y, y.shape)
plt.figure(figsize=(4.5, 2.9), dpi=100)
for L in Lmda:
    lmd = L*L0
    K = XTX + lmd*I         # regularized system matrix
    XTy0 = XTy + lmd*w0     # regularized y vector
    wr = np.linalg.solve(K, XTy0)
    print('Lambda='+str(f"{L:.2f}"), ' wr=', wr, XTy.shape,
          w0.shape, XTy0.shape, L.shape)
    plt.plot(x, y_hat(Xm, wr), label="Lambda="+str(f"{L:.2f}"))
    # compute the predicted lines and plot them
Figure 14.5: Simple linear regression for one data-point, with bias regularized at various
levels.
Using Part II and III of our code again, we now study Case 3.

Case 3: Use the same dataset as Case 2, but set L0 = np.array([0.0, 1.0]), which regularizes W only.
Figure 14.6: The regression line for Case 3: only W is regularized, but the results are not
affected by the regularization parameter λ (these 4 lines are overlapping). b is set free from
regularization.
In this case, b is left free (from regularization). The results are shown in Fig. 14.6. We shall observe the following points:

• This setting produces a zero solution for W (hence the slope will be zero), because it is regularized, and some value for b, because b is allowed to change freely.
• As we can see, the regression line is horizontal, as expected. This is because the solution for W, which is responsible for the slope, is zero.
• The line no longer passes through the origin, because b is no longer zero.
• The regression line still passes through the one data-point. This is because we regularize one parameter and the other parameter is free. Therefore, we have a total of two conditions: one for passing the data-point, and one for a zero slope W. These two conditions uniquely determine the two regression parameters, b and W.
• The magnitude of λ has, again, no effect on the solution, because of the setting with two conditions and two parameters. The amount of regularization is immaterial.
• It is clear that the Tikhonov regularization helped us again obtain a unique solution for the problem (Case 1) that has an infinite number of solutions. But the solution is different because of the different way in which the regularization is applied.
Using Part II and III of our code again, we now study Case 4 where we
regularize both b and W .
Figure 14.7: The regression line for Case 4: both b and W are regularized.
Case 4: Use the same dataset as Case 2, but set L0 = np.array([1.0, 1.0])
in code Part II.
In this case, both b and W are regularized, as shown in Fig. 14.7. We shall observe the following points:
• This setting produces nonzero solutions for both b and W, and hence the intercept and the slope will both be nonzero. This is because neither can change freely.
• The regression line no longer passes through the one data-point. This is because we regularize both parameters. Therefore, we have a total of three conditions: one for passing the data-point, one for a zero bias b, and one for a zero slope W. Because we have only two regression parameters, b and W, none of these conditions will be satisfied exactly. The solution is a compromise.
• This time, the magnitude of λ matters. When it is very small (0.1 for example), the line nearly passes through the data-point. Otherwise, the line will not pass through the data-point, for the same reason mentioned in the above point.
• It is clear that the Tikhonov regularization helped us again obtain a unique solution for the problem (Case 1) that has an infinite number of solutions. But the solution differs, depending on the level of the regularization parameters used.
Figure 14.8: The regression line for Case 4: both b and W are regularized, with a small regularization parameter.

Figure 14.9: The regression line for Case 5: both b and W are regularized, with respect to the preferred solution set at the true solution (0.4, 0.8).
Using our codes, we found that when we use a very small Lambda=0.01 for Case 4, where both b and W are regularized, the resulting regularized solution becomes [0.3992016, 0.79840319]. We can then guess that the true solution should be [0.4, 0.8]. Let us now study the following case.
Case 5: Set L0 = np.array([1.0, 1.0]) and ŵ0 = [0.4, 0.8] in the Part II
code.
In this case, both b and W are regularized with respect to a preferred solution that is the true solution. The results are plotted in Fig. 14.9. We shall observe the following points:
• This solution becomes the true solution for both b and W, meaning that if we know the true solution, we can put it in the regularization term, and our code will reproduce it. The regression line passes through the data-point. Figure 14.10 illustrates schematically what has happened in this case.

Figure 14.10: The regression line for Case 5: both b and W are regularized, with respect to the preferred solution set at the true solution (0.4, 0.8).
• As shown in Fig. 14.10, the minimizers of the data loss and of the regularization loss, and thus of the combined loss, are located at the same place (the true solution).
• In reality, it is not possible to know where the true solution of a complex problem is (if we did, we would not need any model). However, if we have some, but not exact, knowledge about the true solution, using this shifted regularization may be effective, because the trained model can be more representative of the data, yet biased toward the knowledge. Because of this, ŵ0 can also be a regularization parameter.
• Notice that ŵ0 is a vector, and hence its specification can be done per parameter/feature.
• The magnitude of the λ does not matter in this case. This is because our
data loss and regularization loss are essentially the same for this simple
problem. When the data loss differs from the regularization loss (which is
true for practical complex problems), the magnitude of the λ shall have
effects.
With the understanding established from the simple examples using one data-point, we can now extend the study to cases with two data-points, which shall reinforce our understanding of the regularization effects and their relationship with the data. We shall do this in reference to all five cases we have studied for one data-point.

Figure 14.11: The regression line for Case 2 with two data-points: only b is regularized, and W is set free from regularization.
For Case 1, when we have two data-points, and still using the linear regression model in one-dimensional space, we shall have a unique solution as long as these two data-points do not coincide. This is because the rank of the system matrix XᵀX is 2. The solution passes exactly through these two data-points. When these two data-points are X = [[1. 1.], [1. 3.]] and y = [1.4534 3.4534] (randomly generated), the true (un-regularized) solution is ŵ* = [0.45341172, 1.]ᵀ.
For Case 2 using two data-points, we also set L0 = np.array([1.0, 0.]).
In this case, we regularize b only and leave W free (from regularization). The
solution is plotted in Fig. 14.11.
We shall observe the following points:
• The regression line does not pass through these two data-points unless the regularization parameter is set to zero. This is because we regularize one parameter (and the other parameter is free). Therefore, we have a total of three conditions: two for passing these two data-points, and one for a zero bias b. The regularized solution is a compromise.
• When the regularization parameter is very small (0.01 for this case, blue
line in Fig. 14.11), the solution line closely passes these two points and
the result is close to Case 1 using two data-points.
• When the regularization parameter is large, the line is turned to give a
nearly zero bias b while minimizing the data loss (the sum of the distances
between these two data-points and the regression line).
Figure 14.12: The regression line for Case 3: only W is regularized, and b set free from
regularization.
• The magnitude of the λ has effects on the solution. The larger the λ, the
closer the b to zero.
• The Tikhonov regularization helped us obtain a compromise solution. It
uniquely depends on the data and the regularization parameter.
• The regression lines do not pass through these two data-points unless the
regularization parameter is set to zero. This is because we regularize one
parameter (and the other parameter is free). Therefore, we have a total
of three conditions: two for passing these two data-points, and one is for
small weight W . The regularized solutions are compromises.
• When the regularization parameter is very small (0.01 for this case, blue
line in Fig. 14.12), the solution line closely passes these two data-points
and the result is close to Case 1 using two data-points.
• When the regularization parameter is large, the line is turned to give
a small slope controlled by W while minimizing the data loss (the
sum of the distances between these two data-points and the regression
line).
• The magnitude of the λ has effects on the solution. The larger the λ, the
closer the slope W to zero.
Figure 14.13: The regression line for Case 4 with two data-points: both b and W are regularized.
• The regression line does not pass through these two data-points, unless the regularization parameter is set to zero. This is because we regularize two parameters. Therefore, we have a total of four conditions: two for passing these two data-points, and two for small b and W. The regularized solution is a compromise.
• When the regularization parameter is very small (0.01 for this case, blue line in Fig. 14.13), the solution line closely passes these two data-points and the result is close to Case 1 using two data-points.
• When the regularization parameter is large, the line is turned to give a small slope controlled by W, and also shifted downward to give a smaller intercept controlled by b, while minimizing the data loss (the sum of the distances between these two data-points and the regression line).
• The magnitude of λ has effects on the solution. The larger the λ, the smaller the intercept and the closer the slope W to zero.
From the study of Case 1 with two data-points, we obtained the true (un-regularized) solution ŵ* = [0.4534, 1.]ᵀ. We can then use the true solution for a shifted regularization.
Figure 14.14: The regression line for Case 5 with two data-points: both b and W are regularized, with respect to the preferred solution set at the true solution [0.4534, 1.0]ᵀ.
• This solution becomes the true solution for both b and W , meaning that if
we know the true solution, we can put it in the regularization term, and
our code will reproduce it. The regression line passes through these two
data-points, regardless of the level of regularization.
Figure 14.15: The regression line for Case 2 with three data-points: only b is regularized, and W is set free from regularization.
• The regression lines do not pass through these three data-points, even if
the regularization parameter is set to zero. This is because we have a total
of four conditions/equations: these three data-points and one for zero bias
b. Therefore, the regularized solution is a compromise.
• When the regularization parameter is very small (0.01 for this case, blue
line in Fig. 14.15), the solution line is close to the un-regularized true
least-square solution.
• When the regularization parameter is large, the line is turned to give a
nearly zero bias b while minimizing the data loss (the sum of the distances
between these three data-points and the regression line).
• The magnitude of the λ has effects on the solution. The larger the λ, the
closer the b to zero.
• The Tikhonov regularization helped us obtain a compromise solution. It
uniquely depends on the data and the regularization parameter.
Figure 14.16: The regression line for Case 3 with three data-points: only W is regularized, and b is set free from regularization.

Figure 14.17: The regression line for Case 4 with three data-points: both b and W are regularized.
• When the regularization parameter is very small (0.01 for this case, blue
line in Fig. 14.17), the solution line is close to the un-regularized true
least-square solution.
• When the regularization parameter is large, the line is turned to give a small slope controlled by W, and also shifted downward to give a smaller intercept controlled by b, while minimizing the data loss.
• The magnitude of λ has effects on the solution. The larger the λ, the smaller the intercept and the closer the slope W to zero.
From the study of Case 1 with three data-points, we obtained the true (un-regularized) solution ŵ* = [1.8138, 0.9979]ᵀ. We can then use the true solution for a shifted regularization.
• This solution becomes the true least-square solution for both b and W, meaning that if we know the true solution, we can put it in the regularization term, and our code will reproduce it. The regression line reproduces the un-regularized least-square fit to these three data-points, regardless of the level of regularization.
Figure 14.18: The regression line for Case 5 with three data-points: both b and W are regularized, with respect to the preferred solution set at the true solution [1.8138, 0.9979]ᵀ.
These studies show that the Tikhonov regularization has the following major
effects:
• For small datasets, it allows one to make use of knowledge of the data to
make up the shortfall in information to obtain a unique solution.
• For excessive datasets, it offers means to control the magnitudes of the
training parameters to prevent overfitting.
• For noisy datasets, the regularization offers means for mitigating the error.
Based on our study of the universal approximation theory, we found that the weights and biases are responsible for producing the pulses for approximating a function (see Section 7.3). We have also demonstrated that the larger the weights, the steeper the pulse, hence offering more capability for approximating more complex functions. If the weight values are sufficiently large, the model may even pick up the noise, which generally has higher-frequency variations. When the Tikhonov regularization is applied, the values of the weights are generally reduced, as shown explicitly in these case studies. Therefore, the model will be less sensitive to noise.
• All these are done via the use of regularization parameters, including the following:
(1) Parameter λ, or a λ matrix of parameters, to control the norms of all, partial, or individual features.
(2) Shifting parameter ŵ0 that can also be specified for individual
features, to allow the regularization loss to be more in alignment with
the data loss.
• Regularization parameters can be set differently based on the types of the
features.
• Other norms may be used for computing the regularization loss.
Based on the above detailed study using the simple examples, we shall have a
good understanding of how a regularization may work for a machine learning
model. We are ready for a study on its effects on a real machine learning
model. We now revisit the handwritten digit classification problem on the
MNIST dataset, which was studied earlier.
Note that our earlier models do not overfit because we used all 60,000 training examples, which is far more than the P*-number and sometimes even the P-number in those models. For example, we used a one-layer net that has P = 7,850 parameters: 784 × 10 = 7,840 weights plus 10 biases, and P* = 785 (which is the pseudo-dimension of the affine space for the single layer). Thus, m ≫ P* and m ≫ P.
Let us do a case study to examine what might happen if we use fewer training samples, say 2,000. For the one-layer net, we then have m > P* but m < P. This study will show in detail the effects of the regularization technique on mitigating the overfitting. It is done following the study and using the code at mxnet-the-straight-dope (https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter02_supervised-learning/regularization-scratch.ipynb), under the Apache-2.0 License.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

mnist = mx.test_utils.get_mnist()
num_examples = 2000
batch_size = 64
train_data = mx.gluon.data.DataLoader(mx.gluon.data.ArrayDataset(
    mnist["train_data"][:num_examples],
    mnist["train_label"][:num_examples].astype(np.float32)),
    batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.ArrayDataset(
    mnist["test_data"][:num_examples],
    mnist["test_label"][:num_examples].astype(np.float32)),
    batch_size, shuffle=False)
We use a simple linear model with softmax activation as the prediction function, and set up the following neural network model:
def net(X):
    y_linear = nd.dot(X, W) + b           # define the affine mapping
    yhat = nd.softmax(y_linear, axis=1)
    return yhat                           # return the prediction
We use the cross entropy, which we have discussed in detail earlier, as the
loss function.
The following function is used to plot out the change of the loss functions
against the iterations during the minimization process:
def plot_learningcurves(loss_tr, loss_ts, acc_tr, acc_ts):
    # Function header and figure setup reconstructed following the
    # mxnet-the-straight-dope code this section is based on.
    xs = list(range(len(loss_tr)))
    f = plt.figure(figsize=(12, 6))
    fg1 = f.add_subplot(121)
    fg2 = f.add_subplot(122)

    fg1.set_xlabel('epoch', fontsize=14)
    fg1.set_title('Comparing loss functions')
    fg1.semilogy(xs, loss_tr)
    fg1.semilogy(xs, loss_ts)
    fg1.grid(True, which="both")
    fg1.legend(['training loss', 'testing loss'], fontsize=14)

    fg2.set_title('Comparing accuracy')
    fg2.set_xlabel('epoch', fontsize=14)
    fg2.plot(xs, acc_tr)
    fg2.plot(xs, acc_ts)
    fg2.grid(True, which="both")
    fg2.legend(['training accuracy', 'testing accuracy'], fontsize=14)
We now train the neural network model to its fullest learning ability.
epochs = 1000
moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []
for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)
            loss = cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, .001)

        # Keep a moving average of the losses
        niter += 1
        moving_loss = .99*moving_loss + .01*nd.mean(loss).asscalar()
        est_loss = moving_loss/(1 - 0.99**niter)
    if e % 300 == 299:
        print("Epoch %s. Train Loss: %.2e, Test Loss: %.2e, "
              "Train_acc: %.2e, Test_acc: %.2e" %
              (e+1, train_loss, test_loss, train_accuracy, test_accuracy))
Figure 14.19: A case of over-fitting when the one-layer net is used for a reduced dataset
of MNIST that has only 2,000 samples.
In the above example, it is seen that at the 1,000th epoch, the model achieves 100% accuracy on the training data (the loss function is still quite far from zero, partially because m > P*). However, it classifies only ∼79% of the test data accurately. This is a typical case of overfitting. Clearly, such a "well"-trained model cannot perform well on the test dataset, or on real-world data that the model has not seen during training.

If the sample size is reduced to 1,000 with all other settings unchanged, this model achieves 100% accuracy on the training data at the 700th epoch, but it classifies only ∼75% of the test data accurately.
def l2_penalty(params):
    penalty = nd.zeros(shape=1)
    for param in params:
        penalty = penalty + nd.sum(param ** 2)   # sum of squared parameters
    return penalty
epochs = 1000
moving_loss = 0.
l2_lambda = 0.1 #0.3 # regularization parameter
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)
            loss = nd.sum(cross_entropy(output, label_one_hot)) + \
                   l2_lambda*l2_penalty(params)   # regularization loss added
        loss.backward()
        SGD(params, .001)
Figure 14.20: The L2 regularization has reduced the gap between the training and testing
accuracy. One-layer net using the reduced dataset of MNIST that has only 2,000 samples.
Ideally, we would want a model having the same level of accuracy for both the training and test datasets. Tuning the regularization parameter can help to achieve this goal, but it can often be a challenging task. Readers may give it a try using the above code.

We finally note that the choice of the regularization parameter is an issue. Experience may help, or trial and error may be needed to train a robust model. Below is another case study.
We now introduce another case study done by the scikit-learn team [2], for classification, to investigate the influence of the regularization parameters on the predicted decision boundary. We have made minor changes to the original code to fix issues possibly caused by module version differences. We have also done some tuning of the parameters. This study is on the effects of the regularization parameter "alpha" (equivalent to the lambda in our formulation). It uses synthetic datasets for classification, and the plots will show the different decision boundaries produced when using different alpha values.
%matplotlib inline
print(__doc__)

# Author: Issam H. Laradji, License: BSD 3 clause
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
classifiers = []
for i in alphas:
    classifiers.append(MLPClassifier(solver='lbfgs', alpha=i,
                                     max_iter=1500, random_state=1,
                                     hidden_layer_sizes=[100, 100]))

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=0, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max()-.3,yy.min()+.3,
('%.2f' % score).lstrip('0'), size=15,
horizontalalignment='right')
i += 1
figure.subplots_adjust(left=.02, right=.98)
plt.show();
References
[1] G.R. Liu and X. Han, Computational Inverse Techniques in Nondestructive Evaluation,
Taylor and Francis Group, New York, 2003.
[2] F. Pedregosa, G. Varoquaux, A. Gramfort et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/papers/v12/pedregosa11a.html.
Chapter 15
Convolutional Neural Network (CNN): Classification and Object Detection
Figure 15.1: A 3 × 3 filter (in the middle of the figure) acting like an edge detector that captures the vertical edge line in the middle of the 6 × 6 pixel image as a 4 × 4 bright strip, when the filter slides over and within the 6 × 6 pixel image.
in the middle. Therefore, it acts like a vertical edge detector. We can expect
the following:
• If we rotate the red filter matrix, we shall have a horizontal edge detector.
• If we rotate the red filter matrix by an arbitrary angle, we shall have an inclined edge detector.
• One can have many other types of filters, such as the "Sobel" filter and the "Scharr" filter.
• In the above case, the filter slides per pixel, that is, the "stride" is 1. One may use larger strides, such as two pixels per slide (see the sketch following this list).
• The filter size can also change; 5 × 5 is another often-used filter size (usually an odd number, in order to always have a central pixel).
• The above convolution results in a reduction in size of the filtered image. To avoid this, one may use a technique called "padding": extending the original 6 × 6 pixel image by filling in zeros on its boundaries, making the original image bigger, so that the filtered image can have the same size.
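To make the sliding-filter operation concrete, here is a minimal NumPy sketch (an illustration, not the book's code) of a 3 × 3 vertical edge detector sliding over a 6 × 6 image with stride 1 and no padding, producing a 4 × 4 output:

import numpy as np

def conv2d(img, filt, stride=1):
    """Slide filt over img (no padding); sum of element-wise products."""
    k = filt.shape[0]
    out_h = (img.shape[0] - k)//stride + 1
    out_w = (img.shape[1] - k)//stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * filt)
    return out

img = np.zeros((6, 6))
img[:, :3] = 1.0                    # bright left half, dark right half
vertical = np.array([[1., 0., -1.],
                     [1., 0., -1.],
                     [1., 0., -1.]])
print(conv2d(img, vertical))        # 4 x 4 output; bright strip at the edge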
Figure 15.2: A 3 × 3 filter (gray) sliding over a 5 × 5 pixel image (blue) with stride=1
and padding=1 (padding part is not colored). It generates a cyan image of the same (5 ×
5) size as the blue image.
The settings for different strides and paddings are shown in Fig. 15.3.
Figure 15.3: In the upper row, from left to right: f=3,s=1,p=0; f=4,s=1,p=2;
f=3,s=1,p=1. In the bottom row, from left to right: f=3,s=2,p=0; f=3,s=2,p=1;
f=3,s=2,p=1.
where f stands for the filter size, s for stride, and p for padding. Readers may take a look at the cartoons and pictures available online to get a better understanding.
Figure 15.4: Schematic of a typical affine transformation unit (ATU) wrapped with an activation function, with learning parameters (weights w_i and bias b) in a single channel of a conv2D using a 3 × 3 filter in one layer of a CNN. φ stands for an activation function (usually the ReLU). For a 3 × 3 filter, we have a total of 9 weights and one bias, resulting in a hypothesis space W^10 for the ATU. For this 6 × 6 pixel image with stride 1 and no padding, the output is a 4 × 4 image. This output image may be subjected to a "pooling" and then to another layer of convolution.
15.3 Pooling
Figure 15.5: Max pooling in a CNN: a 2 × 2 pooling filter covers 2 × 2 pixels in a filtered
4 × 4 image. It takes the maximum pixel value from the 2 × 2 pixels of each of these four
covers, and forms a 2 × 2 new image.
• One may also take a different number of strides when covering the image.
• One may also use a different cover size (see the sketch below).
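A minimal NumPy sketch of max pooling (an illustration under assumed window and stride settings, following the indexing pattern of the convolution sketch above):

import numpy as np

def max_pool2d(img, k=2, stride=2):
    """Max pooling with a k x k cover sliding with the given stride."""
    out_h = (img.shape[0] - k)//stride + 1
    out_w = (img.shape[1] - k)//stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = img[i*stride:i*stride+k, j*stride:j*stride+k].max()
    return out

img4 = np.arange(16.).reshape(4, 4)   # a filtered 4 x 4 image
print(max_pool2d(img4))               # 2 x 2 image of block maxima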
15.4 Up Sampling
Figure 15.8: Configuration of a CNN: an RGB image is first fed to a set of filters (F) and is then subjected to a possible pooling (P) operation. Another set of filters may be applied, followed by another possible pooling. An F-P pair may constitute a CNN layer. The output of the final CNN layer may be fed to a few normal dense layers producing a prediction.
pooling is applied, it reduces the height and width again, but keeps the number of channels unchanged. Usually, an F-P pair constitutes a CNN layer, and we can have quite a large number of CNN layers. The output of the final CNN layer may (or may not) be fed to a few normal dense layers, and then produces a prediction. The subsequent operations are essentially the same as those in an MLP.
The network in Fig. 15.8 has 1 input layer, 2 CONV layers, 1 hidden
dense layer, and 1 output dense layer. In practice, many CNNs have a large
number of CONV layers.
With the building blocks we have discussed so far, one can build a really
complex neural network for very complicated tasks of computer vision and
image-based object detection. The development in this area is very fast
and many useful special networks have already been proposed. This section
discusses briefly some of the widely used landmark CNN networks. These
networks were built upon the basic configurations of dense layers, CONV
layers, and their variations and combinations. Many new special techniques
and tricks have been invented to improve the learning capacity and efficiency.
In fact, there is practically no limit to the types of neural networks that
we can build, but some work well and some may not for some problems.
Having an overall knowledge of these successful landmark networks can be
very useful before one starts to build a new network. The source codes of many of these networks are openly available on GitHub, and can be made use of effectively, subject to the licenses specified there.
We will not be able to cover all of the good landmark CNN networks, but readers can find more in open sources with the leads given in this chapter.
15.6.1 LeNet-5
This famous CNN was proposed in 1998 by LeCun et al. [8]. The basic (though not exactly identical) structure is given in Fig. 15.9. It uses a black-and-white image with a size of (32 × 32 × 1) as the input, the sigmoid activation function for the hidden layers, and the softmax at the output.
The detailed numbers related to the configuration of the LeNet-5 network are listed in Fig. 15.10.
Figure 15.9: LeNet-5 for handwritten digit recognition using black-and-white images. It has 1 input layer, 2 CONV layers, 2 hidden dense layers, and 1 output dense layer.
Figure 15.10: Data size and learning parameters for LeNet-5. It has a total of 61,706 training parameters. The CONV layers have very few training parameters; most of the learning parameters are due to the dense layers.
Figure 15.10 shows clearly how the size of the data flowing forward across the network, as well as the number of learning parameters in each layer and in total, is calculated. Similar calculations can be used to examine other networks.
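These counts can be reproduced quickly in Keras. The sketch below (illustrative only, not the book’s code) follows the layer sizes of LeNet-5 with average pooling; details such as the exact sub-sampling and activations of the original 1998 network differ slightly:

from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Conv2D(6, (5, 5), activation='sigmoid',
                  input_shape=(32, 32, 1)),          # 28x28x6, 156 params
    layers.AveragePooling2D((2, 2)),                 # 14x14x6
    layers.Conv2D(16, (5, 5), activation='sigmoid'), # 10x10x16, 2,416 params
    layers.AveragePooling2D((2, 2)),                 # 5x5x16
    layers.Flatten(),                                # 400 features
    layers.Dense(120, activation='sigmoid'),         # 48,120 params
    layers.Dense(84, activation='sigmoid'),          # 10,164 params
    layers.Dense(10, activation='softmax')])         # 850 params
lenet5.summary()   # 61,706 parameters in total, as in Fig. 15.10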
15.6.2 AlexNet
Another famous landmark CNN network was proposed in 2012 by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. This network won the ImageNet LSVRC-2012 competition. Its configuration is shown in Fig. 15.11. The details can be found in the original paper [9], which is widely cited and can be easily found online. AlexNet has some similarities with LeNet-5, but with 3 more CONV layers and one more dense layer, and its 3 dense layers have many more neurons. It was built for 1000-class classification on the ImageNet dataset, and it is much larger in scale. It also uses a number of new techniques, including the Relu activation function to deal with the vanishing gradient problem, and the dropout technique in the first 2 fully connected dense layers to mitigate over-fitting.
Readers may read the original paper for more details. Good analyses of the AlexNet are also given in Andrew Ng’s online lectures and Hao Gao’s article.
Figure 15.11: AlexNet: It has 1 input layer, 5 CONV layers, 3 hidden dense layers, and
1 output dense layer. The total learning parameters are about 62.3 million, about 6% of
which are from the CONV layers, and 94% from the fully connected dense layers.
15.6.3 VGG-16
VGG-16 was proposed in 2014 by Karen Simonyan and Andrew Zisserman [10]. Its configuration is remarkably uniform:
• All the CONV layers use the same 3 × 3 filter size, with stride 1 and padding 1, each group followed by a max pooling.
• All the max pooling operations use the same 2 × 2 size with stride 2.
• The height and width of the net block are roughly halved from one CONV layer to the next.
• The number of filter channels doubles from one CONV layer to the next.
15.6.4 ResNet
ResNet stands for residual neural network. It was proposed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in 2015 [11]. The major difference of the ResNet from the nets discussed above, such as the VGG-16, is the use of skip connections: the activation of one layer (CONV or dense) can be connected to the next-next layer, skipping one layer, before applying the activation (often the Relu). This effectively mitigates the vanishing of gradients and enables the net to go much deeper: ResNets with up to 152 layers have been built, with about 23 million learning parameters in total. The capability of going deep is of importance for many visual recognition tasks. The ResNet obtained a 28% relative improvement on the COCO object detection dataset. It was the foundation of the authors’ submissions to the ILSVRC & COCO 2015 competitions, where they won 1st place in the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
Here the number in the parentheses in the superscript stands for the layer number. The terms above the curly brackets are the affine transformation functions z(i) (i = 1, 2, . . . , NL − 1), which can change from layer to layer and bring in information from the past layers, as new independent “features” for the next layer after being fed to an activation function. The dimension of the new feature space changes accordingly. For layers that take inputs from the previous layers, the dimensionality adds up. This may be the reason why the gradient vanishing issue is mitigated.
Equation (15.1) assumes that the skipping is applied to all the layers starting from the 3rd hidden layer. One may choose not to skip some layers. In that case, the term in blue in the corresponding layer should be removed. It is also assumed that the skips are only over one layer. When multiple-layer skipping is performed, Eq. (15.1) can be easily adjusted accordingly. Note that the activation function used in each layer can in theory be different, although we typically use the Relu for all the layers in a ResNet. Also, the affine transformation z(l) in any layer can be replaced by that of a CONV layer. This similarly applies to a net with mixed dense and CONV layers.
A more detailed description of the ResNet can be found in the original paper and also in an article at the Towards Data Science website, where a detailed architecture of the ResNet is given, in comparison with the VGG-19 (a deeper variant of VGG-16) and the so-called plain network (without skipping).
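The skip connection can be sketched with the Keras functional API in a few lines (a minimal illustration, not the original ResNet code; the helper name residual_block is made up here):

from tensorflow.keras import layers

def residual_block(x, units):
    """Relu(z2 + x): x skips over one layer and is added to the
    affine output of the next-next layer before its activation.
    The input x must have `units` components for the addition."""
    a1 = layers.Dense(units, activation='relu')(x)  # first layer
    z2 = layers.Dense(units)(a1)                    # affine part of second layer
    return layers.Activation('relu')(layers.Add()([z2, x]))

For CONV layers, the Dense layers are simply replaced by Conv2D layers with matching output shapes.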
15.6.5 Inception
• It uses the so-called 1×1 convolution across the channels. This allows a configuration of a densely connected 1D network across the channels within a CONV layer, termed network-in-network. The 1×1 convolution was proposed by Min Lin, Qiang Chen, and Shuicheng Yan in 2014 [13].
• The use of the network-in-network configuration enables the reduction of the number of channels within a CONV layer, which leads to drastic savings in computation, as sketched after this list.
• It constructs an Inception Module that packages multiple convolutional operations, including 1×1, 3×3, and 5×5 convolutions and max pooling, all in one CONV layer, which allows the net to learn which operations and filter (or kernel) sizes are best to use in the CONV layer. This is made possible and efficient with the help of the 1×1 convolutions.
• The Inception modules become new building blocks for constructing a deep Inception network [12].
Readers may find more details in the original papers and an introductory article at Towards Data Science.
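The computational saving can be illustrated with a small Keras sketch (the sizes are illustrative, following a commonly cited example). A direct 5×5 convolution producing 32 channels from a 28 × 28 × 192 block costs about 28 × 28 × 32 × 5 × 5 × 192 ≈ 120 million multiplications; inserting a 1×1 “bottleneck” of 16 channels first reduces this to about 12 million:

from tensorflow.keras import layers, models

direct = models.Sequential([               # ~120M multiplications
    layers.Conv2D(32, (5, 5), padding='same',
                  input_shape=(28, 28, 192))])

bottleneck = models.Sequential([           # ~12M multiplications in total
    layers.Conv2D(16, (1, 1),
                  input_shape=(28, 28, 192)),  # 1x1 channel reduction
    layers.Conv2D(32, (5, 5), padding='same')])
direct.summary()
bottleneck.summary()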
There are many other techniques and tricks developed and implemented in YOLO. Interested readers may read the original paper [14] and online articles, such as the one at Towards Data Science. One may also visit the YOLO project website, where one finds a benchmark study on the performance of object detection on the COCO dataset. One may also follow up with the latest developments there.
This course has prepared readers with quite a large amount of fundamentals on machine learning. We now have many building blocks, and we have seen some of the outstanding implementations. It is time to put these building blocks together to solve some problems at hand. When a network becomes deep, there are many different ways to add in different types of connections and to put things together for complicated tasks. We shall also continue the study and invent new techniques and tricks. It may also be productive to follow up with the current work of the front-runners in this very active area of development. Many novel techniques and tricks are developed every day, and some of them can be useful not only for projects that you may be working on, but also for developing new ideas if one is interested in research in the related areas. The author and collaborators have also conducted some studies in the related area [6].
Finally, let us look at a CNN network provided by TensorFlow, which can be built and trained with just a few lines of code. The material is licensed under the Apache License, Version 2.0.
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
# Load CIFAR10 and scale pixels to [0, 1] (needed by the code below)
(train_images,train_labels),(test_images,test_labels)=datasets.cifar10.load_data()
train_images, test_images = train_images/255.0, test_images/255.0
The CIFAR10 dataset contains 60,000 color images in 10 classes, with 6,000
images in each class. The dataset is divided into 50,000 training images and
10,000 testing images. The classes are mutually exclusive and there is no
overlap between them.
To verify the dataset, let us plot the first 36 images from the training set and display the class name below each image.
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
plt.figure(figsize=(10,10))
for i in range(36):
    plt.subplot(6,6,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays, hence the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', \
input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.summary()
Model: "sequential"
The output tensor from the last Conv2D layer, of shape (4, 4, 64), is fed into one or more dense layers to perform multi-class classification. Because the dense layers take 1D vectors as input, the 3D Conv2D output shall be flattened to 1D. Since CIFAR has 10 classes at the output, the final Dense layer shall have 10 output neurons. Note that no softmax is attached here: the loss function below is configured with from_logits=True, so it works directly on the logits.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
Model: "sequential"
As shown above, the 3D (4, 4, 64) outputs at conv2d_2 are flattened into a 1D vector of shape (1024,) (4 × 4 × 64 = 1024) before being fed to the first dense layer.
model.compile(optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True), metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
    validation_data=(test_images, test_labels))
# history is used for the accuracy plot below
Epoch 1/10
1563/1563 [================] - 45s 28ms/step - loss: 1.7221 -
accuracy: 0.3648 - val_loss: 1.2347 - val_accuracy: 0.5573
Epoch 2/10
1563/1563 [================] - 42s 27ms/step - loss: 1.1582 -
accuracy: 0.5869 - val_loss: 1.0298 - val_accuracy: 0.6363
.....
Epoch 9/10
1563/1563 [================] - 39s 25ms/step - loss: 0.5681 -
accuracy: 0.7993 - val_loss: 0.8510 - val_accuracy: 0.7178
Epoch 10/10
1563/1563 [================] - 39s 25ms/step - loss: 0.5337 -
accuracy: 0.8114 - val_loss: 0.9123 - val_accuracy: 0.6919
The training of our CNN model is done for 10 epochs, taking about 40 s for each epoch.
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'],label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')
test_loss,test_acc=model.evaluate(test_images,test_labels, verbose=2)
This simple CNN has achieved a test accuracy of about 70%, as shown in Fig. 15.13. Not bad for a few lines of code! For another CNN model, readers may take a look at the example using the Keras subclassing API and tf.GradientTape at the TensorFlow website.
Figure 15.14: YOLOv3 object detection: cars, persons, traffic light around a traffic light
junction.
Figure 15.15: YOLOv3 has successfully detected a man standing on a truck (on the right side of the photo, quite difficult for humans to see), even though that part of the image is rather dark.
Figure 15.17: Four cars and one fire hydrant detected by the trained YOLOv3.
Figure 15.18: Two birds on marsh land detected by the trained YOLOv3.
Before ending this chapter, we note that from MLP to conv2D, there is quite a significant change. Such a change has led to huge advancements in ML. This tells us that our minds need to be wide open for advancements in sciences and technology. One may now ask: Can we have conv1D? The answer is yes. Interested readers may take a look at online articles on this topic.
References
[1] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, 36, 193–202, 1980.
[2] D. Ciresan, U. Meier and J. Schmidhuber, Multi-column deep neural networks for image classification, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3642–3649, 2012.
[3] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature, 521, 436–444, 2015.
[4] M.V. Valueva, N.N. Nagornov, P.A. Lyakhov et al., Application of the residue number system to reduce hardware costs of the convolutional neural network implementation, Mathematics and Computers in Simulation, 177, 232–243, 2020. https://www.sciencedirect.com/science/article/pii/S0378475420301580.
[5] Duan Shuyong, Ma Honglei, G.R. Liu et al., Development of an automatic lawnmower with real-time computer vision for obstacle avoidance, International Journal of Computational Methods, Accepted, 2021.
[6] Duan Shuyong, Lu Ningning, Lyu Zhongwei et al., An anchor box setting technique based on differences between categories for object detection, International Journal of Intelligent Robotics and Applications, 1–14, 2021.
[7] V. Dumoulin and F. Visin, A guide to convolution arithmetic for deep learning, arXiv, 1603.07285, 2018.
[8] Y. LeCun, L. Bottou, Y. Bengio et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86(11), 2278–2324, 1998.
[9] A. Krizhevsky, I. Sutskever and G.E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012.
[10] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014. https://www.bibsonomy.org/bibtex/2bc3ee27a1dd159f48b10ac3555879865/buch jon.
[11] He Kaiming, Zhang Xiangyu, Ren Shaoqing et al., Deep residual learning for image recognition, CoRR, abs/1512.03385, 2015. http://arxiv.org/abs/1512.03385.
[12] C. Szegedy, W. Liu, Y. Jia et al., Going deeper with convolutions, 2014.
[13] Min Lin, Qiang Chen and Shuicheng Yan, Network in network, 2014. https://arxiv.org/abs/1312.4400.
[14] J. Redmon, S. Divvala, R. Girshick et al., You only look once: Unified, real-time object detection, CoRR, abs/1506.02640, 2015. http://arxiv.org/abs/1506.02640.
Chapter 16
Recurrent Neural Network (RNN) and Sequence Feature Models
Recurrent neural networks (RNNs) are yet another quite special class of
artificial neural networks (NNs) for datasets with features of sequence or
temporal significance. An RNN typically has an internal state recorded
in a memory cell, so that it can process sequential inputs, such as
video and speech records. Affine transformations are used in an extended
affine space that includes memorized states as additional features. RNNs
have important applications in various areas including natural language
processing, speech recognition, translation, music composition, robot control,
and unconstrained handwriting recognition [1], to name just a few. A recent
review on RNNs can be found in [2].
Different types of RNNs have been developed [2]. This chapter focuses on
the so-called long short-term memory networks (LSTMs) presented originally
by Sepp Hochreiter and Jürgen Schmidhuber [3]. This is because (1) LSTM
is most widely used and studied; (2) when LSTM is well understood, it is
relatively easy to understand other RNNs; and (3) most publicly available
modules have LSTM classes built in for ready use.
A typical LSTM unit is composed of a cell, an update gate, a forget gate, and an output gate. The cell remembers values over the time sequence corresponding to the sequential data inputs. The three gates regulate the information exchange between the memory cell and the newly inputted data. The level of the regulation is controlled by nested neural network (NN) layers with learning parameters. These parameters are trained via a supervised process similar to that of the NN or MLP models discussed in previous chapters.
LSTMs were developed to deal with the vanishing gradient problem encountered in training traditional NNs, because they allow information to travel deep in the layers through the interconnections built into the LSTM.
In an LSTM, the initial vectors a(0) and c(0) are usually set to zero.
These activations can be outputted at each time sequence or only at the
final time T , depending on the type of labels used in the training for the
problem at hand.
Figure 16.1: Unfolded identical LSTM units for processing a sequential dataset in an
LSTM RNN. For a given sequential dataset x, the LSTM uses a memory cell to memorize
the state c at any time. The corresponding activation becomes a, which is updated and
can be outputted at each time sequence or only at the final time T .
Figure 16.2: An LSTM unit. It has, recurrently, a memory cell c(t−1) that memorizes the state and an activation state a(t−1), both in X^l. The unit corresponds to data input x(t) ∈ X^p at time t. The orange blocks are neural network layers, or an MLP, that perform affine transformations in an extended affine space X^(1+p+l), producing a vector in X^l. Yellow circles and ellipse denote element-wise operations.
These gates are equipped with independent learning parameters, and used
to control the information exchange between x, a and c.
Note that Fig. 16.1 is used to show the sequential process inside an LSTM.
In actual computation, only one LSTM unit is needed in a loop, by feeding
back the a(t) and c(t) to the next time sequence. A typical unit is shown in
Fig. 16.2, where all the detailed operations are shown.
Figure 16.3: LSTM layer. An LSTM can be simply regarded as a normal NN layer that can be placed horizontally (left) or vertically (right). The input x typically has a shape of (m, T, p) and the output a has a shape of (m, T, l), where m stands for mini-batch size. If only the final activation output is needed, a(T) has a shape of (m, l).
We highlight the input in blue, and the output in red. These two are the connection points between LSTM units. The learning parameters are Ŵc, Ŵu, Ŵf, and Ŵo. If single NN layers are used in an LSTM, they are all in W^((1+l+p)×l), and thus the total number of training parameters is P = 4 × (1 + l + p) × l. Equation (16.3) is practically a computational graph of the LSTM, which gives both forward and backward paths. Since all these operations are chained and all the derivatives of these activation functions are available, computing the gradients with respect to any of these training parameters can be done in the standard way using autograd, as discussed in Chapter 8.
It is important to note that Eq. (16.3) and the figures may look complicated, but an LSTM can simply be regarded as a single NN layer. The input to the LSTM is only x(t), and the output from it is just a(t). All the others are internal variables. An LSTM layer can be depicted as in Fig. 16.3. Therefore, an LSTM layer is very easy to use: it can serve as a hidden layer in a neural network model that has a proper loss function defined at the terminal output layer of the entire model.
Consider now a special case when the number of the time sequences T = 1.
In this case, a(t−1) = c(t−1) = 0. Our formulation is reduced to
$$
\begin{aligned}
\mathbf{x} &= [1,\ \mathbf{x}^{(t)}] \in \mathbb{X}^{1+p} && \text{standard affine space}\\
\tilde{\mathbf{c}} &= \tanh(\mathbf{x}\hat{\mathbf{W}}_c) \in (-1,1)^{l} && \text{candidate memory}\\
\mathbf{g}_u &= \sigma(\mathbf{x}\hat{\mathbf{W}}_u) \in (0,1)^{l} && \text{update gate}\\
\mathbf{g}_f &= \sigma(\mathbf{x}\hat{\mathbf{W}}_f) \in (0,1)^{l} && \text{forget gate}\\
\mathbf{g}_o &= \sigma(\mathbf{x}\hat{\mathbf{W}}_o) \in (0,1)^{l} && \text{output gate}\\
\mathbf{c}^{(t)} &= \mathbf{g}_u * \tilde{\mathbf{c}} \in (-1,1)^{l} && \text{new memory}\\
\mathbf{a}^{(t)} &= \mathbf{g}_o * \tanh(\mathbf{c}^{(t)}) \in (-1,1)^{l} && \text{new activation}
\end{aligned} \tag{16.4}
$$
There are many other versions of LSTM, and peephole LSTM [4] is one of
them. Peephole LSTM is said to allow the gates to access the constant error
carousel, using the cell state c(t−1) instead of activation a(t−1) . Note that
c(t−1) and a(t−1) have the same shape, as discussed earlier. The formulation
is thus quite similar and is given as follows:
$$
\begin{aligned}
\mathbf{x} &= [1,\ \mathbf{x}^{(t)}] \in \mathbb{X}^{1+p} && \text{standard affine space}\\
\mathbf{x}_c &= [\mathbf{x},\ \mathbf{c}^{(t-1)}] \in \mathbb{X}^{1+l+p} && \text{extended affine space}
\end{aligned}
$$
All the other operations of the peephole LSTM are similar to the LSTM.
Gated recurrent units (GRUs) were introduced by Kyunghyun Cho et al. [5]. A GRU may be viewed as a version of the LSTM without a dedicated memory cell, and hence with fewer learning parameters than the LSTM. The formulation is as follows.
Consider time t. The activation state at the previous time is a(t−1), with a(0) = 0. Let l be the dimension of the output of the GRU. Data x(t) is a vector of p components, and is now inputted. Data x(t) and the previous activation a(t−1) are concatenated, together with the constant entry 1, to form an extended affine space, which has a dimension of 1 + p + l. We perform the following operations:
$$
\begin{aligned}
\mathbf{x}_a &= [1,\ \mathbf{x}^{(t)},\ \mathbf{a}^{(t-1)}] \in \mathbb{X}^{1+l+p} && \text{extended affine space}\\
\mathbf{g}_u &= \sigma(\mathbf{x}_a\hat{\mathbf{W}}_u) \in (0,1)^{l} && \text{update gate}\\
\mathbf{g}_r &= \sigma(\mathbf{x}_a\hat{\mathbf{W}}_r) \in (0,1)^{l} && \text{reset gate}\\
\mathbf{x}_{ra} &= [1,\ \mathbf{x}^{(t)},\ \mathbf{g}_r * \mathbf{a}^{(t-1)}] \in \mathbb{X}^{1+l+p} && \text{reset extended affine space}\\
\tilde{\mathbf{a}} &= \tanh(\mathbf{x}_{ra}\hat{\mathbf{W}}_a) \in (-1,1)^{l} && \text{candidate activation}\\
\mathbf{a}^{(t)} &= (1-\mathbf{g}_u) * \mathbf{a}^{(t-1)} + \mathbf{g}_u * \tilde{\mathbf{a}} \in (-1,1)^{l} && \text{new output activation}
\end{aligned} \tag{16.7}
$$
The learning parameters in a GRU are Ŵu, Ŵr, and Ŵa, 3/4 of those in an LSTM. Alternative versions of GRUs are also available. Interested readers may visit the Wiki page on GRUs and the links therein.
16.5 Examples
Let us first look at an example for the reduced LSTM that can be
easily handcrafted for comprehension. We will use the simple and familiar
regression problem examined step by step in Chapter 10, using a synthesis
dataset and the MxNet. The details on the problem setting will not be
repeated here. We will follow our xw formulation in the following code:
from __future__ import print_function
from mxnet import nd, autograd, gluon
import numpy as np
import mxnet as mx
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(precision=4)
data_ctx = mx.cpu()   # assumed context, consistent with @cpu(0) below
num_samples, p1, num_outputs = 1000, 3, 1  # from the shapes printed below
X=nd.random_normal(shape=(num_samples,p1),ctx=data_ctx)
# randomly generate samples for x1 and x2
noise=.1*nd.random_normal(shape=(num_samples,num_outputs),
ctx=data_ctx)
# 10% noise, zero mean, 1 variance
X[:,0] = 1.0 # / num_samples
print(f" X in affine space:{X[0:4]},{X.shape}")
True w:
[[ 4.2]
[ 2. ]
[-3.4]]
<NDArray 3x1 @cpu(0)>,(3, 1)
X in affine space:
[[ 1. -0.4902 -0.9502]
[ 1. -0.7298 -2.0401]
[ 1. 1.0408 -0.4526]
[ 1. -0.8367 -0.7883]]
<NDArray 4x3 @cpu(0)>,(1000, 3)
y=
[[6.5203]
[9.5923]
[7.7648]
[5.2164]]
<NDArray 4x1 @cpu(0)>(1000, 1)
l = 50
w, wo, wu, wc = init_wb(p1,l,num_outputs) #p1 = p+1
params = [w, wo, wu, wc]
learning_rate = .1
losses = []
plot(losses, X)
for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx).reshape((-1,1))
        # (the forward pass of the reduced LSTM and the loss computation,
        #  wrapped in autograd.record(), are elided in this excerpt)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += loss.asscalar()
    if e%10 == 0:
        print("Epoch %s, batch %s. Mean loss: %s" %
              (e, i, cumulative_loss/num_batches))
    losses.append(cumulative_loss/num_batches)
plot(losses, X)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
np.set_printoptions(precision=3)
tf.keras.backend.clear_session()
tf.random.set_seed(8)
m, T, p, l = 32, 10, 8, 4  # assumed sizes; m=32 and l=4 follow from
                           # the printed shape (32, 4) below
inputs = tf.random.normal([m, T, p])
#inputs: A 3D tensor with shape [m, T, p].
print(inputs[-1, T-1,0:p:10],'\n',inputs.shape)
lstm1 = tf.keras.layers.LSTM(l)  # assumed: the layer producing output0
output0 = lstm1(inputs)
print(output0[-1,:],'\n',output0.shape) #(32, 4)
lstm2=tf.keras.layers.LSTM(l,return_sequences=True,return_state=True)
# Load and scale MNIST (assumed; x_train/y_train are used below)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0
print(x_train.shape, x_train.shape[1:])
T = x_train.shape[1:][0]
p = x_train.shape[1:][1]
k = len(np.unique(y_train)) # k-classes of MNIST dataset
l = int(x_train[0].shape[0]*x_train[0].shape[1]/10)
# use 1/10th the total image features as the LSTM output, l
p1 = 1 + p + l # dimension of the extended affine space
print(f"Number of classes k ={k}")
model = Sequential()
# LSTM layer to process sequential data of shape [m, T, p].
# It produces an output of shape [m, l]. Default activation: tanh.
# Outputs only the final activation a^(T).
model.add(LSTM(l, input_shape=(x_train.shape[1:])))
# Add a dense layer for k-classification, using the LSTM output.
model.add(Dense(k, activation='softmax')) # a^(T) is fed here
optimizer=tf.keras.optimizers.Adam(learning_rate=.01,decay=1e-6)
model.summary()
Our model has 33,384 training parameters for the LSTM layer, the same as our calculation. The whole model consists of one LSTM layer and one dense NN layer.
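The count P = 4 × (1 + l + p) × l can be checked directly (a one-line verification using the values set above):

p, l = 28, int(28*28/10)   # p = 28 pixels per image row; l = 78
print(4*(1 + l + p)*l)     # 33384, matching model.summary()

We shall now train the model.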
model.fit(x_train,y_train,epochs=3,validation_data=(x_test,y_test));
# Using the default m_size 32
Epoch 1/3
1875/1875 [==============================] - 32s 16ms/step - loss:
0.2583 - accuracy: 0.9175 - val_loss: 0.1020 - val_accuracy: 0.9703
Epoch 2/3
1875/1875 [==============================] - 29s 16ms/step - loss:
0.1091 - accuracy: 0.9683 - val_loss: 0.0897 - val_accuracy: 0.9716
Epoch 3/3
1875/1875 [==============================] - 30s 16ms/step - loss:
0.0882 - accuracy: 0.9731 - val_loss: 0.0792 - val_accuracy: 0.9772
Lmlp = Sequential()
l = int(l/2)  # reduce the LSTM output size, since more layers are used
Lmlp.add(LSTM(l, input_shape=(x_train.shape[1:]),
              return_sequences=True)) # True: allows feeding the next layer
Lmlp.add(LSTM(l, input_shape=(x_train.shape[1:])))
Lmlp.add(Dense(2*k, activation='relu'))   # a^(T) is fed here
Lmlp.add(Dense(k, activation='softmax'))  # for classification
optimizer=tf.keras.optimizers.Adam(learning_rate=.001,decay=1e-6)
Lmlp.summary()
Lmlp.fit(x_train,y_train,epochs=3,validation_data=(x_test,y_test));
Epoch 1/3
1875/1875 [===================] - 39s 19ms/step - loss:
0.4861 - accuracy: 0.8447 - val_loss: 0.1857 - val_accuracy: 0.9420
Epoch 2/3
1875/1875 [===================] - 36s 19ms/step - loss:
0.1394 - accuracy: 0.9597 - val_loss: 0.1105 - val_accuracy: 0.9674
Epoch 3/3
1875/1875 [===================] - 36s 19ms/step - loss:
0.0965 - accuracy: 0.9711 - val_loss: 0.0818 - val_accuracy: 0.9755
import numpy as np
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM
import matplotlib.pylab as plt
import sys
%matplotlib inline
np.set_printoptions(precision=5)
tf.keras.backend.clear_session()
Figure 16.7: Sample coordinates of the trajectory of a moving vector with noise.
Xt, yt = createDataset(X, T = T )
print(Xt[1][0],Xt.shape,Xt.shape[1])
print(yt[1],yt.shape)
We split the dataset sequentially, so that we can train the model using the first part of the data and then predict the future movements of the 2D vectors. Alternatively, one may shuffle the dataset before splitting; in that case, the model is trained to predict the hidden features of the dataset.
print(X_train[1][0],X_train.shape)
print(X_test[1][0],X_test.shape)
model.summary()
Model: "sequential"
model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=.05);
Epoch 1/5
59/59 [==============================] - 8s 78ms/step - loss:
0.1552 - val_loss: 0.0290
Epoch 2/5
59/59 [==============================] - 4s 73ms/step - loss:
0.0151 - val_loss: 0.0154
Epoch 3/5
59/59 [==============================] - 4s 73ms/step - loss:
0.0124 - val_loss: 0.0126
Epoch 4/5
59/59 [==============================] - 4s 73ms/step - loss:
0.0121 - val_loss: 0.0128
Epoch 5/5
59/59 [==============================] - 4s 73ms/step - loss:
0.0119 - val_loss: 0.0125
# Accuracy assessment
from numpy.linalg import norm
predicted = model.predict(X_test)
rmse = np.sqrt(((predicted - y_test) ** 2).mean(axis=0))
print(f"Root Mean Square Error = {rmse}")
print(f"Relative Root Mean Square Error= {rmse/norm(y_test)}")
plt.plot(predicted[:nt:ns][:,1])  # assumed line, restoring the curve
                                  # named first in the legend below
plt.plot(y_test[:nt:ns][:,1],":")
plt.legend(["Predict", "Test"])
plt.xlabel('Time t')
plt.ylabel('Coordinates of vectors, x2');
It is seen in Fig. 16.9 that our LSTM model gives a reasonable prediction based on this dataset. It captures the major waving features of the moving vectors.
LSTM is one of the most powerful tools for speech recognition. There are many open-source examples. Interested readers may take a look at the MXNet example on speech LSTM. The source code is also available at their GitHub site.
References
[1] A. Graves, M. Liwicki, S. Fernández et al., “A Novel Connectionist System for Unconstrained Handwriting Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 855–868, 2009.
[2] Yu Yong, Si Xiaosheng, Hu Changhua et al., “A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures”, Neural Computation, 31(7), 1235–1270, 2019.
[3] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory”, Neural Computation, 9(8), 1735–1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735.
[4] F.A. Gers and J. Schmidhuber, “LSTM recurrent networks learn simple context-free and context-sensitive languages”, IEEE Transactions on Neural Networks, 12(6), 1333–1340, 2001.
[5] Cho KyungHyun, On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, CoRR, abs/1409.1259, 2014. http://arxiv.org/abs/1409.1259.
Chapter 17
Unsupervised Learning Techniques
17.1 Background
The world now has a massive amount of data, and it is growing. How to
make use of these data becomes very important. There are mainly two ways
to make the data useful. One is to have the data examined by experts in
the related areas, and have it labeled, and then the data can be used to
train machine learning models. Labeling data, however, is a very expensive
and time-consuming task. The other way is to develop machine learning
algorithms that read the (unlabeled) raw data, and try to extract some
latent features or characteristics from the data. This is the unsupervised
learning that we are discussing in this chapter.
There are many applications of unsupervised learning, including data compression, de-noising, clustering, recommendation, abnormality detection, and classification, just to name a few.
Unsupervised machine learning methods include principal component analysis (PCA) for principal component extraction from datasets, the family of K-means clustering methods, mean-shift methods, and the family of Autoencoders. We have already discussed PCA in Chapter 3 and hence will simply use it here for clustering algorithms. In this chapter, we shall discuss in detail K-means clustering, mean-shift clustering, and Autoencoders. We shall focus more on the fundamentals of these methods, to help readers understand other related methods and techniques that are not covered in this book.
If there is more than one nearest mean, we choose one of them, so that data-point x_q is assigned to only one set S_i^(t). This process is done for all the data-points in the dataset, which leads to a partitioning of all these data-points by the edges of the Voronoi diagram generated by these means (this will be shown more clearly in the case study examples later). Each Voronoi cell hosts a cluster with the data-points in set S_i^(t).

$$ \mathbf{m}_i^{(t+1)} = \frac{1}{n_i}\sum_{\mathbf{x}_j \in S_i^{(t)}}\mathbf{x}_j, \qquad i = 1, 2, \ldots, k \tag{17.2} $$

where n_i is the number of data-points in set S_i^(t).
This leads to an updated set of k means for step t + 1, which is then used for a new round of iteration: re-assignment of the data-points to a new set of k clusters using Eq. (17.1), and then re-updating of the k means for step t + 2. This is repeated until the process converges, meaning that the k means no longer change significantly based on some criteria, as sketched in the code below.
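The two alternating steps can be written as a bare-bones Lloyd’s iteration in NumPy (a minimal sketch for illustration; the helper name lloyd_kmeans is made up, and the SKlearn KMeans class used later in this chapter is far more robust):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), k, replace=False)]  # random initial means
    for _ in range(n_iter):
        # Assignment step, Eq. (17.1): nearest mean (squared distance)
        d = ((X[:, None, :] - means[None, :, :])**2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Update step, Eq. (17.2): mean of the points in each cluster
        # (assumes no cluster becomes empty)
        new_means = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return means, assign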
Note that this algorithm does not guarantee finding the global optimum [2]. In practice, however, the K-means algorithm is fast, and it may be one of the fastest clustering algorithms available for finding local minima. Because the initialization of the k means is random, one may need to perform the clustering several times and choose the best result.
Lloyd’s algorithm discussed above has an average complexity of O(kmT), where m is the number of data-points and T is the number of iterations. Even faster variations of Lloyd’s algorithm have been developed, one of which is Elkan’s algorithm, which uses the triangle inequality to significantly improve the efficiency. Elkan’s algorithm is quite frequently used, and we will demonstrate it in the case study examples.
Note also that one may use a distance measure other than the squared Euclidean distance for these K-means types of algorithms. This may change the convergence behavior. More details on this can be found on the Wikipedia page (https://en.wikipedia.org/wiki/K-means_clustering) and the links therein.
Figure 17.1 is a picture taken from a nicely made animation by
Chire with CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)
via the Wikimedia Commons (https://commons.wikimedia.org/wiki/File:K-
means convergence.gif). This animation shows clearly how the iteration is
performed in a typical K-means algorithm.
Figure 17.1: Convergence process of a typical K-means algorithm. The data-points are
given and fixed in 2D feature space, which are to be grouped in k = 3 clusters: blue, yellow,
and red ones.
Figure 17.2: Examination of the converging process of a typical K-means algorithm. The
data-points are given in 2D feature space, and are to be grouped into k = 3 clusters: blue,
yellow, and red ones.
By the 5th iteration, the clustering is already quite good and not far from that achieved at the final, 14th iteration.
Note that when the number of the clusters k increases, the Voronoi
diagram will become more complicated, as will be shown in one of the
examples later for a 10-digit clustering problem.
We will use the following function to evaluate the performance of the method. It uses the SKlearn class make_pipeline to compute the evaluation metrics. This function will be called each time a K-means run with a given initialization of the means is completed.
def bench_k_means(kmeans, name, data, labels):
    """Benchmark to evaluate the KMeans initialization methods.

    Parameters
    ----------
    kmeans : KMeans instance
        A :class:`~sklearn.cluster.KMeans` instance with the
        initialization already set.
    name : str
        Name given to the strategy. It will be used to show
        the results in a table.
    data : ndarray of shape (n_samples, n_features)
        The data for clustering.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics
        which require some supervision.
    """
    t0 = time()
    (n_samples, n_features), n_k = data.shape, np.unique(labels).size
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0
    results = [name, fit_time, estimator[-1].inertia_]
    results += [m(labels, estimator[-1].labels_)
                for m in clustering_metrics]
    print('data.shape:', data.shape)
    print('labels:', labels, ' Shape:', labels.shape)
#____________________________________________________________
#____________________________________________________________
# Define a function to plot the clusters.
# The data needs to be reduced to 2D, if it is not already 2D.
# kmeans needs to have already been fitted.
def plot_clusters(reduced_data, kmeans, fig_title):
    plt.figure(figsize=(4.0, 4.0), dpi=100)
    # Plot the decision boundary. Assign a color to each region.
    h = 0.02   # mesh step size (assumed; not shown in this excerpt)
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # (prediction of the cluster of each mesh point and the background
    #  plotting are elided in this excerpt)
    plt.title(fig_title)
    xyd = min(abs(x_min), x_max, abs(y_min), y_max)   # was: xyd = 3.5
    x_min, x_max, y_min, y_max = -xyd, xyd, -xyd, xyd
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
    #print('centroids=', centroids)
We now generate a set of synthetic data-points using the SKlearn make_blobs class in a random manner. In this example, we design a total of four study cases, which are detailed in the cell below. Readers may simply assign a number (from 1 to 4) to the variable “Case”, and the data-points for this case will then be generated.
if Case == 1:
    #_______________________________________________________
    # Case 1: synthetic sample with 2 groups of data-points
    # in 2D space, n_features=2, n_k=2
    centers = [[1, 1], [-1, -1]] # centers for 2D points
    #_______________________________________________________
elif Case == 2:
    # Case 2: synthetic sample data that has 3 groups of
    # data-points in 3D space, n_features=3, n_k=3
    centers = [[1, 1, 1], [-1,-1,-1], [1,-1,0.5]] # for 3D points
    #_______________________________________________________
elif Case == 3:
    # Case 3: synthetic sample data that has 3 groups of
    # data-points in 2D space, n_features=2, n_k=3
    centers = [[1, 1], [-1,-1], [1,-1]] # set 3 centers 2D points
    #_______________________________________________________
elif Case == 4:
    # Case 4: synthetic sample data that has 5 groups of
    # data-points in 6D space, n_features=6, n_k=5
    centers = [[ 1, 0, 1, 1, 1, 0], [-1,-1,-1,-1, 0,-1],
               [ 1,-1,-1, 0,-1,-1], [ 1, 1, 0,-1,-1,-1],
               [ 0, 1, 1, 0,-1,-1]]
    # set 5 centers for 5 groups of data-points in 6D
    #________________________________________________________
else:
    print('There are a total of 4 cases for this example!!!')
# Generation of data-points
n_s, c_std = 1888, 0.4888 # n_s: number of samples;
#c_std: standard deviation for random data-points generation
data,labels=make_blobs(n_samples=n_s,centers=centers,
cluster_std=c_std) # number of samples
print_data_info(data, labels,n_features,n_k,n_samples,n_comp)
bench_k_means(kmeans=kmeans, name="k-means++",data=data,
labels=labels)
r_data = PCA(n_components=n_comp).fit_transform(data)
bench_k_means(kmeans=kmeans,name="k-means++pca",data=r_data,
labels=labels)
print(78 * '_')
print_evaluation_metrics()
Set Case = 1 in the code given above; the clustering has been completed on the author’s laptop, and the results are summarized in the output table above. We may observe the following:
• First is the computation time for clustering. All runs are very fast, on the order of 20 ms. The K-means with PCA-initiated means (the PCA-init method) takes about 7 ms and runs fastest (without counting the time for the PCA). This is because it runs only once. The times for the other three methods are roughly the same, and slower than the PCA-init, because they are set to run 5 times. Because our toy problem with synthetic data is very small in scale, the measured computation time is not accurate, so we cannot draw strong conclusions from it.
• Next is the quality of clustering. The first three methods listed in the output table give the same quality results in all the measuring metrics. Because this toy problem is very simple, there may be just one globally optimal solution, and any of these methods gives the same answer. However, the k-means++pca gives a result that is different from that of all the other three methods, especially in the inertia measure. This is because the K-means is done using the PCA-processed data. For Case-1, the PCA did not reduce the data dimension, but projected the dataset onto its two principal axes. The dataset in the principal component space becomes flatter (see Chapter 3). Thus, the measured inertia becomes much larger. The scores measured in the other metrics change somewhat, but not very much.
• Using the same code, we set n_init to 1 (instead of 5) and ran it again. We obtained the same results. This also supports the argument that there is likely just one optimal solution to this toy problem, because randomly selected initial means do not give a different solution for any of the methods used. Readers can easily confirm this using the code given.
Readers may set Case to other numbers (2, 3, or 4). Similar observations can be made for all these toy cases.
else:
print('There are a total of 4 cases for this example!!!')
print_data_info(data, labels,n_features,n_k,n_samples,n_comp)
kmeans = KMeans(init="k-means++", n_clusters=n_k, n_init=n_i)
kmeans.fit(reduced_data)
fig_title = "Case-"+str(Case)
plot_clusters(reduced_data,kmeans,fig_title)
print('Cluster centers set for the data-points =', centers)
data= [[ 0.60389228 -1.62706035]]
data.shape= (1888, 2)
No of clusters k:3; No of samples:1888; No of features p:2;
No of PCA comps:2
data.shape: (1888, 2)
labels: [2 0 2 ... 1 0 2] Shape: (1888,)
Figure 17.4 plots the clusters and the data-points for all four cases. We note the following:
• All these cases are well clustered using the K-means algorithm, with a converged centroid (marked with a white X) for each cluster.
Figure 17.4: K-means clusters for synthetic data-points generated randomly for four different cases of datasets. The original datasets for Case-2 and Case-4 are in higher-dimensional feature spaces. For these datasets, the feature space is reduced to 2D, and then the K-means clustering is performed.
• These clusters are hosted in the colored Voronoi cells (one cluster per color), and the boundaries between these Voronoi cells divide the clusters.
• The line that connects two centroids (for each of the cases) is perpendicular (orthogonal) to the Voronoi cell boundary between them (or its extended straight line). This orthogonality is evidence that these cells are essentially Voronoi cells by definition. It is produced by minimizing the squared Euclidean distance of a point in a cluster to its centroid in the K-means clustering algorithm.
We first load the dataset using the SKlearn load_digits() and perform the clustering using the same SKlearn K-means. This dataset contains images of 10 handwritten digits, from 0 to 9. We would like each cluster to contain the images of the same digit. We thus have k = 10 clusters for this problem. Each of these images has 8 × 8 pixels, meaning that the number of features is 64.
# Import necessary module for this task
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline
# load the digits dataset
data, labels = load_digits(return_X_y=True)
(n_samples,n_features),n_digits=data.shape,np.unique(labels).size
n_k = n_digits
n_comp = 2
print_data_info(data, labels,n_features,n_k,n_samples,n_comp)
print('Take a look at the image for digit 9:')
plt.imshow(data[9].reshape(8,8))
Figure 17.5: Sample image of a handwritten digit in the dataset of UCI ML hand-written
digits.
As shown in Fig. 17.5, in this dataset, each of the images has 64 features
that are the image values at the 8 × 8 pixels.
print('data.shape:',data.shape)
print(f"Number of digits (k):{n_digits}; Number of samples (n):\
{n_samples}; Number of features (p):{n_features}")
print(78 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilht')
n_i = 5
pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits,
n_init=1)
bench_k_means(kmeans=kmeans, name="PCA-init", data=data,
labels=labels) # by default it uses the Elkan algorithm
kmeans = KMeans(init="random",n_clusters=n_digits,n_init=n_i,
random_state=0)
bench_k_means(kmeans=kmeans, name="random", data=data,
labels=labels)
n_compr = 10 # n_digits
r_data = PCA(n_components=n_compr).fit_transform(data)
bench_k_means(kmeans=kmeans,name="k-means++pca",data=r_data,
labels=labels)
print(78 * '_')
print_evaluation_metrics()
• First is the computation time taken for the clustering. All runs are very fast for this handwritten digit dataset. The K-means with PCA-initiated means (the PCA-init method) takes only ∼20 ms. It runs fastest (without counting the time for the PCA), because it runs only once. The times for the other three methods are roughly the same, and slower than the PCA-init, because they were set to run 5 times. The k-means++ method took about 5 times more time. Note that the runtime can also change depending on whether or not the computer is busy with other tasks at the same time.
• The inertia value is the sum of the squared distances of the samples to their centroids. The random and k-means++ methods give quite close inertia values. The PCA-init gives a higher value, and the k-means++pca gives a much smaller value. This is because the use of PCA has reduced the dimension of the feature space (from 64 to 10 in this setting), and hence the distances.
Readers can easily try out different settings and run this example using the code given.
Let us now plot the clustered data-points in the 2D plane with the help of (again) the PCA.
17.2.5.3 Visualize the results for handwritten digit clustering using PCA
The dataset for the handwritten digits is 64-dimensional in the feature space. It is not possible for us to visualize the clustered results in such a high dimension. We thus use the PCA to reduce the feature space to a two-dimensional PCA principal component space, and then plot the data together with the clusters.
Figure 17.6: Results of the 10-means clustering using the dataset of UCI ML hand-written digits.
Notice from Fig. 17.6 that we have a more complicated Voronoi diagram for the 10-means clustering problem.
%matplotlib inline
print(__doc__)
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=n_s, centers=centers,
cluster_std=c_std)
print('X=',X[0:2:], ' X.shape=', X.shape)
# This dataset is the same as Case-3 in Example 1
# Estimate the kernel bandwidth from the data (setting assumed;
# estimate_bandwidth is imported above)
bandwidth = estimate_bandwidth(X, quantile=0.2)
# MeanShift clustering
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
# print('Predicted cluster centers=', cluster_centers)
labels_unique = np.unique(labels)
# Find the unique elements in an array
n_clusters_ = len(labels_unique)
# MeanShift clustering on the PCA-reduced digits data
bandwidth = estimate_bandwidth(reduced_data, quantile=0.2)  # assumed setting
print('Bandwidth=', bandwidth)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(reduced_data)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
# print('Predicted cluster centers=', cluster_centers)
labels_unique = np.unique(labels)
# Find the unique elements in an array
n_clusters_ = len(labels_unique)
print("Number of estimated clusters : %d" % n_clusters_)
fig_title ='MeanShift for Digits'
plot_clusters(reduced_data,ms,fig_title)
print('Centroids by MeanShift:          Centroids by kmeans++:\n')
ms_cntr = np.sort(ms.cluster_centers_, axis=0)
# sort for easy comparison
km_cntr = np.sort(kmeans.cluster_centers_, axis=0)
for i in range(n_clusters_):
    print(f'{ms_cntr[i,0]:12.4f}{ms_cntr[i,1]:9.4f}'
          f'{km_cntr[i,0]:19.4f}{km_cntr[i,1]:9.4f}')
Bandwidth= 5.722875000219844
Number of estimated clusters : 10
Figure 17.8: Results of a MeanShift clustering using the dataset of UCI ML hand-written
digits.
17.4 Autoencoders
Figure 17.9: A neural network for an autoencoder. It consists of two sub-nets, one for the encoder and one for the decoder, which are bridged by the bottleneck layer z. An input data x into the encoder, with the current training parameters, produces latent features z. The same z is then fed into the decoder and produces an output x̃. The input x itself is used as the label at the final output layer, and the residual r = x̃ − x is minimized to force the output x̃ close to the input x. Once trained properly with a dataset, the autoencoder produces the latent features at the bottleneck, which are the extracted/hidden features in the dataset.
The compressed data is then fed to the decoder, which may have one or more layers and its own training parameters. The decoder decodes the compressed data to reconstruct the original data.
The network is trained with a simple criterion: the decoded (reconstructed) data at the output layer must be as close as possible to the original input data. This is why the training does not require the input data to be labeled. The cost functions defined in the previous chapters can be used to enforce this criterion for different types of data. The training process is essentially the same as training any other neural network, via a back-propagation process. Once the training is completed, all the training parameters in these three parts are set, and the autoencoder is capable of reconstructing the data from the compressed representation whenever needed. This is the basic strategy of an autoencoder.
Autoencoders can be built with various types of layers including dense
layers, convolution layers, their combinations, and other forms of layers.
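As a minimal illustration of Fig. 17.9, a dense autoencoder can be sketched with Keras as follows (the layer sizes are assumed here, for a flattened 784-feature input such as an MNIST image):

from tensorflow.keras import layers, models

inp = layers.Input(shape=(784,))
z = layers.Dense(32, activation='relu')(inp)      # encoder -> bottleneck z
out = layers.Dense(784, activation='sigmoid')(z)  # decoder reconstructs x~
autoencoder = models.Model(inp, out)
# The input itself serves as the label: the residual between the
# reconstruction and the input is minimized.
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=10)    # no labels needed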
import pixellib
from pixellib.semantic import semantic_segmentation
from PIL import Image             # assumed import; Image is used below
import matplotlib.pyplot as plt   # assumed import
segment_image = semantic_segmentation()
segment_image.load_pascalvoc_model(
    'deeplabv3_xception_tf_dim_ordering_tf_kernels.h5')
segment_image.segmentAsPascalvoc('../images/inkedcars15.jpg',
    output_image_name='../images/cars15-s.jpg')
img = Image.open('../images/inkedcars15.jpg')
plt.figure(figsize=(10, 15))
plt.imshow(img);
img = Image.open('../images/cars15-s.jpg')
plt.figure(figsize=(10, 15))
plt.imshow(img);
Figure 17.11: Resulting image after segmentation. The object type of the image is color
coded.
The color code is used to distinguish the types of objects. The details of the color codes can be found at the PixelLib library site. One can also overlay the segmented objects with the original image:
segment_image.segmentAsPascalvoc('../images/inkedcars15.jpg',
output_image_name = '../images/cars15-s.jpg', overlay = True)
img = Image.open('../images/cars15-s.jpg')
plt.figure(figsize=(10, 15))
plt.imshow(img);
Figure 17.12: Overlaid image of the original and the segmented images.
import pixellib
from pixellib.semantic import semantic_segmentation
import time
from pixellib.instance import instance_segmentation
segment_image = instance_segmentation()
segment_image.load_model("mask_rcnn_coco.h5")
start = time.time()
segment_image.segmentImage("../images/inkedcars-blocked.jpg",
output_image_name = "../images/cars-blocked-s.jpg")
end = time.time()
print(f"Inference Time: {end-start:.2f}seconds")
img = Image.open("../images/inkedcars-blocked.jpg")
plt.figure(figsize=(10, 15));
plt.imshow(img);
Processed image saved successfully in your current working
directory.
Inference Time: 19.08seconds
img = Image.open("../images/cars-blocked-s.jpg")
plt.figure(figsize=(10, 15))
plt.imshow(img)
Figure 17.14: Overlaid image of the original and segmented images. The object type is
color coded.
segment_image.segmentImage("../images/inkedcars-blocked.jpg",
output_image_name = "../images/cars-blocked-sf.jpg",
show_bboxes = True)
img = Image.open("../images/cars-blocked-sf.jpg")
plt.figure(figsize=(10, 15))
plt.imshow(img);
Figure 17.15: Overlaid image of the original and segmented images. The object type is
color coded. Objects are boxed.
Examples of Autoencoders using the MNIST dataset, with codes by the Keras Team, are available at The Keras Blog (https://blog.keras.io/building-autoencoders-in-keras.html). Variational Autoencoders (VAEs) can also be used for segmentation. The theoretical discussion of VAEs is given in the last section of this chapter. Autoencoders have many other applications, including text embeddings (http://yaronvazana.com/2019/09/28/training-an-autoencoder-to-generate-text-embeddings/), anomaly detection, and popularity prediction, just to name a few.
Figure 17.16: Relationship between an Autoencoder and PCA [10] (with permission).
The multiple colors of the output neurons in the decoder indicate that the weights related to the bottleneck neurons may all contribute to a neuron of the output of the decoder.
The above comparison uses words like “similar” and “equivalent”. If one uses linear activation functions, it is possible to train an Autoencoder to be the same as the linear PCA, by properly enforcing a set of constraints during the training, including tying the weights in the encoder and decoder and imposing orthogonality conditions on the weights. For more details, one may refer to the article by Ranjan [10]. The constraint Autoencoder [9] is used for feature-space dimension-reduction in inverse analyses using TubeNets [11–13].
Figure 17.17: A variational autoencoder (VAE). It consists of two sub-nets, one for the encoder and one for the decoder, which are bridged by the bottleneck layer z that consists of three components: a vector of means μ, a covariance matrix Σ, and a vector ε drawn from the standard normal distribution. An input data x into the encoder, with the current training parameters, produces μ and Σ, which are used together with ε to form the latent features z that follow a normal distribution. The same z is then fed into the decoder and produces an output x̃. The input x itself is used as the label at the final output layer, and the residual r = x − x̃ is minimized to force the output x̃ close to the input x, while the KL-divergence of the standard normal distribution from the distribution of z is also minimized. Once trained properly with a dataset, the autoencoder produces the latent distributive features at the bottleneck, which can be used to reconstruct data-points in the input dataset, or to generate new data-points that may not be in the dataset. [Image modified based on that of EugenioTL from Wikimedia Commons, under the CC BY-SA 4.0 license.]
$$ \mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \tag{17.3} $$

To derive the KL-divergence term, consider first two univariate normal distributions

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma_p}\exp\left(-\frac{(x-\mu_p)^{2}}{2\sigma_p^{2}}\right), \qquad q(x) = \frac{1}{\sqrt{2\pi}\,\sigma_q}\exp\left(-\frac{(x-\mu_q)^{2}}{2\sigma_q^{2}}\right) $$

where μ and σ are, respectively, the mean and the standard deviation of the distribution specified by the subscript. Note that in this case, both p(x) and q(x) are scalar functions of x, and their distributions are fully controlled by their means and standard deviations.
The KL-divergence from q(x) to p(x) can be written as
DKL (qp) = q(x) log(q(x)) − log(p(x)) dx
1 x − μq 2 1 x − μp 2
= q(x) − log(σq ) − + log(σp ) + dx
2 σq 2 σp
σp 1 x − μp 2 1 x − μq 2
= q(x) log + − dx
σq 2 σp 2 σq
σp 1 x − μp 2 x − μq 2
= Eq log + −
σq 2 σp σq
620 Machine Learning with Python: Theory and Applications
σp
1 2 1 2
= log + 2 Eq (x − μp ) − Eq (x − μq )
σq
2σp 2σq2
σp 1 2 1
= log + 2 Eq (x − μp ) −
σq 2σp 2
σp σq2 + (μq − μp )2 1
= log + −
σq 2σp2 2
1 σp σq2 (μq − μp )2
= 2 log + 2+ −1 (17.6)
2 σq σp σp2
Setting p(x) to be the standard normal distribution ($\mu_p = 0$, $\sigma_p = 1$), Eq. (17.6) reduces to

$$ D_{KL}(q\,\|\,p) = \frac{\sigma_q^{2} + \mu_q^{2}}{2} - \log\sigma_q - \frac{1}{2} \tag{17.9} $$
For the multivariate case, with $k$-dimensional normal distributions,

$$
\begin{aligned}
p(\mathbf{x}) &= \mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_p,\boldsymbol{\Sigma}_p) = \frac{1}{\sqrt{(2\pi)^{k}\,|\boldsymbol{\Sigma}_p|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\right)\\
q(\mathbf{x}) &= \mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_q,\boldsymbol{\Sigma}_q) = \frac{1}{\sqrt{(2\pi)^{k}\,|\boldsymbol{\Sigma}_q|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right)
\end{aligned} \tag{17.10}
$$
The KL-divergence then becomes

$$
\begin{aligned}
D_{KL}(q\,\|\,p) &= \mathbb{E}_q\left[\frac{1}{2}\log\frac{\det\boldsymbol{\Sigma}_p}{\det\boldsymbol{\Sigma}_q} + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\right]\\
&= \frac{1}{2}\left(\log\frac{\det\boldsymbol{\Sigma}_p}{\det\boldsymbol{\Sigma}_q} + \mathbb{E}_q\big[(\mathbf{x}-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\big] - \mathbb{E}_q\big[(\mathbf{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\mathbf{x}-\boldsymbol{\mu}_q)\big]\right)\\
&= \frac{1}{2}\left(\log\frac{\det\boldsymbol{\Sigma}_p}{\det\boldsymbol{\Sigma}_q} + \mathbb{E}_q\big[(\mathbf{x}-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\mathbf{x}-\boldsymbol{\mu}_p)\big] - k\right)\\
&= \frac{1}{2}\left(\log\frac{\det\boldsymbol{\Sigma}_p}{\det\boldsymbol{\Sigma}_q} + \mathrm{tr}\big(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}_q\big) + (\boldsymbol{\mu}_q-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\boldsymbol{\mu}_q-\boldsymbol{\mu}_p) - k\right)
\end{aligned} \tag{17.11}
$$
Setting $p$ to the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, Eq. (17.11) reduces to

$$ D_{KL}(q\,\|\,p) = \frac{1}{2}\left(-\log(\det\boldsymbol{\Sigma}_q) + \mathrm{tr}(\boldsymbol{\Sigma}_q) + \boldsymbol{\mu}_q^{\top}\boldsymbol{\mu}_q - k\right) \tag{17.13} $$
Finally, for numerical stability reasons, we often replace $\Sigma_{q_i}$ with $\exp(\Sigma_{q_i})$ (that is, the encoder outputs the log-variances). For a diagonal covariance, the final form of the KL-divergence often used in the VAE loss of Eq. (17.4) is then

$$ D_{KL}(q\,\|\,p) = \frac{1}{2}\sum_{i=1}^{k}\Big(\exp(\Sigma_{q_i}) + \mu_{q_i}^{2} - \Sigma_{q_i} - 1\Big), \tag{17.15} $$
where the means and variances are computed using the current learning parameters W for q, and W̃ for p, for the deterministic part. The standard normal distribution is used for the stochastic part, as shown in Eq. (17.3). This allows the back-propagation during training to pass through the bottleneck layer without the stochastic part affecting the computation of the gradients with respect to W in a VAE model. This is the so-called reparameterization trick.
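The trick can be written directly as a small sampling layer. The sketch below is consistent with Eq. (17.15) and with the Keras VAE example cited next; the class and function names are our own assumptions, not the book's:

import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Reparameterization: z = mu + exp(0.5*log_var)*eps, with eps ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs           # Sigma_q predicted as log-variance
        eps = tf.random.normal(shape=tf.shape(z_mean))   # stochastic part
        # The output is deterministic in (z_mean, z_log_var), so gradients
        # flow through the bottleneck parameters, not through eps.
        return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_divergence(z_mean, z_log_var):
    """KL term of Eq. (17.15), summed over the k latent dimensions."""
    return 0.5 * tf.reduce_sum(
        tf.exp(z_log_var) + tf.square(z_mean) - z_log_var - 1.0, axis=-1)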
With the above formulation, a VAE can now be built with minimal change to the standard autoencoder. Example code is available at https://keras.io/examples/generative/vae/, which shows step by step how to create, train, and make use of a VAE model.
References
[1] G. Hamerly and C. Elkan, Alternatives to the k-means algorithm that find better
clusterings, 2002. https://doi.org/10.1145/584792.584890.
[2] J.A. Hartigan and M.A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm,
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108,
1979. http://www.jstor.org/stable/2346830.
[3] F. Pedregosa, G. Varoquaux, A. Gramfort et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12(85), 2825–2830, 2011. http://jmlr.org/papers/v12/pedregosa11a.html.
[4] D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space
analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5),
603–619, 2002.
[5] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal, 37(2), 233–243, 1991. https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.690370209.
[6] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[7] D.P. Kingma and M. Welling, An introduction to variational autoencoders, CoRR, abs/1906.02691, 2019. http://arxiv.org/abs/1906.02691.
[8] G.E. Hinton, A. Krizhevsky and S.D. Wang, Transforming auto-encoders, Artificial
Neural Networks and Machine Learning — ICANN 2011, 2011.
[9] D. Shuyong, H. Zhiping, G.R. Liu et al., A novel inverse procedure via creating
tubenet with constraint autoencoder for feature-space dimension-reduction, Interna-
tional Journal of Applied Mechanics, 13(8), 2150091, 2021.
[10] C. Ranjan, Understanding Deep Learning: Application in Rare Event Prediction,
Connaissance Publishing, 2020. www.understandingdeeplearning.com.
[11] G.R. Liu, S.Y. Duan, Z.M. Zhang et al., Tubenet: A special trumpetnet for explicit
solutions to inverse problems, International Journal of Computational Methods,
18(01), 2050030, 2021. https://doi.org/10.1142/S0219876220500309.
[12] L. Shi, F. Wang, G. Liu et al., Two-way TubeNets Uncertain Inverse methods for
improving Positioning Accuracy of robots Based on Interval, The 11th International
Conference on Computational Methods (ICCM2020), 2020.
[13] D. Shuyong, S. Lutong, G.R. Liu et al., An uncertainty inverse method for parameters
identification of robot arms based on two-way neural network, Inverse Problems in
Science & Engineering, Revised, 2021.
Chapter 18

Reinforcement Learning (RL)
Games: The agent learns to take the best actions to maximize the score.
Each action will affect the states of the agent in the gaming environment. A
typical example is the widely publicized AlphaGo.
We know from the previous chapters that any optimization problem has an objective function, given in terms of the learning parameters. The learning process essentially finds the minimum, making use of the gradient information of the objective function. In a reinforcement learning process, the objective is to accumulate as much reward as possible, but the agent does not know where the reward is in the given environment.
Clearly, the agent needs to take actions in steps at different points in time. It must keep track of how the state of the action-outcomes evolves over time. At each time step, the current state is the basis for choosing an action, which results in a new state for the next time step. The outcome at each step is partially random and partially dependent on the action taken. Therefore, the Markov decision process, or MDP, is a proper description for our problem.
At any point in time, we need a rule that tells the agent how to act, given any possible values of the state. Such a rule, determining the actions as a function of the states, is called a policy or policy function [5]. The fundamental equation governing this sequential decision process is known as the Bellman equation.
We naturally want the state to evolve in the best possible way, so
that actions on the evolving states can maximize our objective, which is
accumulating rewards. The best possible value of the accumulated rewards,
written as a function of the state, is called the state-value function. When
it is written as a function of the action (and hence also implicitly of the state),
it is called the action-value function. Our reinforcement learning problem is now cast as an optimization problem in discrete time, with the action-value function as the objective function.
The final important strategy is how to find the optimizer (the optimal
action and state) that optimizes the action-value function. For this, we use
the Bellman equation, which gives the relationship between the action-
value function at a time point and that at the next time point. Our
optimization problem in discrete time is stated in a step-by-step recursive
form (known as backward induction). This leads to the so-called value
iterations, known as value-based algorithms.
Alternatively, RL can also be cast as a policy optimization problem, letting the agent learn an optimal policy. This is because when the policy is optimized and converged in an MDP process, the value function also converges to its optimum. The policy and the value function are directly interconnected in MDP processes. This leads to policy-based algorithms.
• State space S, which the agent interacts with and observes.
• Action space A, which defines the possible actions the agent may take.
• Transition probability function P(s → s′; a), which defines the probability of transitioning from state s to state s′ under action a.
• Reward function R(s → s′; a), which defines the immediate reward received after the transition from s to s′ under action a.
• Discount factor γ ∈ (0, 1), which discounts rewards from future steps in the process (see the sketch after this list).
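These five ingredients can be written down directly in Python. The following is a minimal sketch of a hypothetical two-state MDP (our own toy construction, not an example from the book; all names are assumptions):

S = ["s0", "s1"]               # state space
A = ["left", "right"]          # action space
gamma = 0.9                    # discount factor

# P[(s, a)] lists (next_state, probability); R[(s, a, s1)] is the reward.
P = {("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
     ("s0", "left"):  [("s0", 1.0)],
     ("s1", "right"): [("s1", 1.0)],
     ("s1", "left"):  [("s0", 1.0)]}
R = {("s0", "right", "s1"): 1.0}   # unlisted transitions give zero reward

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * R.get((s, a, s1), 0.0) for s1, p in P[(s, a)])

print(expected_reward("s0", "right"))   # prints 0.8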
18.3 Policy

A policy is a rule that maps state-action pairs to probabilities:

$$\pi : S \times A \to [0, 1] \tag{18.1}$$

It is a conditional probability, $\pi(a \mid s) = P(a_t = a \mid s_t = s)$: the probability of taking action a given that the agent is in state s.
Optimal policy: There can be a large number of possible policies (combinations of the variables in the state space and those in the action space). The agent needs to find an optimal policy that maximizes the expected return in terms of cumulative reward. In value-based algorithms, the simplest "optimal" policy for the agent is the greedy policy, which takes the action that gives the best value among the possible choices at each step. However, this at-single-time best choice may not be the best over the entire MDP process. Thus, in policy-based algorithms, the goal is set for the agent to learn a smoother optimal policy over multiple consecutive states in the MDP process, which can be done using MLPs (Chapter 13) while the agent explores the environment.
When the agent takes an action, it receives an immediate reward for the current action. In addition, the agent is going to take more actions in the coming time steps, for which it should also receive rewards. However, future rewards should be valued less at the current time. The discounted return accumulates all of these rewards:

$$R = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$

where γ ∈ (0, 1) is the discount rate that devalues the reward r_{t+k+1} coming from the future. Note also that the contribution of rewards from the far-future terminal state (if any) vanishes.
When the policy π is followed by the agent, the state-value function Vπ(s) can now be defined as the expected return starting from state s_t = s. It estimates the total return from all future rewards:

$$V_\pi(s) = E\left[R \mid s_t = s, \pi\right] = E\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \,\middle|\, s_t = s, \pi\right] \tag{18.4}$$
The Bellman equation gives the Q(s, a) values at the current time step in terms of those at the next time step, Q′ = Q(s′, a′), which is known as backward induction. Using Eq. (18.6), we can devise the following simple but very basic algorithm, which is the basis for a Q-learning algorithm.
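For reference, the Bellman relation for the action-value function takes the standard form

$$Q(s, a) = E\left[r + \gamma \max_{a'} Q(s', a')\right]$$

which expresses the current Q value as the immediate reward plus the discounted best Q value attainable from the next state.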
Each of the 16 cells is a state, and at each cell the robot has to choose one of the four actions it can take. Therefore, we can build a table of records, called the Q-table, with shape (16, 4). At the beginning, nothing has happened, and hence the Q-table is initialized with all zeros. At the goal cell, we give a value of 1.0 as the reward. All the cells occupied by obstacles (the robot does not know where they are at the beginning) send the robot back to square one.
With this setting, and the formulations presented in the previous sections, the formula for updating Q in the Q-learning algorithm can be written as

$$Q(s, a) \leftarrow Q(s, a) + \eta \underbrace{\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]}_{\text{temporal-difference (gradient-like)}}$$

where η is the learning rate.
The robot wanders largely at random during the learning process, and hence when it arrives at the goal cell (if it does), the number of steps it takes can be quite large (larger than 16). Even if it fails to arrive at the goal cell because it hits an obstacle, ending the episode, the number of steps can also be quite large, because of the possibility of "wandering".
We are now ready to code our Q-learning algorithm. We need a module called Gym, under the MIT License, Copyright (c) 2016 [OpenAI] (https://openai.com). It is a toolkit for developing and comparing reinforcement learning algorithms. For our example, we use OpenAI Gym to set up the environment: the room with obstacles. We also use TensorFlow to build the algorithm and matplotlib to plot the results. First, we import the necessary modules. The following code is from awjuliani/DeepRL-Agents under the MIT License:
# https://github.com/awjuliani/DeepRL-Agents
# awjuliani/DeepRL-Agents is licensed under the MIT License
# import necessary modules
import numpy as np
import gym
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
Note that the robot does not know this setting before starting.
We shall now examine the Q-learning algorithm for the robot exploring
a room with obstacles.
def print_state(Q, ns, na):
    # Q has shape (env.observation_space.n, env.action_space.n)
    print('Initial Q-Table Values: (', ns, ',', na, ')')
    print('Norm of Q =', np.linalg.norm(Q, np.inf))
    for s in range(ns // na):
        for a in range(na):
            print(f'{Q[4*a, s]:7.4f}{Q[4*a+1, s]:7.4f}'
                  f'{Q[4*a+2, s]:7.4f}{Q[4*a+3, s]:7.4f}')
lrv = schedule(lr_0, episodes, beta_lr)
gav = schedule(gamma_0, episodes, beta_g)
epsv = schedule(eps_0, episodes, beta_eps)
print(f'lrv_last={lrv[episodes]:.4e}',
      f'gav_last={gav[episodes]:.4e}',
      f'epsv_last={epsv[episodes]:.4e}')
# initialize lists for recording the total rewards and steps
jList = []   # walked steps in each episode
rList = []   # total reward in each episode (assumed companion list, used below)
np.random.seed(1)
        j += 1
        # Choose the action greedily from the current Q-table, with random
        # noise added to the action values for exploration
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n)*eps)
        # The random noise is a must in the initial episodes, because the
        # robot has no choice but to walk randomly. It should be reduced
        # as the episodes increase, and shall vanish as the number of
        # episodes grows to infinity.
        s1, r, done, _ = env.step(a)   # take the step (standard Gym API;
                                       # an assumed restoration of this line)
        # temporal-difference update given above (assumed restoration; lr and
        # ga are the scheduled learning rate and discount for this episode)
        Q[s, a] = Q[s, a] + lr*(r + ga*np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1        # update the state
        if done: break
Figure 18.2: Reward received by the agent in the FrozenLake-v0 environment through
Q-learning.
Figure 18.2 shows the record of rewards received over each episode of attempts by the robot, starting with essentially random actions. The robot wanders during the initial few hundred episodes, and then learns its way to reach the goal cell with a good success rate.

Figure 18.3 shows the evolution of the maximum value in the Q-table over the episodes. These values become quite close to 1.0 after a few hundred episodes.
# a utility function
def mv_avg(x, n):  # moving average over the past n points
    return [sum(x[i:i+n])/n for i in range(len(x)-n)]
The following procedure sets up the graph for our Q-Network. It has an array of 16 neurons at the input layer, a 16 × 4 weight matrix, and an array of 4 neurons at the output layer: a simple configuration of 16 × 4 with a single layer of weights.
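The graph itself can be sketched as follows. This is assumed to follow the awjuliani/DeepRL-Agents reference implementation cited above, written against TensorFlow's compat.v1 interface so it runs under TF2; the variable names are assumptions:

import tensorflow.compat.v1 as tf1
tf1.disable_eager_execution()

inputs1 = tf1.placeholder(shape=[1, 16], dtype=tf1.float32)   # one-hot state
W = tf1.Variable(tf1.random_uniform([16, 4], 0, 0.01))        # the 16 x 4 weights
Qout = tf1.matmul(inputs1, W)                                 # predicted Q values
predict = tf1.argmax(Qout, 1)                                 # greedy action

nextQ = tf1.placeholder(shape=[1, 4], dtype=tf1.float32)      # target Q values
loss = tf1.reduce_sum(tf1.square(nextQ - Qout))               # squared TD error
trainer = tf1.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)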
We then create and run a session following the graph to train the Q-Network, based on the Q-learning algorithm.
init = tf.global_variables_initializer()
# op for initializing the variables: one more piece of the graph
for i in range(episodes):
    # Reset the room environment, which gives an initial
    # state for the robot agent
    s = env.reset()
    # ... episode body: run the session to choose actions, step the
    #     environment, and update the network, accumulating the total
    #     reward rAll and the step count j ...
    jList.append(j)
    rList.append(rAll)
    jRate.append(rAll/j)
N_avg = 100
plt.plot(rList, label="Current Reward")
plt.plot(mv_avg(rList, N_avg), label="Average reward over " + str(N_avg))
plt.title('Reward received')
plt.legend(loc='lower right')
plt.show()
Figure 18.4: Reward received (top) by the agent and Q value evolution (bottom) in the FrozenLake-v0 environment during Q-learning using an NN.
The results and learning process are shown in Fig. 18.4. Observations quite similar to those of Example 1 can be made.

With an understanding of how to build a Q-Network, one should not have much difficulty in learning to construct deepnets for Q-learning. Our study has found that when a greedy policy is used at each step, such an at-single-time best choice may not be the best over the entire process. The agent may be trapped in a local optimum and fail to arrive at the global optimum, as shown in the above examples. Policy-based algorithms can overcome this problem.
The Q-learning and DQN studied in the previous sections are known as value-based methods, aiming to maximize the Q values. Instabilities can often be observed in this type of algorithm. Alternatively, one can optimize the policy instead, using so-called policy-based methods such as the policy gradient methods. These methods have been studied quite intensively in the past few years, and significant improvements have been made. They aim to optimize the policy, instead of the value. The transition probability P(s → s′; a) determines the return R, and it depends on the action a, which is in turn controlled (conditioned) by the policy π. Therefore, if we can find an optimal policy that maximizes the return, our goal is achieved. Many policy gradient methods and techniques have been developed over the years, and studying all of them can be a challenge. In this section, we first introduce one of the best performers, Proximal Policy Optimization (PPO) [6], and then derive a formulation procedure that outlines the major path leading to it.
PPO maximizes the clipped surrogate objective [6]

$$L^{CLIP}(\hat{w}) = \hat{E}_t\left[\min\!\left(r_t(\hat{w})\,\hat{A}_t,\;\; \mathrm{clip}\!\left(r_t(\hat{w}),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right]$$

where
• ŵ is the vector of all parameters used for the policy model.
• Êt denotes an empirical expectation over time-steps.
• rt(ŵ) is the ratio of the probabilities under the current and old policies.
• Ât is the estimated advantage (the expected return of a state with its baseline subtracted).
• ε is a hyperparameter, usually set to 0.1 or 0.2.
The clip function is defined as

$$\mathrm{clip}(x,\, 1-\varepsilon,\, 1+\varepsilon) = \begin{cases} 1-\varepsilon & \text{if } x < 1-\varepsilon \\ 1+\varepsilon & \text{if } x > 1+\varepsilon \\ x & \text{otherwise} \end{cases} \tag{18.9}$$
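In code, the clipped surrogate objective can be evaluated per batch as in this small sketch (assumed notation of our own; the ratio and advantage arrays stand for r_t(ŵ) and Ât):

import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))  # empirical expectation E_t

print(ppo_clip_objective(np.array([1.3, 0.7]), np.array([1.0, -1.0])))

Taking the minimum with the clipped term removes any incentive for the optimizer to push r_t far outside the interval [1 − ε, 1 + ε].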
$$\max_{\hat{w}}\; E\Bigg[\underbrace{\sum_{t=0}^{T} R(s_t, a_t)}_{R(\tau)} \;\Bigg|\; \pi_{\hat{w}}\Bigg] \;\Rightarrow\; \max_{\hat{w}}\; E\!\left[R(\tau) \mid \pi_{\hat{w}}\right] \;\Rightarrow\; \max_{\hat{w}}\; \underbrace{\sum_{\tau} P(\tau;\, \hat{w})\, R(\tau)}_{U(\hat{w})} \tag{18.10}$$
To solve the above problem, we need to evaluate the gradient of U(ŵ) with respect to ŵ:

$$
\begin{aligned}
\nabla_{\hat{w}} U(\hat{w}) &= \nabla_{\hat{w}} \sum_{\tau} P(\tau;\, \hat{w})\, R(\tau) = \sum_{\tau} \nabla_{\hat{w}} P(\tau;\, \hat{w})\, R(\tau) \\
&= \sum_{\tau} P(\tau;\, \hat{w})\, \frac{\nabla_{\hat{w}} P(\tau;\, \hat{w})}{P(\tau;\, \hat{w})}\, R(\tau) \\
&= \sum_{\tau} P(\tau;\, \hat{w}) \left[\nabla_{\hat{w}} \log P(\tau;\, \hat{w})\right] R(\tau)
\end{aligned}
\tag{18.11}
$$
Using m sample paths under the policy πŵ, the above expectation can be approximated as a summation, and we obtain

$$\nabla_{\hat{w}} U(\hat{w}) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_{\hat{w}} \log P\big(\tau^{(i)};\, \hat{w}\big)\, R\big(\tau^{(i)}\big) \tag{18.12}$$
Note here that the gradient is applied only to P. Thus, R need not be smooth and can be discrete.
P → π

We are now ready to make the connection between the transition probability function P and the policy π, using the states and actions.
$$
\begin{aligned}
\nabla_{\hat{w}} \log P\big(\tau^{(i)};\, \hat{w}\big) &= \nabla_{\hat{w}} \log \Bigg[\prod_{t=0}^{T} \underbrace{P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}\big)}_{\text{transition probability}} \cdot \underbrace{\pi_{\hat{w}}\big(a_t^{(i)} \mid s_t^{(i)}\big)}_{\text{policy}}\Bigg] \\
&= \nabla_{\hat{w}} \left[\sum_{t=0}^{T} \log P\big(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}\big) + \sum_{t=0}^{T} \log \pi_{\hat{w}}\big(a_t^{(i)} \mid s_t^{(i)}\big)\right] \\
&= \nabla_{\hat{w}} \underbrace{\sum_{t=0}^{T} \log \pi_{\hat{w}}\big(a_t^{(i)} \mid s_t^{(i)}\big)}_{\text{policy only}}
\end{aligned}
\tag{18.13}
$$
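Combining Eqs. (18.12) and (18.13) yields the classic REINFORCE estimator. The following sketch applies it to a hypothetical one-step problem with a two-action softmax policy; this is entirely our own toy construction, for illustration only:

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                          # policy parameters (logits)

def pi(w):
    e = np.exp(w - w.max())
    return e / e.sum()                   # softmax action probabilities

# one-step episodes: action 0 yields reward 1, action 1 yields reward 0
for _ in range(2000):
    p = pi(w)
    grads, returns = [], []
    for _ in range(16):                  # m = 16 sampled paths
        a = rng.choice(2, p=p)
        R = 1.0 if a == 0 else 0.0
        g = -p.copy(); g[a] += 1.0       # grad of log softmax w.r.t. logits
        grads.append(g); returns.append(R)
    w += 0.1 * np.mean([g * R for g, R in zip(grads, returns)], axis=0)

print(pi(w))                             # probability of action 0 approaches 1

The probability of the rewarded action approaches 1, demonstrating gradient ascent on U(ŵ) using only sampled paths.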
R→A
The removal of the terms independent of the current action helps to lower the variance. Using Eqs. (18.10) and (18.14), our problem can now be re-cast in terms of the probability ratio between the current and old policies:

$$r_t(\hat{w}) = \frac{\pi_{\hat{w}}(a_t \mid s_t)}{\pi_{\hat{w}_{old}}(a_t \mid s_t)} \tag{18.18}$$
Consider a pole attached to a cart via a hinge joint. The cart can move along a frictionless one-dimensional track. Our goal is to keep the pole vertically upright by controlling the cart to move left or right (−1 or +1). When an RL agent learns to achieve this goal, it is placed in an environment that gives a reward of +1 at every time-step during which the pole stays upright. The episode ends when the pole inclines more than 15 degrees from the vertical position, or the cart moves more than 2.4 units away from the center. More details on this environment can be found at the OpenAI Gym.
This time we use the stable-baselines3 module to train the RL agent. Some code for such tasks has also been made public by philtabor and Nicholas Renotte on GitHub. The examples introduced below use stable-baselines3 (which requires installing the stable-baselines3 package).
import os
import gym
import numpy as np                      # needed import for the code below
from stable_baselines3 import PPO

Env_str = "CartPole-v0"                 # environment name (assumed restoration;
Env = gym.make(Env_str)                 # Env and Env_str are used below)
np.set_printoptions(precision=3)
# [cart position, cart velocity, pole angle, pole angular velocity]
print(f"Type of the observation_space:{Env.observation_space}")
print(f"A sample from the space:{Env.observation_space.sample()}")

Type of the observation_space:Box(4,)
A sample from the space:[-4.199e+00 -3.231e+38 2.745e-01 -1.295e+38]
-----------------------------
| time/ | |
| fps | 995 |
| iterations | 1 |
| time_elapsed | 2 |
| total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/ | |
| fps | 679 |
| iterations | 2 |
| time_elapsed | 6 |
| total_timesteps | 4096 |
| train/ | |
| approx_kl | 0.0033051167 |
| clip_fraction | 0.0277 |
| clip_range | 0.2 |
| entropy_loss | -0.305 |
| explained_variance | 0.123 |
| learning_rate | 0.0003 |
| loss | 196 |
| n_updates | 1240 |
| policy_gradient_loss | -0.00145 |
| value_loss | 169 |
------------------------------------------
......
------------------------------------------
| time/ | |
| fps | 625 |
| iterations | 25 |
| time_elapsed | 81 |
| total_timesteps | 51200 |
| train/ | |
| approx_kl | 0.0022214411 |
| clip_fraction | 0.0128 |
| clip_range | 0.2 |
| entropy_loss | -0.241 |
| explained_variance | 0.391 |
| learning_rate | 0.0003 |
| loss | 83.3 |
| n_updates | 1470 |
| policy_gradient_loss | -0.000732 |
| value_loss | 154 |
------------------------------------------
<stable_baselines3.ppo.ppo.PPO at 0x1aeda592518>
The training is quite fast: it took about three minutes on the author's laptop to train 200k steps.
model_path = os.path.join('Trained_Models', 'PPO_' + Env_str)
model.save(model_path)

Evaluating the trained model returns the mean and standard deviation of the episode reward:

(200.0, 0.0)
Env.close()
N_avg = 100
x, y = np.array([range(N_avg, len(rList))]).T, mv_avg(rList, N_avg)
plt.plot(rList, label="Current Reward")
plt.plot(x, y, label="Average reward over " + str(N_avg))
plt.title('Reward received, after 10k training')
plt.legend(loc='upper left')   # or 'lower right'
plt.savefig('images/' + Env_str + '_50k.png', dpi=300)
plt.show()
Figure 18.7: An agent trained with 10k steps on the CartPole-v0 environment.
Figure 18.8: An agent trained with 50k steps on the CartPole-v0 environment.
observed = Env.reset()
frames = []
j, jwSteps = 0, 100       # step counter and maximum steps over the state
while j < jwSteps:
    action, _states = model.predict(observed)
    observed, rewards, done, info = Env.step(action)
    j += 1                # count the step (added so the loop terminates)
One may also use an alternative algorithm, such as the deep Q-Network (DQN), to train the model. The code is as simple as follows.
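A minimal sketch (assumed, not the book's exact listing) using the stable-baselines3 API already imported above is:

from stable_baselines3 import DQN   # DQN supports discrete action spaces

model = DQN("MlpPolicy", Env, verbose=1)   # assumed policy choice
model.learn(total_timesteps=100_000)       # assumed step budget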
# (assumed restoration) switch to the CarRacing environment for this example:
Env_str = "CarRacing-v0"
Env = gym.make(Env_str)

print(f"action_space:{Env.action_space}")
print(f"A sample in action_space:{Env.action_space.sample()}")

action_space:Box(3,)
A sample in action_space:[0.5832531 0.9261694 0.27739018]

The action space is a three-dimensional box controlling the car: steering (left and right), gas, and brake.
print(f"observation_space:{Env.observation_space}")
print(f"A part of a sample in observation:"
      f"{Env.observation_space.sample()[0, 0:5, :]}")
observation_space:Box(96, 96, 3)
A part of a sample in observation:
[[139 246 145]
[ 18 144 155]
[ 95 226 46]
[215 70 49]
[105 88 1]]
log_path = os.path.join('Trained_Models','Logs_'+Env_str)
model_path = os.path.join('Trained_Models','PPO_500k'+Env_str)
model=PPO("CnnPolicy",Env,verbose=2,tensorboard_log=log_path)
Using cpu device
Wrapping the env with a 'Monitor' wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.
model.learn(total_timesteps=500_000)   # try smaller steps first
model.save(model_path)
| fps | 39 |
| iterations | 1 |
| time_elapsed | 51 |
| total_timesteps | 2048 |
---------------------------------
The following code shall produce animations of the CarRacing car running
on the track.
from stable_baselines3.common.evaluation import evaluate_policy  # needed import
evaluate_policy(model, Env, n_eval_episodes=2, render=True)
(-93.59801837056875, 0.11267495155334473)
episodes = 2   # 5
for episode in range(1, episodes + 1):
    s = Env.reset()   # initial state
    done = False
    rAll = 0
    frames = []
    while not done:
        Env.render()   # for in-window real-time rendering
        action, _ = model.predict(s.copy())
        s1, reward, done, info = Env.step(action)
        rAll += reward
        s = s1
    print(f"Episode {episode}: Score={rAll}")
We have video recorded the trained car using a screen recorder. The video plays when the following code is executed with the recorded files.
<IPython.core.display.Video object>
Our tests found that a not-well-trained car with few training steps (<20k) runs wildly. When trained for 500k steps, it runs reasonably well. The training took a number of hours on the author's laptop. Interested readers may run the above codes, or contact the author for these videos. Our tests also found that PPO is much more stable than the Q-learning algorithm.
18.9 Remarks
References
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press,
Cambridge, MA, 2018.
[2] C.J.C.H. Watkins, Learning from Delayed Rewards, PhD thesis, Cambridge, UK, May 1989. http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf.
[3] C.J.C.H. Watkins and P. Dayan, Q-learning, Machine Learning, 8, 279–292, 1992. https://doi.org/10.1007/BF00992698.
[4] Duan Shuyong, Zhang Linxin, G.R. Liu et al., A smoothed-shortcut Q-learning
algorithm for optimal robot agent path planning, Journal of Mechanical Engineering,
Accepted, 2021.
[5] R.E. Bellman, Dynamic Programming, Dover, 1957.
[6] J. Schulman, F. Wolski, P. Dhariwal et al., Proximal Policy Optimization Algorithms, CoRR, abs/1707.06347, 2017. http://arxiv.org/abs/1707.06347.
[7] V. Mnih, A.P. Badia, M. Mirza et al., Asynchronous Methods for Deep Reinforcement Learning, CoRR, abs/1602.01783, 2016. http://arxiv.org/abs/1602.01783.
[8] J. Schulman, S. Levine, P. Moritz et al., Trust Region Policy Optimization, CoRR, abs/1502.05477, 2015. http://arxiv.org/abs/1502.05477.