Fundamentals of Machine Learning using Python
Euan Russano and Elaine Ferreira Avelino
Arcler Press
2010 Winston Park Drive,
2nd Floor
Oakville, ON L6H 5R7
Canada
www.arclerpress.com
Tel: 001-289-291-7705
001-905-616-2116
Fax: 001-289-291-7601
Email: orders@arclereducation.com
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated, and copyright remains with the original owners. Copyright for images and other graphics remains with the original owners as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data. The authors, editors and publisher are not responsible for the accuracy of the information in the published chapters or for the consequences of its use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods or thoughts in the book. The authors, editors and publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify it.
Notice: Registered trademarks of products or corporate names are used only for explanation and identification, without intent of infringement.

Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
Chapter 9 Regularization ....................................................................................... 167
9.1. Regularized Linear Regression ........................................................ 171
15.2. Tensors.......................................................................................... 255
15.3. Computational Graph and Session ................................................ 255
15.4. Operating With Matrices............................................................... 256
15.5. Variables ....................................................................................... 260
15.6. Placeholders ................................................................................. 260
15.7. Ways Of Creating Tensors ............................................................. 262
15.8. Summary ...................................................................................... 265
Chapter 18 Project: Recognize Handwritten Digits Using Neural Networks ........... 293
18.1. Introduction .................................................................................. 294
18.2. Project Setup ................................................................................ 294
18.3. The Data ....................................................................................... 294
18.4. The Algorithm ............................................................................... 294
LIST OF FIGURES
Figure 5.7: Values of J for linear regression with fixed a0 = −3 and varying
a1 for the data in Table 5.2.
Figure 5.8: Value of learning rate α is too small, requiring too many iterations.
Figure 5.9: Value of learning rate α is reasonable, requiring few iterations and no oscillations.
Figure 5.10: Value of learning rate α is too high, producing relative divergence in the search for the minimum.
Figure 5.11: Example of local minima and global minima in a non-linear cost function.
Figure 5.12: Plot of dataset.
Figure 5.13: Dataset with prediction.
Figure 5.14: Test of plotJ function.
Figure 5.15: Optimum linear regression model using GD.
Figure 5.16: Evolution of optimization process in 2D plot.
Figure 5.17: Evolution of optimization process in cost function.
Figure 7.1: Scatter plot of profit x population data.
Figure 7.2: Plot of observed and predicted profit x population data.
Figure 7.3: Histogram of input feature 1.
Figure 7.4: Histogram of input feature 2.
Figure 7.5: Scatter plot of profit for multiple input example.
Figure 7.6: Plot of observed and predicted profit for multiple inputs example.
Figure 8.1: Dataset showing hours of study and exam results.
Figure 8.2: Dataset (points) and prediction (curve) showing hours of study and
exam results.
Figure 9.1: Stock price dataset.
Figure 13.3: Point and distance to nearest neighbor (k = 1).
Figure 13.4: Point and distance to nearest neighbors (k = 3).
Figure 14.1: Dataset to be clustered.
Figure 14.2: Clustering dataset with k = 2.
Figure 14.3: Clustering dataset with k = 3.
Figure 14.4: Clustering dataset with k = 4.
Figure 16.1: Hyperbolic tangent activation function.
Figure 16.2: Sigmoid activation function.
Figure 16.3: ReLU activation function.
Figure 16.4: Softplus activation function.
Figure 18.1: General neural network structure for MNIST dataset problem.
PREFACE
Machine learning is a field of study that has grown drastically in the last decades. Whenever one uses an Internet search engine or an automatic translation service, machine learning is present. Modern surveillance systems are equipped with subsystems that can detect and classify objects using machine-learning techniques. When predicting stock prices or general market features, economists use this same technique to obtain accurate results. These are just a few of the applications that machine learning has found in different areas.
This book is intended for undergraduate students and enthusiasts who are interested in starting to develop their own machine learning models and analyses. It provides a simple but concise introduction to the main knowledge streams necessary to get started in this area, even going a bit beyond the basics with an introduction to TensorFlow, a state-of-the-art framework for developing machine learning models.

Since most applications are currently developed in Python, the authors chose to focus on this programming language. The first two chapters provide a general introduction to machine learning and the Python programming language. Chapter 3 introduces how a machine-learning model can be developed from scratch using a story-based description.
In Chapter 4, the main concepts that appear when talking about machine learning are described. The following chapters go deeper into each modeling technique. Chapters 5, 6, and 7 deal with the linear regression model and the concepts of linear algebra needed to understand more complex techniques. Chapter 8 describes the logistic regression model, the equivalent of linear regression for classification. Chapter 9 introduces the concept of regularization, to later present neural networks in Chapters 10 and 11. Chapter 12 explains how decision trees work, with examples, as well as the development of random forests.

Chapter 13 shows how Principal Component Analysis (PCA) can be used to simplify a dataset, reducing its dimensions while still enabling one to retrieve information from it. Chapter 14 deals with classification problems by showing how the k-Nearest Neighbors model works, with an implementation in Python as for the previous models.
Chapters 15 and 16 present to the reader the state-of-the-art framework TensorFlow, a collection of functionalities that greatly enhances the development of machine learning models through efficient computations and intuitive programming.
Chapter 1 Introduction to Python
CONTENTS
1.1. What Is Python ................................................................................... 2
1.2. What Makes Python Suitable For Machine Learning? .......................... 4
1.3. What Are Other Computational Tools For Machine Learning? ............. 5
1.4. How To Obtain And Configure Python? .............................................. 9
1.5. Scientific Python Software Set ........................................................... 11
1.6. Modules ........................................................................................... 14
1.7. Notebooks ........................................................................................ 15
1.8. Variables And Types .......................................................................... 16
1.9. Operators And Comparison .............................................................. 18
d = add(a, b) # output is 3
All the commands above are completely legal in a weakly-typed programming language. However, in a strongly-typed language, they would raise an error, so the following modifications would be necessary.
c = concatenate(str(a), b) # first convert a to string, then concatenate; output is "12"
d = add(a, int(b)) # first convert b to an integer, then add; output is 3
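Python itself is dynamically but strongly typed; a minimal sketch of the same situation using built-in operators (the values are illustrative):
a = 1
b = '2'
# a + b would raise TypeError: unsupported operand type(s)
c = str(a) + b   # convert a to string first; result is '12'
d = a + int(b)   # convert b to an integer first; result is 3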
Python is an object-oriented programming language. In simple terms, this concept refers to a paradigm where a program is composed of objects containing information (data), or attributes, in the form of fields, and actions, or procedures, often referred to as methods. An object's procedures are able to read and write the information of the objects with which they are associated. This concept may or may not be explicitly incorporated in a program written in Python; implicitly it is always used, since any data type in Python is by definition an object.
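A minimal illustration that even the basic types are objects carrying methods:
x = 5
print(x.bit_length())   # 3 – an integer is an object with methods
print('spam'.upper())   # SPAM – so is a string literal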
Another important concept is that Python contains an automatic memory management system. This means that the program “knows” when to allocate or deallocate memory for variables and data arrays. This brings a great advantage, since memory usage is optimized automatically, whereas in languages such as C++ one needs to explicitly allocate and deallocate memory for each variable used in a program. If this process is not properly done in big programs, computer memory can be drained, which harms performance.
Python is a free-to-use programming language with a comprehensive
standard library and a continuously growing set of libraries for a variety
of applications being developed by the international community of Python
developers.
Other concepts that Python incorporates, as mentioned by Langtangen (2011):
• Cleanness and simplicity: The language was developed to be easy to read and intuitive, with a minimalist syntax. This characteristic greatly enhances the maintainability of projects as they grow in size.
• Expressive language: Fewer lines of code lead to fewer errors (bugs), which helps developers write programs faster, and such programs are easier to maintain.
• Interpreted language: Python code does not need to be compiled, because it can be directly read and run by the interpreter.
Lastly, another tool which finds its usage in Machine Learning is the C++ programming language. C++ is a statically-typed, compiled programming language. These features, among others, make C++ a highly efficient language regarding runtime. And as runtime is essential for complex problems in Machine Learning, C++ may be crucial in such cases. It is important to notice that Python libraries such as TensorFlow and Torch are implemented in C++ under the hood for computing efficiency, and many companies use C++ in their Machine Learning algorithms, though they may use Python or R for experimentation and prototyping. On the other hand, developing the same algorithm takes much longer in C++ than in Python, because of its complexity and difficulty of implementation. To test code in C++, one needs to compile it first and then run it. If an error or a bug is found, the developer needs to rewrite the code, recompile it and run it again to see if the problem is solved. In Python, once the code is written it can be run directly, so any bug or error can be corrected much faster, which makes the development of Machine Learning algorithms in C++ relatively slower than in Python. So the way to go would be to use Python for development, and later C++ for production, if performance is an issue.
Yegulalp (2019) mentions some other popular tools used for Machine Learning (some of them are C++ libraries):
• Shogun: Created in 1999 and written in C++, Shogun is a library for Machine Learning with native support for Microsoft Windows and the Scala language, and it can also be used with Java, Python, C#, Ruby, R, Lua, Octave and Matlab.
• Accord.Net Framework: A Machine Learning and signal processing framework for .Net, built as an extension of a previous project called AForge.net. The tool contains libraries for audio and image analysis and processing. The algorithms for vision processing can be used for face detection, stitching images together, and tracking moving objects. In addition to these tools, the framework also incorporates more traditional machine learning algorithms such as neural networks and decision-tree systems.
• Apache Mahout: A framework connected to Hadoop (https://hadoop.apache.org/), which can also be used for Hadoop projects that could be migrated to stand-alone applications. Its latest versions reinforced support for the Spark framework.
1.5.3. IDLE
According to the Python official website, IDLE stands for Integrated Development and Learning Environment. It is “integrated” because it comes bundled with the Python programming language when downloading it from the official website.
1.5.5. Spyder
According to Johansson (2014), Spyder is an IDE for scientific computing with similarities to the MATLAB user interface. It incorporates many advantages of traditional IDE environments, such as a single view for code editing, execution and debugging through an arrangement of all windows. Different calculations or problems can be organized as projects in the IDE environment. The author mentions the following advantages of the Spyder IDE:
• A powerful code editor with syntax highlighting, dynamic code introspection and integration of the Python debugger;
• A console window, variable editor and history window;
• Documentation and help completely integrated with the IDE.
1.6. MODULES
Any Python program file (.py) can properly be called a module as per Python standards. However, in this section we refer to modules as the sets of functionalities made available to Python through the program files distributed as packages. For instance, the Python Standard Library contains a great set of modules for cross-platform implementation of functionalities such as operating system commands, file input and output, string management, and network communication, among others.
To use a module, you need to import it. For instance, to use the math module, which contains a consistent collection of standard mathematical functionalities, type the following command in a .py file.
import math
As an example, one can create a program that calculates the sine of a certain angle by typing the following program:
import math
sin_angle = math.sin(math.pi/2)
print(sin_angle)
Notice that math.sin is the sine function contained in the math module, and math.pi holds the value of π stored in the math module.
Another option would be to explicitly import all functionalities of math
module and use them without the need of referring to the module at each
call of a command contained on the module.
from math import *
sin_angle = sin(pi/2)
print(sin_angle)
Here it is no longer explicit that the function sin and the constant pi are both functionalities contained in the math module. Though this syntax makes the code more compact, it removes the benefit of knowing where each functionality came from. In large problems, where many modules may be imported, it can become confusing to know where a certain functionality came from.
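A common middle ground, keeping the code compact while preserving clarity, is to import only the names actually needed:
from math import sin, pi
sin_angle = sin(pi/2)
print(sin_angle)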
1.7. NOTEBOOKS
According to the Jupyter Notebook Quick Start Guide, notebooks are documents generated by the Jupyter Notebook App. The latter is a server-client application that allows the editing and execution of notebooks through a web browser. No internet connection is required, since Jupyter runs locally (as a local server), though it can also be used through the internet if installed on a remote server.

Notebooks may contain code in a variety of languages, regular text (such as the one you are reading), equations, figures, links, tables, etc. Code and rich text elements are separated throughout a notebook by cells, which is the reason why this environment is called cell-based. Notebooks are not only pieces of code to be run by a computer, but human-readable documents incorporating the description of an analysis, its results, discussion, and almost any other element that one could find in a regular document, with the peculiarity of containing live code and its output (if the notebook was executed). These features make notebooks a very useful tool for data analysis and Machine Learning.
The Jupyter Notebook app can be started by running the following command in the Anaconda prompt or in a terminal.
$ jupyter notebook
This command opens a new browser window (or tab, depending on the default browser of your computer) and shows the Notebook Dashboard, a sort of control panel listing local files and notebooks, which allows the user to open them or to shut down their kernels. A kernel is a computational engine responsible for executing the code stored in a notebook.
1.8. VARIABLES AND TYPES
Variable names cannot be any of Python's reserved keywords: and, as, assert, break, class, continue, def, del, elif, else, except, exec, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield, True, False, None.
A variable is always associated with a type. Because Python is a dynamically-typed language, one does not need to worry much about the data type of a variable, as it can even change during the program without any problem. Still, it is useful to understand the different data types so as to use them more efficiently and effectively. The basic data types in Python follow:
Integer:
x = 1234
print(type(x))
Out: <class 'int'>
Floating-point number:
x = 1234.0
print(type(x))
Out: <class 'float'>
Complex number:
x = 1234.0 + 1.0j
print(type(x))
Out: <class 'complex'>
Strings:
x = 'Machine Learning'
print(type(x))
Out: <class 'str'>
Boolean type:
x = True
y = False
print(type(y))
Out: <class 'bool'>
1.9. OPERATORS AND COMPARISON
The basic arithmetic operators in Python are:

Symbol   Operation
+        sum
-        subtraction
*        multiplication
/        division
**       power
%        modulus
//       integer division
Examples:
Sum
x = 1.0
y = 2.0
print('x + y = ', x + y)
Out: x + y = 3.0
Subtraction
x = 1.0
y = 2.0
print('x - y = ', x - y)
Out: x - y = -1.0
Division
x = 1.0
y = 2.0
print('x/y = ', x/y)
Out: x/y = 0.5
Integer division
print('x//y = ', x//y)
Out: x//y = 0.0
Note that, beginning with Python 3.x, the / operator always performs a floating-point division. However, in Python 2.x, using the / operator with two integers results in an integer division, while if at least one of the operands is a floating-point number, then the result is a floating-point division. For instance, 1/2 = 0.5 (float) in Python 3.x, but 1/2 = 0 (integer) in Python 2.x.
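A quick check of this behavior in Python 3:
print(1/2)     # 0.5 (floating-point division)
print(1//2)    # 0 (integer division)
print(1.0//2)  # 0.0 (floor division with a float operand returns a float)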
The Boolean operators are:
• and: if both operands are true, the result is true; otherwise it is false;
• not: the opposite, i.e., if a variable is true, then “not the variable” is false;
• or: if any of the operands is true, the result is true; otherwise it is false.
Examples:
x = True
y = False
print(x and y)
Out: False
#------------------------
x = True
y = True
print(x and y)
Out: True
#------------------------
x = True
y = False
20 Fundamentals of Machine Learning using Python
print(x or y)
Out: True
#------------------------
x = True
y = False
print(not x)
Out: False
#------------------------
Chapter 2 Computing Things with Python
CONTENTS
2.1. Formatting And Printing to Screen ..................................................... 22
2.2. Lists, Dictionaries, Tuples, And Sets................................................... 25
2.3. Handling Files .................................................................................. 32
2.4. Exercises ........................................................................................... 34
2.5. Python Statements ............................................................................ 36
2.6. For And While Loops ........................................................................ 38
2.7. Basic Python Operators .................................................................... 44
2.8. Functions .......................................................................................... 47
2.1. FORMATTING AND PRINTING TO SCREEN
# example values (assumed; the lines defining them fall outside this excerpt)
name = 'John'
age = 10

# string concatenation
'Last night, ' + name + ' met a friend after ' + str(age) + ' years.'
# string formatting
f'Last night, {name} met a friend after {age} years.'
In Python 3.x, there are three ways to format a string:
• using the % placeholder (the “classical” method);
• using the .format() string method;
• from Python 3.6 onward, using formatted string literals, also known as f-strings.
A side-by-side sketch of the three methods follows.
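The three lines below produce the same sentence (the example values are illustrative):
name = 'John'
age = 30
print('My name is %s and I am %d years old.' % (name, age))       # % placeholder
print('My name is {} and I am {} years old.'.format(name, age))   # .format() method
print(f'My name is {name} and I am {age} years old.')             # f-string (Python 3.6+)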
There are two conversion methods: str() and repr(). The main difference between them is that repr() returns the string representation of the object, explicitly showing quotation marks and special characters, such as escape characters.
print('The name of my friend is %s.' % 'John')
print('The name of my friend is %r.' % 'John')
The name of my friend is John.
The name of my friend is 'John'.
Notice how the tab character \t adds a tab space in the first string, and is literally printed in the second string.
print('This text contains %s.' % 'a \tbig space')
print('This text contains %r.' % 'a \tbig space')
This text contains a 	big space.
This text contains 'a \tbig space'.
The %d operator is used when dealing with integers (or the integer part of a floating-point number, without rounding).
print('My weight is currently %s kg.' % 71.95)
print('My weight is currently %d kg.' % 71.95)
My weight is currently 71.95 kg.
My weight is currently 71 kg.
To work with floating-point numbers, one should use the %f operator. In this case, the operator is preceded by a number, a dot and a second number (e.g., %3.4f). The first number indicates the minimum number of characters the string should contain; if the number does not occupy the complete space, it is padded with whitespace. The second part (e.g., .4f) indicates how many decimal places should be printed (numbers after the decimal point).
print('Floating-point numbers: %5.2f' % (312.984))
Floating-point numbers: 312.98
print('Floating-point numbers: %1.0f' % (312.984))
Floating-point numbers: 313
print('Floating-point numbers: %10.2f' % (312.984))
Floating-point numbers:     312.98
Groceries | Quantity
Milk      | 3.0
Apples    | 10.0
More information on how to use the .format() method can be found in the official Python documentation: https://docs.python.org/3/library/string.html#formatstrings
2.2. LISTS, DICTIONARIES, TUPLES, AND SETS

2.2.1. Lists
Lists are constructed by using square brackets [] and putting inside them the elements they should contain. These may be numbers, strings, other lists; in summary, any Python object can be contained inside a list.
# Assign a list to the variable mylist
mylist = [0, 1, 2]
Notice that each element inside the list is separated by comma. Another
example consists in mixing different types of elements in the same list.
# Assign a list to the variable mylist
mylist = ['a string example', 1.2, [1, 2, 3]]
The function len() can be used to check the length of a list, i.e., the
number of elements contained in the list (note that a list contained inside
another list counts as one element).
len(mylist)
3
An element of the list can be retrieved using indexing. For instance, the
first element of the list is stored at index 0, the second is at index 1, etc.
# Reference to the element 0
mylist[0]
'a string example'
One can also refer to multiple elements by using the colon (:) operator. For instance, one can refer to the 2nd element and all the following ones using the syntax [1:].
# Reference to the element 1 and the ones following
mylist[1:]
[1.2, [1, 2, 3]]
Similarly, referring from the first element up to index 2 can be done as
follows.
# Reference up to index 2
mylist[:2]
['a string example', 1.2]
Lists can be concatenated by using the addition symbol (+) and assigning the result to a new variable (or to the variable itself, to modify it).
# Concatenate mylist with the list [4] (a list of one element) and assign to newlist
newlist = mylist + [4]
newlist
['a string example', 1.2, [1, 2, 3], 4]
A list can be duplicated or repeated n times by using the multiplication sign
(*).
mylist*3
['a string example', 1.2, [1, 2, 3], 'a string example', 1.2, [1, 2, 3], 'a string example', 1.2, [1, 2, 3]]
The sort() method orders the elements of a list in place. For instance, strings are sorted alphabetically (with uppercase letters coming before lowercase ones):
mylist = ['dog', 'zeus', 'corn', 'Height']
mylist.sort()
print('mylist = ', mylist)
mylist = ['Height', 'corn', 'dog', 'zeus']
Matrices can be created by nesting lists. For instance, a matrix containing
2 rows and 3 columns is declared by using 2 nested lists in a superior list,
each one with 3 elements.
mymatrix = [[1, 2, 3], [0, 1, 3]]
In this case, a reference to a complete row is done using one index.
mymatrix[0]
[1, 2, 3]
The selection of one element is done using two indices, the first one to
which nested list (row) and the second to the element inside the nested list
(column).
mymatrix[0][0]
1
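Unlike a row, a column has no single index in this representation; its elements must be collected one by one. A small sketch:
first_column = []
for row in mymatrix:
    first_column.append(row[0])
print(first_column)
[1, 0]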
2.2.2. Dictionaries
Dictionaries are data structures similar to lists, but with element mapping. A mapping associates each element stored in the dictionary with a key, which differs from a list, where each element is associated with an index (a number). A mapping has no inherent order, since its objects are defined by keys rather than by position. As with lists, the element stored under a key can be any Python object, from numbers to strings and other dictionaries, lists, etc.

A dictionary is constructed by using curly brackets {} and defining each key and its associated value, separated by the colon operator (:).
mydict = {'key1': value1, 'key2': value2, 'key3': value3}
# A value can be referred to by using its key
mydict['key2']
value2
As already mentioned, dictionaries can hold any type of Python object, similarly to Python lists. This includes other dictionaries, nested to any depth. For instance (the definition below is reconstructed from the access that follows):
nested = {'key1': {'nested1': {'subnested1': value1}}}
nested['key1']['nested1']['subnested1']
value1
2.2.3. Tuples
Tuples can easily be related to lists, with the exception that they are immutable, i.e., their elements cannot be changed, appended or popped. They are normally used to represent data which is not supposed to change, such as calendar days, days of the week, etc.

Tuples are built using round parentheses (), separating elements with commas.
mytuple = (4, 5, 6)
Similar to lists, the size of a tuple can be verified with the len() function. A tuple can be created with elements of different data types, and indexing works in the same way as for lists.
mytuple = ('string example', [5.5, 1, 2], 6)
print('size of tuple = ', len(mytuple))
print('0-th index element = ', mytuple[0])
size of tuple = 3
0-th index element = string example
Tuples have only two methods: count(), which counts the occurrences of an element, and index(), which returns the position of an element. For instance:
mytuple.count(6)
1
2.2.3.2. Immutability
As already mentioned, tuples are immutable objects, i.e., they cannot be changed in any way. This includes not only adding or removing elements, but also changing one or more elements already stored in the tuple. For instance, trying to set one element of the tuple gives an error.
mytuple[1] = 3.2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
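If a change is really needed, the usual workaround is to build a brand new tuple from a temporary list; a minimal sketch:
aslist = list(mytuple)   # convert to a (mutable) list
aslist[1] = 3.2          # the change is allowed on the list
mytuple = tuple(aslist)  # build a new tuple from it
print(mytuple)
('string example', 3.2, 6)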
2.2.4. Sets
A set is a special data container in Python which contains only unique elements, i.e., there is no repetition of elements in a set. It can be constructed using set().
myset = set()
Elements are added to a set using the .add() method. Trying to add the same element more than once does not raise an error, but the element is not added again, since a set can only contain unique elements.
# add the number 0 to the set
myset.add(0)
print(myset)
{0}
# add the number 10 to the set
myset.add(10)
print(myset)
{0, 10}
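As a sketch of the behavior described above, adding the number 0 again leaves the set unchanged (and raises no error):
myset.add(0)
print(myset)
{0, 10}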
2.3. HANDLING FILES
A file can be opened with the open() function, passing the file path.
# Absolute path
myfile = open('C:\\Data\\sample.txt')
It is good practice to use a double backslash, so Python does not interpret the backslash as an escape character.
File reading can be done through the read() method, which reads the entire content of the file and stores it as a string.
myfile.read()
'The content of the file is here'
Trying to read again will yield no result, since the cursor has reached the end of the file.
myfile.read()
''
To go back to the beginning of the file, use the seek(pos) method, where pos is an integer indicating the position to place the cursor (0 is the beginning of the file). To read each line and store it in a list, use the readlines() method instead of read().
myfile.seek(0)
myfile.readlines()
['The content of the file is here']
Always remember, when finished working with the file, to close it and release memory using the close() method.
myfile.close()
To open a file and write to it, use the flag 'w' in open(filename, flag). Note that using this flag removes all previous content of the file. To append to the end of the file (instead of cleaning it), use the 'a' flag. For reading (the default), one can explicitly use the 'r' flag.
# open the file in write mode. All the previous content is erased.
myfile = open('sample.txt', 'w')
# open the file in append mode. The content is preserved and new content is appended to the end of the file.
myfile = open('sample.txt', 'a')
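A minimal sketch of actually writing to the file (the text written is illustrative):
myfile = open('sample.txt', 'w')
myfile.write('The content of the file is here')  # write a string to the file
myfile.close()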
34 Fundamentals of Machine Learning using Python
Another way of working with files is by using the with block. In this way, the file only stays open inside the block, and it is closed automatically once the program exits the with block.
# open the file in read (default) mode and name the handler as f.
with open('sample.txt') as f:
    # iterate through each line of the file
    for line in f:
        print(line)  # print the complete line

var = 2  # here the file f is already closed, since the program has exited the with block
2.4. EXERCISES
1) Using Python code, write an equation that uses addition and subtraction and is equal to –80.12.
2) Write another equation that, using multiplication, division and exponentiation, is equal to –80.12.
3) Consider the expression 4 + 0.5 + 2. What is the type of the result?
4) How can one determine the square root of a number, say 9?
5) How can one determine the square of a number, say 2?
6) Consider the string “Mary.” How can one choose the letter “a”
from this string?
7) Through slicing, reverse the string “Mary.”
8) Show 2 different methods of printing the letter “y” from “Mary.”
9) Show 2 different methods of generating the list [1, 1, 1].
10) Change the last element of the list [1, 1, 1] to be equal 10.
11) Reverse the list [1, 2, 3, 4]
12) Sort the list [1, 4, 2, 3]
13) Let mydict = {‘name’: ‘Julius’, ‘age’: 20, ‘hobby’:’painting’}
Through key indexing, print the element stored in the key hobby.
14) Do the same as above for the following object mydict = {‘name’:
{‘firstName’: ‘Julius’, ‘lastName’:’Smith’}, ‘habits’:{‘work’:
‘lawyer’, ‘hobby’: ‘painting’}}
15) Print the word “bye” from the following dictionary newdict =
{‘key1’: [{‘nested_key’: [‘going further’, [‘bye’]] } ]}
16) Can a dictionary be sorted?
17) What is the main difference between tuples and lists?
18) Use a proper data structure to select only the unique elements of
the list list2sort = [100, 90, 55, 80, 65, 73, 20]
19) State the Boolean result of the following code: 3 < 2.
2.4.1. Solutions
1) There are infinite possibilities to solve this problem. One possible answer is –20.06 – 70.06 + 10.
2) There are infinite possibilities to solve this problem. One possible answer is –1201.8/30*2.
3) float
4) Using a fractional power: 9**(1/2) or 9**0.5
5) Using the power operator: 2**2
6) 'Mary'[1] or mystring = 'Mary'; mystring[1]
7) 'Mary'[::-1] or mystring = 'Mary'; mystring[::-1]
8) Consider mystring = 'Mary'. Method 1: mystring[3] or Method 2: mystring[-1]
9) Method 1: [1, 1, 1] or Method 2: [1]*3
10) mylist = [1, 1, 1]; mylist[-1] = 10
11) mylist = [1, 2, 3, 4]; mylist.reverse()
12) mylist = [1, 4, 2, 3]; mylist.sort()
13) mydict[‘hobby’]
14) mydict[‘habits’][‘hobby’]
15) newdict[‘key1’][0][‘nested_key’][1]
16) A dictionary cannot be sorted, since it consists of a key-value mapping and not a sequence.
17) A tuple is immutable, a list is mutable.
18) set(list2sort)
19) False
2.5. PYTHON STATEMENTS
An if/elif/else block has the following general structure:
if (condition1):
    code in block
    code in block
elif (condition2):
    code in block
    code in block
else:
    code in block
    code in block
x = False  # assumed value, consistent with the output below

if x:
    print('This is printed if x is True.')
else:
    print('This is printed if x is False.')

This is printed if x is False.
The elif block can be used following an if block to add other conditions to be evaluated, rather than using the else block directly.
state = 'Alabama'
if state == 'California':
    print('The state is California.')
elif state == 'Alabama':
    print('The state is Alabama.')
else:
    print('The state is any other but not California or Alabama.')

The state is Alabama.
One can use as many elif blocks as necessary after an if block. The first block whose condition evaluates to True will be run, and execution then continues after the whole if block (all remaining elif or else blocks are skipped).
2.6. FOR AND WHILE LOOPS
For instance, the following loop (the list definition is reconstructed from the output below) checks the first letter of each animal name:
for item in ['dog', 'cat', 'dolphin', 'chicken', 'tiger']:
    if item[0] == 'd':
        print(item + ' starts with d.')
    elif item[0] == 'c':
        print(item + ' starts with c.')
    else:
        print(item + ' starts with any other letter than d or c.')

dog starts with d.
cat starts with c.
dolphin starts with d.
chicken starts with c.
tiger starts with any other letter than d or c.
A for loop can be used to create a cumulative sum or a cumulative multiplication.
mylist = list(range(100))
cum_sum = 0
for item in mylist:
    cum_sum += item
cum_sum
4950
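An analogous sketch for the cumulative multiplication mentioned above (the list is illustrative):
mylist = [1, 2, 3, 4]
cum_prod = 1
for item in mylist:
    cum_prod *= item
cum_prod
24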
A for loop can also be used to access each character of a string, since a string is actually a sequence of characters (the loop is reconstructed; note that the spaces are printed as blank lines).
mystring = 'The dog is black'
for character in mystring:
    print(character)

T
h
e

d
o
g

i
s

b
l
a
c
k
Looping over a dictionary returns its keys. For instance (the dictionary definition is reconstructed from the outputs below):
mydict = {'key1': 'dog', 'key2': 'cat', 'key3': 'monkey'}
for key in mydict:
    print(key)

key1
key2
key3
Note that the keys will not necessarily be returned in a sorted order. To retrieve the values stored under each key, use the .values() method.
for item in mydict.values():
    print(item)

dog
cat
monkey
To retrieve both the keys and the values, dictionary unpacking can be done using the .items() method.
for key_, value_ in mydict.items():
    print(key_)
    print(value_)

key1
dog
key2
cat
key3
monkey
A while loop is similar to a for loop, with the difference that the code inside the while block runs as long as a certain condition is True. This type of block must be handled carefully, since infinite loops may be generated if the condition to exit is never changed to False.
while (condition):
    run code
else:
    final code
In the following example, the code runs until the variable x reaches the value 10; then the condition is changed to False in order to exit the while loop.
condition = True
x = 0
while condition:
    print('The value of x is ', x)
    if x >= 10:
        condition = False
    x += 1
The value of x is 0
The value of x is 1
The value of x is 2
The value of x is 3
The value of x is 4
The value of x is 5
The value of x is 6
The value of x is 7
The value of x is 8
The value of x is 9
The value of x is 10
The commands break, continue and pass can be used to add functionality and deal with specific cases. The break statement exits the current (closest) loop.
x = 0
while True:
    print('The value of x is ', x)
    if x >= 10:
        break
    x += 1
The value of x is 0
The value of x is 1
The value of x is 2
The value of x is 3
The value of x is 4
The value of x is 5
The value of x is 6
The value of x is 7
The value of x is 8
The value of x is 9
The value of x is 10
The continue statement skips the rest of the current loop iteration and moves on to the next one. In the example below (the initialization of x is reconstructed), the break check is skipped whenever x is even:
x = 0
while True:
    print('The value of x is ', x)
    x += 1
    if x%2 == 0:
        continue
    if x >= 10:
        break
The value of x is 0
The value of x is 1
The value of x is 2
The value of x is 3
The value of x is 4
The value of x is 5
The value of x is 6
The value of x is 7
The value of x is 8
The value of x is 9
The value of x is 10
The pass statement does not perform any specific operation. Instead, it is used as a placeholder in a code block, since a code block in Python must contain at least one line of code.
x = 0
while True:
    print('The value of x is ', x)
    x += 1
    if x%2 == 0:
        pass
    else:
        x += 1
    if x >= 10:
        break
The value of x is 0
The value of x is 2
The value of x is 4
The value of x is 6
The value of x is 8
• zip
This function helps to iterate over two lists at the same time, by “zipping” (packing) them together, pairing the elements of one list with the elements of the other, much as enumerate pairs indices with values.
list1 = [10, 20, 30, 40]
list2 = ['a', 'b', 'c', 'd']
list(zip(list1, list2))
[(10, 'a'), (20, 'b'), (30, 'c'), (40, 'd')]
Note that both lists should have the same size; extra elements in the larger list are ignored when performing the zip function.
list1 = [10, 20, 30, 40, 50]  # this list has 5 elements (50 is the last one)
list2 = ['a', 'b', 'c', 'd']  # this list has 4 elements
# when doing zip, the number 50 in list1 is ignored since it is considered an extra element.
list(zip(list1, list2))
[(10, 'a'), (20, 'b'), (30, 'c'), (40, 'd')]
The zip generator can be used in a for loop, similarly to enumerate, to iterate over two lists at the same time.
for value1, value2 in zip(list1, list2):
    print(f'The value extracted from list1 is {value1}, and from list2 is {value2}.')

The value extracted from list1 is 10, and from list2 is a.
The value extracted from list1 is 20, and from list2 is b.
The value extracted from list1 is 30, and from list2 is c.
The value extracted from list1 is 40, and from list2 is d.
• in operator
The in operator checks whether a value is contained in a sequence, returning True or False (for instance, 3 in [1, 2, 3] gives True).
• input
The input command is used to capture input from the user when running the program. It waits for this input before continuing to run. The input is stored as a string, and may be converted to other types by using int() or float() to convert it into an integer or floating-point number, respectively.
name = input('What is your name? ')
print('Welcome ' + name + '!')
age = input('What is your age? ')
future_age = int(age) + 10
print('In 10 years you will be ' + str(future_age) + ' years old.')
What is your name? John
Welcome John!
What is your age? 20
In 10 years you will be 30 years old.
2.8. FUNCTIONS
Similar to mathematics, functions in Python are a set of instructions used to perform a certain operation and return something (though in Python they will not always return something explicitly). They are especially useful for commands which are supposed to be reused, i.e., used multiple times. In this way, a function avoids repetition and improves the readability of the code.
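As a simple example of a function that returns a value and checks its input type (the function name and the doubling operation are illustrative assumptions):
def double_if_number(x):
    '''Return twice x if x is a number; otherwise report a wrong data type.'''
    y = None
    if isinstance(x, (int, float)):
        y = 2*x
    else:
        print('Wrong data type.')
    return y

print(double_if_number(4))       # 8
print(double_if_number('four'))  # prints 'Wrong data type.' and returns None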
A function may not explicitly return anything. For instance, the following example contains a function which prints a statement on the screen. Notice that no value is returned (the keyword return is not used).
def print_animal_size(animal, size):
    '''
    print_animal_size(animal, size):

    This function prints a sentence on the screen evaluating if the animal is
    more than 1 meter tall.

    Inputs:
        animal (string) – animal name;
        size (float) – the animal size in meters
    '''
    if size >= 1:
        print(f'The {animal} size is greater than or equal to 1 meter.')
    else:
        print(f'The {animal} size is less than 1 meter.')

print_animal_size('leopard', 0.8)
print_animal_size('elephant', 3.2)

The leopard size is less than 1 meter.
The elephant size is greater than or equal to 1 meter.
Chapter 3 A General Outlook on Machine Learning
This chapter will give the reader a general notion of what Machine Learning is for and what types of problems it can generally solve. Here we do not stick to the formalism inherent to a mathematical field; rather, the intention of this chapter is to spark the reader's curiosity and give a general idea of the concepts used in Machine Learning to solve everyday problems. To make it more concrete, we introduce a character called Paul, a regular guy with some knowledge of programming but nothing special, of ordinary intelligence, but with a curiosity about how to automate things.
Paul wakes up at 6:00 am, and it is already a beautiful day outside his small apartment in Amsterdam. As usual, he grabs a cup of hot coffee with some bread and checks his e-mail before leaving for work. His inbox contains the following list of e-mails:
Sender Subject
Lotery1000 You are out 1000th winner!
Amazing Credit Card Sign up now and start shopping!
Adams, Mary Meeting with clients from Japan
Se Manda Cheques Ahora puedes nos contactar
Cordell Jean Les nouvelles qui vous intéressent
Hill, John Request for document
Looking at these first e-mails, Paul takes a deep breath and starts his
morning process of cleaning his inbox, checking the important e-mails and
at his seat, just before he notices how late he is for work, so, almost falling, he sprints to the shower and leaves his home like a thunderbolt.
When he arrives at his job, Mary Adams (the one from the e-mail; we will call her Mary from now on) approaches Paul and kindly asks:
“Hey Paul, good morning! Have you read the e-mail I sent you?”
Paul asks:
“What e-mail?”
And almost at the same time as he is speaking, he remembers the e-mail Mary sent him, which he didn't have time to check because he was too involved in removing the spam. At this moment, Mary gets visibly angry, but trying hard to control herself, she says:
“Well, you should have received it! These clients from Japan are very valuable and they have a problem that needs a solution now! Read the e-mail and give me a position as soon as possible!”
“Ok…”
Paul arrives at his office with another cup of coffee, opens the laptop and checks Mary's e-mail.
“Dear collaborators,
A representative of our client in China, Xing Bank, is visiting us this week. He has a request, and is willing to offer a fine payment, to have an automated system for credit card approval. At the current moment, this system consists of the following steps:
• A client who wants a credit card from Xing Bank must fill out a form at one of our institutions with information about his employment situation, salary level, age, etc.
• This form is sent online to a set of experts who collect additional information about the client, such as financial history, relationship with other banks, etc.
• The experts analyze all this information and come up with the result: approval or disapproval of the credit card submission.
• The result is written in the form and sent back to the head office for emission of the card. An e-mail is sent to the client stating the situation (approval or disapproval).
According to the representative of Xing Bank, the entire process from
as he would normally do, and say to the machine: OK, now you have a set of e-mails which are spam and some which are not; learn how to classify them! Or, in the case of the credit card approval, he would need to give historical data of clients and the results (approval or not) to the machine and say: This is the data and how it behaves; learn, and tell me if I should approve or not a new client's credit card.
Paul thinks about all of these ideas, and he believes he can manage to help Xing Bank build their automatic system using Machine Learning. But first he wants to test the concept by classifying his own e-mails. He arrives at the following final set of attributes:
Result: Is spam (1) or is not spam (0)
Criteria 0: English or Dutch (0) or any other language (1)
Criteria 1: Sent during the day (0) or at night (1)
Criteria 2: I know the person (0) or I don't know the person (1)
Criteria 3: Contains any of the words lottery, prize, shopping, discount (1) or it doesn't (0)
Using this set of criteria, Paul goes over his e-mails again and starts classifying them. He gets a piece of paper and writes: Email 1: It is in English (0), it was sent during the day (0), I don't know the person (1), but it doesn't contain “wrong” words (0). Actually, the e-mail is not spam but appears to be from a new client (0), so he writes:
Email1 -> 0, 0, 1, 0
Result1 -> 0
Paul notices that what he has written is very similar to what he would do using a programming language. As he knows some Python, he moves from the piece of paper to the computer, opens a notepad and writes again, using Python syntax.
Email1 = [0, 0, 1, 0]
Result1 = 0
Then he goes to the second e-mail. This time, the e-mail is apparently in Spanish (1), sent during the day (0) by an unknown person (1) and containing “wrong” words (1), so Paul writes:
Email2 = [1, 0, 1, 1]
Result2 = 1
Paul looks at the structure and sees that the variables Result1 and Result2 can be eliminated by naming the e-mails as spam or not spam, so that is what he does with the variables Email1 and Email2:
NotSpam1 = [0, 0, 1, 0]
Spam1 = [1, 0, 1, 1]
The next e-mail is in English, sent by a known person during the day and containing “wrong” words, so:
NotSpam2 = [0, 0, 0, 1]
Getting excited with this job, Paul classifies the last 6 e-mails in his inbox and comes to the following set of data.
NotSpam1 = [0, 0, 1, 0]
NotSpam2 = [0, 0, 0, 1]
NotSpam3 = [0, 1, 0, 0]
Spam1 = [1, 0, 1, 1]
Spam2 = [1, 1, 0, 1]
Spam3 = [0, 0, 1, 1]
To have all the e-mails collected in a single variable, Paul creates a Data list holding all of them:
Data = [NotSpam1, NotSpam2, NotSpam3, Spam1, Spam2, Spam3]
At this point, he remembers that computers cannot read, i.e., the machine does not know that an e-mail is spam just because the variable name is Spam1, so he creates a variable Marking which stores whether each e-mail is spam (1) or not (0):
Marking = [0, 0, 0, 1, 1, 1]
Paul wants to know how he can now give this information to the machine, so it can learn and start classifying his e-mails. But what if the machine starts classifying incorrectly, even having learned from this data? It would be safer to test the algorithm, which means to train it and then give it a new point whose class he already knows (spam or not), and see if the machine classifies it correctly. To do that, Paul gets another e-mail and stores it as MisteryEmail.
MisteryEmail = [1, 0, 0, 0]
Finally, Paul arrives at the Machine Learning phase. Now it is time to start the actual process he has dreamed of so much: no more wasting time classifying e-mails in the morning! Googling, Paul finds very useful information on how to build classification systems.

One basic algorithm for classification is called k-Nearest Neighbors (KNN). Paul finds that, by using a functionality in Python, he does not need to implement the algorithm from scratch; rather, he can directly test it on his problem. This is done by importing the library sklearn and the submodule neighbors using the following line of code.
from sklearn.neighbors import KNeighborsClassifier
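Before checking predictions, the classifier must be created and trained on the labeled data; a minimal sketch, assuming k = 3 (the exact number of neighbors used here is an assumption):
neigh = KNeighborsClassifier(n_neighbors=3)  # assumed value of k
neigh.fit(Data, Marking)                     # train on the labeled e-mails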
Paul trains the classifier and wants to check how well the model predicts the data which was used for the training itself. After looking on the internet, he finds that a nice table can be printed using the following code.
for i in range(len(Data)):
    print('Predicted: ', neigh.predict([Data[i]])[0], ' Real: ', Marking[i])
Out:
Predicted: 0 Real: 0
Predicted: 0 Real: 0
Predicted: 0 Real: 0
Predicted: 1 Real: 1
Predicted: 0 Real: 1
Predicted: 1 Real: 1
From the training data, Paul notices that only one value was wrongly predicted: the system said it is not spam, while it is. Paul does not let himself get discouraged by these small errors, as he knows that a perfect model is very rare, and almost impossible in some situations. Instead, he realizes that he has much to learn about Machine Learning before building a working system to solve real-life problems.
Paul also notices that the same system he built could be intuitively extended to other applications. For example, what if, instead of e-mails, one had letters? What if it were clients who pay or don't pay? Actually, the same concept applied in the example above can be extended to a variety of classification problems. In summary, the steps used in the case above were:
• For each element (or record), there was a series of features (characteristics), e.g., Data = [[1,0,0],[0,1,0]], where the set [1,0,0] is a record and the numbers inside represent features.
• Each feature (characteristic) has a meaning, possibly physical but also mental or virtual.
• If the features correspond to classes, they are represented by integers. They may also be continuous (floating-point numbers).
• The model is created and trained, e.g., using the fit method in sklearn (model.fit()).
• After training, the model is tested.
• No model is perfect, i.e., the result may be very accurate, but some error is always expected.
Chapter 4 Elements of Machine Learning
CONTENTS
4.1. What Is Machine Learning? ............................................................... 62
4.2. Introduction To Supervised Learning ................................................. 63
4.3. Introduction To Unsupervised Learning ............................................. 71
4.4. A Challenging Problem: The Cocktail Party ....................................... 73
Example
A certain website classifies images as appropriate (non-sexual, non-violent) or inappropriate, watching users' actions and certain characteristics to improve its classification algorithm. In this scenario, what is the task T?
a. Classifying images as appropriate or inappropriate;
b. Watching users' actions;
c. The number of images correctly classified;
d. None of the above.
Solution
According to Mitchell's (1997) definition of machine learning, the problem above can be interpreted in terms of the following components:
• Experience (E): watching users' actions and certain characteristics;
• Task (T): classifying images as appropriate or inappropriate;
• Performance measure (P): the number of images correctly classified.
Answer: (a)
End of solution
Machine learning tasks can be subdivided into supervised learning, unsupervised learning, and reinforcement (or active) learning. These tasks are described in more detail in the following sections of this chapter. The algorithms used in machine learning are intimately related to other fields of science. To mention some:
• Data Mining: While machine learning focuses on predictions extracted from known features, data mining is concerned with discovering unknown features in the data. However, the methods and techniques used in machine learning and data mining overlap significantly.
• Optimization: Learning problems are formulated as optimization problems, focused on minimizing a loss function that measures the error between the model outputs and the desired targets on a dataset.
For instance, in a classification problem, the loss function expresses the difference between the predicted labels (the ones produced by the model) and the observed labels. An optimization problem is formulated to minimize such a loss function.
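To make this concrete, here is a minimal sketch of one common loss function, the mean squared error (MSE), in plain Python (the function name and the toy numbers are illustrative):
def mse_loss(predicted, observed):
    '''Mean squared error between two equal-length sequences.'''
    return sum((p - o)**2 for p, o in zip(predicted, observed)) / len(predicted)

# toy example: two predictions versus two observed targets
print(mse_loss([2.0, 3.0], [2.5, 2.0]))  # ((-0.5)**2 + (1.0)**2) / 2 = 0.625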
[Figure 4.1: Scatter plot of housing price (USD, in thousands) versus size (m²) for the data in Table 4.1.]
Table 4.1: Housing Price (USD) x Size (m²) in Buenos Aires (Hypothetical Data)

Size (m²)    Price (USD, in thousands)
4.0          2.766
4.4          –4.2666
8.0          18.802
12.0         31.773
16.0         23.174
18.0         37.964
19.6         45.777
20.0         42.564
By visualizing the graph above, one could build a line representing a model able to predict housing prices in Buenos Aires based on size in square meters. Without performing any calculation, a possibility would be to directly connect the first and last points of the data. This is represented as a line over the data in Figure 4.2.
Figure 4.2: Housing price in Buenos Aires, in USD (hypothetical) with initial
model.
A second choice, also without performing a single calculation, would be to connect the lowest and highest values with a line, thus constructing a second model, as represented in Figure 4.3.
Figure 4.3: Housing price in Buenos Aires, in USD (hypothetical) with initial
model (Model 1) and second model (Model 2).
Both of these models represent a (very simplistic) algorithm to try to predict the housing price. Such a type of prediction is also referred to as regression. A straight line can be written as:

y = a1 x + a0

where y is the output (the price, in this case), x is the independent variable (the size in square meters), and a1 and a0 are the parameters of the model. Adapting the notation to the current problem, one could also write:

p = a1 s + a0
where p is the price and s is the size of the building. For Model 1, the parameters can be calculated using the two connected points, which are represented in Table 4.2.

Table 4.2: Points used for Model 1

Size (m²)    Price (USD, in thousands)
4.0          2.766
20.0         42.564
Substituting the size and price, respectively, in the line equation stated above gives two equations:

2.766 = 4 a1 + a0
42.564 = 20 a1 + a0

By subtracting the first from the second, one can obtain the value of the parameter a1:

42.564 − 2.766 = 20 a1 − 4 a1 + a0 − a0
39.798 = 16 a1
a1 = 39.798 / 16
a1 = 2.487

Substituting a1 back into the first equation gives a0:

2.766 = 4 a1 + a0
2.766 = 4 × 2.487 + a0
a0 = −7.182

Therefore, the equation (algorithm) that defines Model 1 can be expressed as:

p = 2.487 s − 7.182
By simply replacing known values of housing size in the above equation, one could theoretically predict prices for the region of Buenos Aires. However, another question arises: what about accuracy? Is this model accurate? How well can it predict real prices? That is another point to be considered throughout this book.

For the second model, Table 4.3 shows the data used.

Table 4.3: Points used for Model 2

Size (m²)    Price (USD, in thousands)
4.4          2.267
19.6         45.777
A deduction similar to the one used for Model 1 can be performed to arrive at the following equations defining this model:

2.267 = 4.4 a1 + a0
45.777 = 19.6 a1 + a0

Subtracting the first equation from the second:

43.51 = 15.2 a1
a1 = 2.862
a0 = −10.318

The final equation for Model 2 thus is:

p = 2.862 s − 10.318
The models developed above, though theoretically capable of being used to make predictions of housing prices, are not good representations of machine learning algorithms, since they were developed by ignoring all the information carried by the complete dataset except the two points adopted in each case. In a machine learning system, the equation would be generated from the experience acquired through the observation of all the points in the dataset, and the parameters would be adjusted not by direct calculation, but through an optimization algorithm, such as linear least squares (LLS) for linear systems or other algorithms for non-linear ones.
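For illustration, a minimal sketch, assuming NumPy is available, of fitting a line to all the points of Table 4.1 by linear least squares:
import numpy as np

size = np.array([4.0, 4.4, 8.0, 12.0, 16.0, 18.0, 19.6, 20.0])
price = np.array([2.766, -4.2666, 18.802, 31.773, 23.174, 37.964, 45.777, 42.564])

# degree-1 (linear) least-squares fit: returns the slope a1 and intercept a0
a1, a0 = np.polyfit(size, price, 1)
print(a1, a0)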
Another example of supervised learning is the problem of classifying data. For instance, consider Table 4.4 for breast cancer, which shows the classification of each cancer sample as malignant or benign as a function of the tumor size.

[Figure: Tumor classification, malignant Y(1) versus benign N(0), as a function of tumor size.]
Example
Suppose a company has hired you, a data scientist, to help it address the following problems:
• Problem 1: The company sells a variety of items, and it would like to predict its profit for the next 3 years based on past information.
• Problem 2: The company has e-mail accounts for each of its employees. You are requested to analyze each of the e-mail accounts and decide, for each one, whether it has been hacked or compromised.
How should these problems be treated?
a. Both involve supervised learning and are classification problems;
b. Both involve supervised learning and are regression problems;
c. Both involve supervised learning, but Problem 1 is a classification problem while Problem 2 is a regression problem;
d. Both involve supervised learning, but Problem 1 is a regression problem while Problem 2 is a classification problem;
e. They are not supervised learning problems.
Solution
Let us consider each problem separately, so we can correctly answer the question above.
• For Problem 1, the algorithm consists in using past sales data to predict profit. As profit is continuous data and not a class, this problem is a regression.
• For Problem 2, the output of the algorithm is hacked/not hacked. Since these are classes, this is a classification problem.
Answer: (d)
End of solution
4.3.1. Clustering
Suppose you are running a business and you sell stuff to people. Now, people
buy stuff based on some criteria, for instance: necessity, status, desire or
anything else. How can one know what drives people's purchase decisions?
In this sense, clustering algorithms work by finding and grouping
data (people) according to hidden patterns and inferences (people who buy
out of necessity versus people who buy out of desire). These patterns
and inferences are not known a priori; it is the machine learning algorithm
that uncovers them. And it does this through unsupervised learning. The
groups found are called clusters, and their number is usually fixed,
so the developer has some control over the granularity of the system. Several
common types of clustering exist; a sketch of the general idea is shown below.
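As a minimal sketch of this idea (assuming the scikit-learn library and hypothetical customer data, neither of which comes from the book's examples), k-means can group buyers without ever seeing a label:
import numpy as np
from sklearn.cluster import KMeans

# hypothetical customers described by (purchase frequency, average spend)
customers = np.array([[1, 20], [2, 25], [30, 200], [28, 180], [3, 30]])

# ask for two clusters; the algorithm finds the grouping on its own
labels = KMeans(n_clusters=2, n_init=10).fit_predict(customers)
print(labels)  # e.g., [0 0 1 1 0] – cluster IDs are arbitrary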
3 × 9 × 10⁴ = 2.7 × 10⁵ values. For a set with “only” 1000 images (considered
a relatively small set), the total number of data values available will be:
2.7 × 10⁵ × 1000 = 2.7 × 10⁸
This shows how vital data compression is depending on the problem
being solved.
Exercises
1) Which of the following problems can be considered supervised
learning? (Check all that apply)
a. (____) From a collection of photos and information about what is
on them, train a model to recognize other photos.
b. (____) From a set of molecules and a table informing which are
drugs, train a model to predict if a molecule is also a drug.
c. (____) From a collection of photos showing four different
persons, train a model which will separate the photos in 4 groups,
each one for one person.
d. (____) From a set of clients, group them according to similarity to
retrieve patterns.
2) According to the definition of Mitchell (1997), “A computer
program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance
at tasks in T, as measured by P, improves with experience E.” A
predictive system is built by feeding it a lot of blog comments, and
it has to predict the behavior of users. In this context, T represents
which component of the above description?
a. (____) The process of learning from the data;
b. (____) Blog comments;
c. (____) None of those;
d. (____) The probability of predicting the user behavior correctly.
3) Suppose you are working on the algorithm for Question 2, and
you may predict user behavior as: violent, charismatic, happy or
sad. Is this a classification or regression problem?
a. (____) Regression;
b. (____) Classification;
c. (____) None of those;
5
Linear Regression With One Variable
CONTENTS
5.1. Introduction ...................................................................................... 78
5.2. Model Structure ................................................................................ 78
5.3. Cost Function ................................................................................... 80
5.4. Linear Regression Using Gradient Descent Method Using Python..... 96
Exercises ................................................................................................ 111
5.1. INTRODUCTION
Chapter 4 gave a general overview of the elements involved in machine
learning algorithms. In this chapter, we explore one specific supervised
learning algorithm – linear regression – constrained to a single feature
(variable) and one output. This model can be described as one of the simplest
yet most didactic algorithms of machine learning.
As the name suggests, linear regression with one variable is characterized
by:
• Linear: A function of order 1, i.e., the highest exponent in a
variable is 1, as in: y = 2x + 1; y = 0.4x; y = x/20;
• Regression: As mentioned in Chapter 4, regression consists
in one type of supervised learning, where the model is trained
(learns) from past data, to predict certain values called target(s)
or output(s);
• With one variable: A machine learning algorithm may involve as
many variables as necessary, from one to thousands or billions,
limited only by the computational capacity. In this section, we deal
with problems containing a single variable, e.g., predicting human
weight based on size; house prices based on size; company profit
based on number of sales, etc.
Table 5.1: Housing Size and Price
Housing Size (ft²)   Price (USD)
648     582
548     425
1069    825
681     488
912     631
425     364
675     441
615     589
417     492
1147    711
Suppose the complete dataset contains 100 rows of data. The Housing
Size variable is used as an input to predict the Prices variable. Using these
concepts we can use the following notation:
• m = 100 (corresponds to the training examples or records);
• x (s) are the input variables, also called “features”;
• y (s) are the output variables, also called “targets.”
Each pair of data (x, y) is referred to as one training example. Therefore,
using the data provided in Table 5.1, the pair (x0, y0) corresponds to
the record (648, 582), while the pair (x9, y9) corresponds to the record
(1147, 711).
In any statistical predictive algorithm, the problem is structured according
to the following workflow:
• Define the training set (Table 5.1; Figure 5.1);
• Attribute a learning algorithm;
• Define a hypothesis h (e.g., the Housing Size is linearly
correlated with the Price, thus linear regression is a good choice
for predicting);
• With the hypothesis, perform predictions on the “target” variable.
In a linear regression, the hypothesis is the linear correlation, meaning
that the target can be predicted using the following equation:
h(x) = a0 + a1x
where x is the input variable (housing size), while the ai are the parameters
of the model to be adjusted (calibrated). If there is only one input variable
(as in the above case), then there are two parameters, a0 and a1, and the
problem is called a univariate linear regression.
Figure 5.3: Prediction of Housing Price using “slightly better” parameter values.
Though much better, this output is still not optimal. We could keep
trying different values many times, but obviously this would not be the best
approach. Instead, for each point i we want the difference between prediction
and observation to be as small as possible:
(h(xi) − yi)² = minimum
Of course, we don't want to find the minimum difference only for a point
i, but for the complete set of records available in the training set.
Thus, it is better to define that we want to minimize the sum of the
squared differences. In mathematical notation this is written as:
min(a0, a1) Σᵢ (h(xi) − yi)²
Figure 5.5: Values of J for Housing Price using a1 = 0.74 as a fixed parameter.
Figure 5.6: Surface showing values of J for Housing Price and the approximate
minimum, pointing to the values of a0 and a1.
A look at the figure above shows that the minimum cost lies exactly
in the middle of the inner valley of the contour plot. The values found at
this position are:
a0 = 202.23
a1 = 0.49
J = 3.1 × 10⁵
Notice how the cost value J is much lower than the values found when
trying to adjust the parameters one at a time! Still, this value cannot be said
to be the absolute minimum, since the graphical analysis may be prone to
small deviations. So how can we determine the real minimum value of the
cost function J mathematically?
First we will work on the classical method to determine the optimum
linear regression. Though straightforward, this method can only be directly
used for linear regression. Afterwards, we work on an iterative procedure that
can be used for linear and non-linear models, thus being more general than
the first one.
ei = yi − a0 − a1xi
J = Σᵢ₌₁ⁿ ei² = Σᵢ₌₁ⁿ (yi − a0 − a1xi)²
where n is the number of training examples. The minimum of the above equation
is obtained at a stationary point of the function. Mathematically it
means the point where the derivatives (with respect to a0 and a1) are equal
to zero. First apply the chain rule to determine the derivative with respect
to a0:
∂J/∂a0 = (∂J/∂e)(∂e/∂a0)
∂J/∂e = 2 Σᵢ₌₁ⁿ (yi − a0 − a1xi)
∂e/∂a0 = −1
∂J/∂a0 = −2 Σᵢ₌₁ⁿ (yi − a0 − a1xi) = 0
Divide both sides of the equation by 2n and rearrange to find the optimal
value of a0:
−(2 Σᵢ₌₁ⁿ yi)/(2n) + (2 Σᵢ₌₁ⁿ a0)/(2n) + (2 Σᵢ₌₁ⁿ a1xi)/(2n) = 0
Remember that the average value ȳ is defined as (Σᵢ₌₁ⁿ yi)/n; the
equation can then be greatly simplified:
−ȳ + a0 + a1x̄ = 0
a0 = ȳ − a1x̄
Rearranging,
Σᵢ₌₁ⁿ xi(yi − ȳ) + a1 Σᵢ₌₁ⁿ xi(x̄ − xi) = 0
a1 Σᵢ₌₁ⁿ xi(x̄ − xi) = −Σᵢ₌₁ⁿ xi(yi − ȳ)
a1 = −Σᵢ₌₁ⁿ xi(yi − ȳ) / Σᵢ₌₁ⁿ xi(x̄ − xi)
a1 = Σᵢ₌₁ⁿ xi(yi − ȳ) / Σᵢ₌₁ⁿ xi(xi − x̄)
Using the above equation, one can directly determine the optimum value
of the parameter a1 and in sequence the value of a0 can be determined.
The fact that the Least Squares Method is a direct, non-iterative
method of determining optimum parameters represents a huge advantage over
iterative methods, which rely on performing a calculation over and over
until a value converges. However, the arithmetic solution was derived
directly for a linear model (line equation). The same type of solution cannot
be directly used for non-linear models, which necessarily require an iterative
process to determine the optimal parameters.
In the following section, it is demonstrated how Python can be used to
solve the linear regression problem using as example the Housing Price data
mentioned above.
# loop over the file, filling the lists with x and y vectors
for line in file:
    row = line.split(',')
    x.append(float(row[0]))
    y.append(float(row[1]))
The line,
with open('housing_data.csv', 'r') as file:
opens the file and attributes it to the object file. The following line,
# read header
file.readline()
reads the first line of the file and does not store it. This is necessary since the
first row contains the headers and not actual data.
The following lines iterate over each row of the file, transforming data
to floating-point and appending to the lists x and y .
The Least Squares Method requires the calculation of the average value
of both x and y. To do that, define a function called average(x), which
receives a list x as input, sums up all the values and divides by the number
of elements in the list, yielding the average.
def average(x):
    return sum(x) / len(x)
Using this function obtain the average of each variable of the problem:
avg_x = average(x)
avg_y = average(y)
With that, we are ready to obtain the value of the parameter a1 followed
by a0. The values of these variables can be obtained by generating a list
through list comprehension (or using an explicit for loop) and summing
up all the values, as sketched below.
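The listing is cut at this point in the original; a plausible sketch of the summations, consistent with the formulas derived earlier, would be:
# sketch (assumed, not the original listing): apply the derived formulas
num = sum([xi*(yi - avg_y) for (xi, yi) in zip(x, y)])
den = sum([xi*(xi - avg_x) for xi in x])
a1 = num / den
a0 = avg_y - a1*avg_x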
The SQtotal is the total squared sum, defined as the accumulated squared
difference between each point and the average value.
SQtotal = Σᵢ₌₁ⁿ (yi − ȳ)²
First, let's obtain SQres and SQtotal in Python by generating a list using
list comprehension and applying the sum() function.
sqres = sum([(yi - (a1*xi + a0))**2 for (xi, yi) in zip(x, y)])
sqtot = sum([(yi - avg_y)**2 for yi in y])
Compute R² by performing the division of sqres by sqtot as done in the
following code:
R2 = 1 - sqres/sqtot
print('Coefficient of Determination R2 = ', R2)
Finally, print the final value of the cost function J by referring to the
variable sqres, which stores exactly the same value.
J = sqres
print('Final Cost J = {:.3e}'.format(J))
The complete code of this example is stored in Appendix A. The final
values obtained for the parameters a0 , a1 , for the coefficient of determination
R 2 and for the Cost Function J are:
• a0 = 202.283;
• a1 = 0.493;
• R² = 0.875;
• J = 3.09 × 10⁵.
Notice that there is a small difference between the cost function value J
obtained from this mathematical procedure and the one from the graphical
procedure. Still, it is important to observe that the value above represents the
exact global minimum cost, while in the graphical procedure this value is
never exactly determined, only approximated.
a certain value of the cost function J(a0(i), a1(i)). The objective is to have an
algorithm where, at each iteration, the cost function J is reduced through the
modification of a0(i) and a1(i), until it reaches a minimum, indicated by
convergence.
The update equation of the GD method is as follows:
θi(n+1) = θi(n) − α ∂J(θ)/∂θi
# var0 is the updated value of theta_0
var0 = theta_0 - alpha * der(J, theta_0, theta_1)
theta_0 = var0
# var1 is the updated value of theta_1
var1 = theta_1 - alpha * der(J, theta_0, theta_1) # notice that here, theta_0 = var0. WRONG!
# now update the old values for the next iteration
theta_1 = var1
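For contrast, a correct simultaneous update in the same pseudocode style (der stands for the placeholder derivative routine used above) could be sketched as:
# correct: both derivatives are evaluated at the OLD (theta_0, theta_1)
var0 = theta_0 - alpha * der(J, theta_0, theta_1)
var1 = theta_1 - alpha * der(J, theta_0, theta_1)
# only now overwrite the old values for the next iteration
theta_0 = var0
theta_1 = var1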
Let's develop some intuition about the GD algorithm, for a better
understanding of how it can find the minimum of a function by “following”
the gradient. Remember that the cost function, defining the error of the
model, consists in the squared difference between the observed value y_obs
and the predicted value y_pred = a1x + a0:
J = Σᵢ₌₀ⁿ (y_obs,i − (a1xi + a0))²
x      y
−5.0   −13.0
0.1    −2.8
5.0    7.0
Figure 5.7: Values of J for linear regression with fixed a0 = −3 and varying a1 for the data in Table 5.2.
Figure 5.8: Value of learning rate α is too small, requiring too many iterations.
Figure 5.10: Value of learning rate α is too high, producing relative divergence in the search for the minimum.
Figure 5.11: Example of local minima (“Local Optimum”) and global minima (“Global Optimum”) in a non-linear cost function.
To apply the GD method to linear regression, it is necessary to determine
the derivatives of the cost function J with respect to each parameter, a0 and
a1, for the parameter updating equation. These derivatives are shown in the
equations below.
∂J(a0, a1)/∂a0 = −2 Σᵢ₌₀ⁿ (y_obs,i − a1xi − a0)
∂J(a0, a1)/∂a1 = −2 Σᵢ₌₀ⁿ xi (y_obs,i − a1xi − a0)
Table 5.3: Company Sales and Advertising Investment from 2009 to 2018
where y_obs is the observed output (sales) and y_pred is the predicted output.
From the equation of the linear regression, the error can thus be rewritten as:
e = y_obs − (a0 + a1x) = y_obs − a0 − a1x
def error(yobs, x, a0, a1):
    '''
    e = error(yobs, x, a0, a1)
    Inputs:
        yobs (float) – observed value
        x (float) – input
        a0 (float) – parameter of linear model
        a1 (float) – parameter of linear model
    '''
    return yobs - a0 - a1*x
It is important, at each step, to define test cases so we are assured that
the code is properly set up. The following code tests the error for a simple
example with known output.
# Test case
a0 = 1; a1 = 1
x = 1
ypred = a0 + a1*x
yobs = 2
error_true = yobs - ypred
error_calc = error(yobs, x, a0, a1)
if abs(error_true - error_calc) < 1e-5:
    print("Passed!")
else:
    print("Wrong error!")

Passed!
The cost function J (the sum of squared errors, SSE) can be assembled from
error(); the listing is cut in the original, so the body below is a sketch:
def J(yobs, x, a0, a1):
    J = sum([error(yi, xi, a0, a1)**2 for (xi, yi) in zip(x, yobs)])
    return J
Define a simple test case to evaluate if the SSE is correctly obtained.
# Test case
a0 = 1; a1 = 1
x = [1, 2, 3]
yobs = [1, 2, 3]
ypred = [2, 3, 4]
errors = [-1, -1, -1]
errors2 = [1, 1, 1] # squared errors from the list above
J_real = 3 # the result of summing up all values from the list above
J_calc = J(yobs, x, a0, a1)
if abs(J_calc - J_real) < 1e-5:
    print("Passed!")
else:
    print("Wrong SSE!")
Passed!
For visualization of the error, let us define a function which can plot
the cost function J(a0, a1) for a range of values of a0 and a1, and insert a
point at the current value of J according to the selected a0 and a1. To
generate this visualization we will need the numpy library. The template of this
function is
def plotJ(yobs, xobs, a0, a1, [a0min, a0max], [a1min, a1max])
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # enables the 3D projection

def plotJ(yobs, xobs, a0, a1, a0range, a1range):
    '''
    plotJ(yobs, xobs, a0, a1, a0range, a1range)
    Plot a 3D surface of the cost function and show some points being
    evaluated.
    Inputs:
        yobs (list) – list of observed values
        xobs (list) – list of observed inputs
        a0 (list) – parameter of linear model
        a1 (list) – parameter of linear model
        a0range (list (2,)) – limits of a0 to be plotted
        a1range (list (2,)) – limits of a1 to be plotted
    '''
    Jpoint = []
    for a0val, a1val in zip(a0, a1):
        Jpoint.append(J(yobs, xobs, a0val, a1val))
    # grid of parameter values for the surface (step implied by A0, A1, Jmat)
    A0, A1 = np.meshgrid(np.linspace(a0range[0], a0range[1], 50),
                         np.linspace(a1range[0], a1range[1], 50))
    Jmat = np.array([[J(yobs, xobs, a0v, a1v)
                      for a0v, a1v in zip(rowA0, rowA1)]
                     for rowA0, rowA1 in zip(A0, A1)])
    # generate the plotting of the surface and the point being evaluated
    fig = plt.figure(figsize = (10,10))
    ax = fig.add_subplot(111, projection = '3d')
    ax.plot_surface(A0, A1, Jmat, alpha = 0.5, rstride = 1, cstride = 1)
    ax.plot(a0, a1, Jpoint, 'o-r')
    ax.set_xlabel('a0')
    ax.set_ylabel('a1')
    ax.set_zlabel('J(a0, a1)')
with respect to the parameters a0 and a1. They are defined according to the
following equations:
∂J/∂a0 = −2 Σᵢ₌₀ⁿ⁻¹ (y_obs − a0 − a1x) = −2 Σᵢ₌₀ⁿ⁻¹ e(a0, a1)
∂J/∂a1 = −2 Σᵢ₌₀ⁿ⁻¹ x(y_obs − a0 − a1x) = −2 Σᵢ₌₀ⁿ⁻¹ x·e(a0, a1)
def derJ(yobs, xobs, a0, a1):
    # derivatives of the cost with respect to a0 and a1 (body sketched;
    # the original listing shows only the return statement)
    dJda0 = -2*sum([error(yi, xi, a0, a1) for (xi, yi) in zip(xobs, yobs)])
    dJda1 = -2*sum([xi*error(yi, xi, a0, a1) for (xi, yi) in zip(xobs, yobs)])
    return dJda0, dJda1
As was done with all previous functions, evaluate this one at a point
with known values to test the correctness of the code implemented above.
xobs = [1, 2, 3]
yobs = [2, 3, 4]
a0 = 1; a1 = 1
dJda0, dJda1 = derJ(yobs, xobs, a0, a1)
if abs(dJda0 + dJda1) < 1e-5:
    print("Passed")
else:
    print("Wrong implementation")
Passed
a1(n+1) = a1(n) − α ∂J(a0(n), a1(n))/∂a1
We now define the update() function, which receives the current value
of the derivative, the learning rate and the parameter, and applies the above
equation.
def update(aold, alpha, dJ):
    anew = aold - alpha * dJ
    return anew
The final step of the algorithm implementation consists in creating the
loop to update the parameters of the linear equation until they converge to
a value, assumed to be the optimum set. This is done through the function
shown below (the loop body is a sketch, reconstructed to be consistent with
the calls and output that follow).
def solveGD(yobs, xobs, initguess, alpha, MAXITER = 100, verbose = False):
    # start the 'old' values far from the guess so the loop runs at least once
    a0 = initguess[0] + 1000
    a1 = initguess[1] + 1000
    a0new = [initguess[0]]
    a1new = [initguess[1]]
    Jval = [J(yobs, xobs, a0new[-1], a1new[-1])]
    iter_ = 0
    if verbose:
        print(f"{'iter':5s}\t{'a0':5s}\t{'a1':5s}\t{'J':5s}\t")
    while iter_ < MAXITER and (abs(a0 - a0new[-1]) > 1e-8 or abs(a1 - a1new[-1]) > 1e-8):
        a0, a1 = a0new[-1], a1new[-1]
        if verbose:
            print(f'{iter_:5d}\t{a0:5.2f}\t{a1:5.2f}\t{Jval[-1]:5.2e}\t')
        dJda0, dJda1 = derJ(yobs, xobs, a0, a1)
        a0new.append(update(a0, alpha, dJda0))
        a1new.append(update(a1, alpha, dJda1))
        Jval.append(J(yobs, xobs, a0new[-1], a1new[-1]))
        iter_ += 1
    return {'a0': a0new[-1], 'a1': a1new[-1],
            'history': {'a0': a0new, 'a1': a1new, 'J': Jval}}
a0 = 0
a1 = 2
optim = solveGD(sales, advertising, [a0, a1], alpha = 0.00001, MAXITER = 100, verbose = True)
iter a0 a1 J
0 0.00 2.00 3.46e+06
1 0.11 6.88 1.50e+06
2 0.18 10.03 6.74e+05
3 0.23 12.08 3.30e+05
4 0.26 13.40 1.86e+05
5 0.28 14.26 1.26e+05
6 0.29 14.81 1.00e+05
7 0.30 15.17 8.97e+04
8 0.31 15.40 8.53e+04
9 0.31 15.55 8.34e+04
10 0.31 15.65 8.27e+04
11 0.32 15.71 8.23e+04
12 0.32 15.75 8.22e+04
13 0.32 15.78 8.22e+04
14 0.32 15.79 8.21e+04
15 0.32 15.81 8.21e+04
16 0.32 15.81 8.21e+04
17 0.32 15.82 8.21e+04
18 0.32 15.82 8.21e+04
19 0.32 15.82 8.21e+04
20 0.32 15.82 8.21e+04
21 0.32 15.82 8.21e+04
22 0.32 15.83 8.21e+04
23 0.32 15.83 8.21e+04
24 0.32 15.83 8.21e+04
25 0.32 15.83 8.21e+04
with a final value of the cost function J ≈ 8.2 × 10⁴. This means a reduction
of two orders of magnitude from the initial guess, which started at
3.5 × 10⁶. Figure 5.15 illustrates the final result obtained, with the optimum
linear regression according to the GD algorithm.
a0 = optim['a0']
a1 = optim['a1']
sales_pred = [a0 + a1*x for x in advertising]
plt.plot(advertising, sales, 'ob', label = 'Observed')
plt.plot(advertising, sales_pred, '--r', label = 'Predicted')
plt.xlabel('Advertising')
plt.ylabel('Sales')
plt.grid()
In the next figure, observe how the linear regression model evolves
through the iterations, reaching the final optimum slope and intercept.
It is important to notice that changing the learning rate
will change this convergence. If it is increased above a certain threshold,
the method will not converge to the optimum but will diverge, which is not desired.
Therefore, it is essential to keep the learning rate inside a window in which
the model converges.
plt.plot(advertising, sales, 'ob', label = 'Observed')
for a0, a1 in zip(optim['history']['a0'], optim['history']['a1']):
    sales_pred = [a0 + a1*x for x in advertising]
    plt.plot(advertising, sales_pred, '--')
plt.xlabel('Advertising')
plt.ylabel('Sales')
plt.grid()
a1max = max(optim['history']['a1'])
plotJ(sales, advertising, optim['history']['a0'], optim['history']['a1'], [-1, 2], [-1, 20])
EXERCISES
1) A consulting office received a project to predict the score of a
hotel booking website based on the number of clicks. The number
of clicks C is how many times a user clicked on
a website feature. The score R relates to how well this
hotel booking website should be ranked, since more clicks
mean more users accessing it, each of whom may book a hotel or
not. In this sense, it would be interesting to predict the score R
of a certain website once the number of clicks on it is known,
since clicks can be easily collected from the server information. The
score R, on the other hand, is obtained through a rather complex
formula and theoretically needs lots of information to be well
established.
Consider the following training set, containing the number of clicks for
different websites and their scores. The score changes annually, so different
records in the table below may refer to the same website. Using a linear
regression as hypothesis (y = a0 + a1x), and considering N as the number
of training samples,
C (Million) R
2 3.0
4 3.1
1 2.2
4 2.6
For the training set above, what is the number of training samples?
(Hint: it should be a number between 0 and 10.)
2) In the training of a linear regression model, one may often refer to
the “Cost Function.” What is it (conceptually)? How is it defined
mathematically? Give an example.
3) Consider the following linear regression equation (model) with
one variable.
y = 2.4 + 3.7x
Can one say that this model is a good predictor for the following dataset?
Why is it good/bad? Give your reasons in measurable terms. (You may
relate to the magnitude of the errors and the observed values, or to the cost
function.)
C (Million) R
2 3.0
4 3.1
1 2.2
4 2.6
4) A linear regression model is defined according to the values of its
parameters, a0 = 0.0 and a1 = 2.2. What is the predicted value y
for x = 2?
5) While developing a model to predict flood levels in a city using
precipitation (rain) levels, an engineer found values for a0 and
a1 such that J ( a0 , a1 ) = 0 . How is that possible? Choose the
correct answer.
(a) This is actually not possible. He made a mistake.
(b) All the data fitted perfectly a straight line, so the model shows no
error.
(c) This always happens if the model has good results (predictions),
since there will be no errors.
(d) This would only happen if all observed values yi = 0 for all
values of i = 0, 1, 2, ..., N.
6) What is the purpose of the Least Squares Method and the Gradient
Descent method in linear regression? What are the differences
between them?
6
A General Review on Linear Algebra
CONTENTS
6.1. Introduction .................................................................................... 114
6.2. Matrices And Vectors ...................................................................... 114
6.3. Addition ......................................................................................... 116
6.4. Multiplication ................................................................................. 117
6.5. Matrix-Vector Multiplication ........................................................... 118
6.6. Matrix-Matrix Multiplication ........................................................... 121
6.7. Inverse And Transpose..................................................................... 123
Exercises ................................................................................................ 127
6.1. INTRODUCTION
This chapter will provide a broad overview on the main concepts of linear
algebra. This will help the reader to bring back some of the basic concepts
on matrix and vector operation, as well as to help in the next chapters to
develop a better understanding of the concepts introduced. Those who
feel confident when working with linear algebra may safely skip this
chapter. However, it is advisable to go through it, since there are many basic
concepts that are easily forgotten by one who is not dealing continuously
with linear algebra calculations.
In the first part, the concepts of matrices and vectors are reviewed. Then
a short outlook on the main operations one can perform with matrices and
vectors is given, always focusing on examples. The last part reviews the
more complex matrix operations, namely the inverse and the transpose. Though
this chapter does not bring programming explicitly, it forms the basic concepts
one should have when working with scientific computing.
the first row, second column, therefore it is the element M 1,2 . Similarly,
the entry 90 is placed in the third row, second column. Therefore, it can be
addressed as element M 3,2 . Since this matrix has only 3 rows and 3 columns,
Note that A has the same values as K, but now arranged as a column
vector. In such a case, it is said that A is the transpose of the vector K.
However, we will revisit this concept later in this chapter.
Since one of the dimensions of a vector is always one, the dimension of
it may be addressed by a single reference, being it a row or a column vector.
For instance, in the vector A, the element A1 = 211 . Still, it would not be
wrong to refer to the same value as A1,1 , though it is not recommended.
Up to here we have addressed the first element of a matrix or vector as
starting with 1. However there are systems that consider the initial counting
from 0 (zero). For example, Python programming language creates matrices
0-indexed, i.e., starting the counting from 0 instead of 1. On the other side,
6.3. ADDITION
Addition of matrices can be easily performed, in a very similar way to
addition of single numbers. For each entry i,j in each matrix, sum the
corresponding numbers and generate the resulting matrix. For example,
consider the following two matrices.
A =
[ 1    3    5   ]
[ 10   30   50  ]
[ 100  300  500 ]

B =
[ 2    4    6   ]
[ 20   40   60  ]
[ 200  400  600 ]
If you have followed the reasoning and tried to do this with different
examples, soon a critical property becomes evident:
Matrix addition can only be performed between matrices with the
same dimensions.
Therefore, trying to add, for instance, a 2x2 matrix to a 3x2 one is
simply mathematically impossible. The number of rows and the number
of columns must both be the same so that addition can be done; a NumPy
sketch is shown below.
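A quick NumPy sketch (not from the book's listings) shows both the element-wise sum and the dimension requirement:
import numpy as np

A = np.array([[1, 3, 5], [10, 30, 50], [100, 300, 500]])
B = np.array([[2, 4, 6], [20, 40, 60], [200, 400, 600]])
print(A + B)  # element-wise sums, same 3x3 shape

# adding a 2x2 matrix to a 3x2 one raises an error:
# np.ones((2, 2)) + np.ones((3, 2))  -> ValueError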
6.4. MULTIPLICATION
Notice that, in such case, the dimension of the original matrix is kept
the same (above is 3x3), with the entries being now multiples of the single
number.
By extension, the division of a matrix by a single number can be seen as
the multiplication of the same matrix by the inverse of that number. For
instance, consider the following division:
[ 2   6   8  ]
[ 1   4   6  ]  / 2
[ 10  20  40 ]
In the above case, the vector (first element) has dimension 1x3 (1 row
and 3 columns), while the matrix (second element) has dimension 3x2 (3
rows and 2 columns). When performing the multiplication, the following
dimensions are multiplied: 1x3 x 3x2. The dimensions 3 (from vector) and 3
(from matrix) are the inner dimensions while the dimensions 1 (from vector)
and 2 (from matrix) are the outer dimensions. Since the inner dimensions are
the same (3), the multiplication can be performed.
The resultant matrix will have the same dimension as the outer ones of
the multiplicands involved. In the above case, the resultant matrix will have
dimensions 1x2.
              [ 2   6  ]
[ 2  6  8 ] × [ 1   4  ]  =  [ 90  196 ]
              [ 10  20 ]
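The same product can be checked with NumPy (an illustrative sketch, not part of the book's listings):
import numpy as np

v = np.array([2, 6, 8])                    # 1x3 row vector
M = np.array([[2, 6], [1, 4], [10, 20]])   # 3x2 matrix
print(v.dot(M))  # [ 90 196] – inner dimensions (3) match, result is 1x2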
A =
[ 1   2  3   2 ]
[ 4  −3  2   2 ]
[ 0   0  1  −2 ]
Multiplying A by the column vector x = [−1, 2, 3, 2]ᵀ, each element of the
result is the dot product of a row of A with x:
y1,1 = 1×(−1) + 2×2 + 3×3 + 2×2 = 16
y2,1 = 4×(−1) + (−3)×2 + 2×3 + 2×2 = 0
y3,1 = 0×(−1) + 0×2 + 1×3 + (−2)×2 = −1
The result is shown below.
y = [16, 0, −1]ᵀ
Table 6.1: Dataset of Poverty Level and Teen Birth Rate for Four US States
State           Poverty Level   Teen Birth Rate
Massachusetts   11              12.5
Mississippi     23.5            37.6
New York        16.5            15.7
Adapted from: Mind On Statistics, 3rd edition apud PennState (n.d.).
After analyzing this data and developing a linear regression model, one
arrives at the following equation.
(Predicted Birth Rate) = 4.3 + 1.4 (Poverty Level)
Using this equation and the data provided, show that the values for the
Predicted Birth Rate can be obtained using linear algebra (i.e., operations
with matrices) instead of performing the calculation explicitly for each record.
Solution
The first step consists in rewriting the scalar equation as a matrix one. To do
so, incorporate a bias of 1 (one) multiplied by 4.3, so the equation
becomes:
(Predicted Birth Rate) = 4.3 (1) + 1.4 (Poverty Level)
In matrix form, the predicted birth rates can be determined through the
solution of the following matrix-vector multiplication.
[ 20.1  1 ]                 [ y1pred ]
[ 11    1 ]   [ 1.4 ]       [ y2pred ]
[ 23.5  1 ] ⋅ [ 4.3 ]   =   [ y3pred ]
[ 16.5  1 ]                 [ y4pred ]
As already shown, the values of the predicted birth rates y1pred, y2pred,
y3pred and y4pred can be determined by using the rule of multiplication,
which generates the following set of equations:
y1pred = 20.1×1.4 + 1×4.3
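A NumPy sketch of the same matrix-vector product (illustrative code, not from the book's listings) produces all the predicted birth rates in one operation:
import numpy as np

X = np.array([[20.1, 1],
              [11.0, 1],
              [23.5, 1],
              [16.5, 1]])        # poverty levels plus a bias column of ones
params = np.array([1.4, 4.3])    # slope and intercept of the fitted model
print(X.dot(params))             # [32.44 19.7  37.2  27.4]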
element y1,1, representing the element of row 1 and column 1 of the
resulting matrix, can be determined as follows.
y1,1 = 1×4 + 3×1 + 2×3 = 13
Remember that, since the first matrix has dimensions 2x3 and the second
one has dimensions 3x2, the resulting matrix will have dimensions 2x2 (2
rows and 2 columns). The element y1,2, representing the element on the
first row, second column, can be calculated as:
y1,2 = 1×5 + 3×2 + 2×4 = 19
Similarly, all elements of the matrix can be calculated. This will generate
the following resulting matrix.
[ 13  19 ]
[ 47  65 ]
Another way of structuring the solution of this multiplication is to
construct it as multiple operations between matrix and vectors. In this case,
the second matrix is broken into column-vectors. Thus,
[ 4  5 ]
[ 1  2 ]  =  HSTACK( [4, 1, 3]ᵀ ; [5, 2, 4]ᵀ )
[ 3  4 ]
Note that the order of the factors matters, since matrix multiplication is
not commutative:
[ 2  1 ]   [ 3  4 ]     [ 7  13 ]
[ 1  3 ] ⋅ [ 1  5 ]  =  [ 6  19 ]

[ 3  4 ]   [ 2  1 ]     [ 10  15 ]
[ 1  5 ] ⋅ [ 1  3 ]  =  [ 7   16 ]
The dimensions of the matrices may also differ. If A is m × n and B is
n × m, then A × B = C, where C is m × m, while B × A = D, where D
has dimensions n × n.
• Associative
Let A, B and C be different matrices, where A × B × C is a valid operation
(according to the inner dimensions). If A × B = D, then A × B × C = D × C.
Additionally, if B × C = E, then A × B × C = A × E. In summary, associating
two matrices first gives the same answer as performing the multiplications
directly.
• Identity multiplication
Consider the identity matrix I, which is composed of the number 1 on
the main diagonal and 0 in all other elements. For instance, the 3x3 identity I₃:
[ 1  0  0 ]
[ 0  1  0 ]
[ 0  0  1 ]
6.7.1. Inverse
Consider a single, real number r; the inverse of r is
1/r = r⁻¹
Application Example
Let C be an invertible 3x3 matrix. Obtain its inverse, C⁻¹.
C =
[ 1  0  1 ]
[ 0  1  4 ]
[ 1  2  1 ]
Solution
From the definition of the inverse, one obtains the following expression.
C ⋅ C⁻¹ = I
Expanding the matrices above and calling c(i,j) the elements of the
matrix C⁻¹ at row i and column j,
[ 1  0  1 ]   [ c1,1  c1,2  c1,3 ]     [ 1  0  0 ]
[ 0  1  4 ] ⋅ [ c2,1  c2,2  c2,3 ]  =  [ 0  1  0 ]
[ 1  2  1 ]   [ c3,1  c3,2  c3,3 ]     [ 0  0  1 ]
To obtain the values of all 9 elements of the matrix C⁻¹, one needs to
solve a set of 9 coupled equations. There is a variety of methods to obtain
such elements without “manually” solving the set of
equations, which would require several substitutions and recurrences. The
final result is as follows.
C⁻¹ =
[ 0.875   −0.25   0.125 ]
[ −0.5     0      0.5   ]
[ 0.125    0.25  −0.125 ]
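The result can be verified with NumPy (an illustrative sketch, not part of the book's listings):
import numpy as np

C = np.array([[1, 0, 1],
              [0, 1, 4],
              [1, 2, 1]])
C_inv = np.linalg.inv(C)
print(C_inv)          # matches the matrix above
print(C.dot(C_inv))   # the identity matrix, up to rounding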
6.7.2. Transpose
To transpose a matrix means to change its “direction,” transforming rows
into columns and columns into rows. For example, let A be a generic
matrix with the form:
A =
[ a1,1  a1,2  ...  a1,m ]
[ a2,1  a2,2  ...  a2,m ]
[ ...   ...   ...  ...  ]
Application Example
Obtain BT from matrix B .
B =
[ 4  5  6 ]
[ 1  0  1 ]
[ 9  8  7 ]
Solution
Calling C = Bᵀ, each element of C is given by c(i,j) = b(j,i):
c1,1 = b1,1   c1,2 = b2,1   c1,3 = b3,1
c2,1 = b1,2   c2,2 = b2,2   c2,3 = b3,2
c3,1 = b1,3   c3,2 = b2,3   c3,3 = b3,3
In summary, the matrix C is
C =
[ 4  1  9 ]
[ 5  0  8 ]
[ 6  1  7 ]
EXERCISES
1) Consider the following two matrices.
A =
[ 1   −5 ]
[ −3   4 ]

C =
[ −2  0 ]
[ 4   1 ]

What is A − C?

[ 1   −5 ]
[ −7   3 ]

[ −1  −5 ]
[ 1    5 ]

[ −2   0 ]
[ −12  4 ]

a) None of the above.
Answer: a
2) Let d = [1, 2, 3]ᵀ. What is d ⋅ (1/3)?

[ 3 ]
[ 6 ]
[ 9 ]

[ 1/3 ]
[ 2/3 ]
[ 1 ]

[ −2 ]
[ −1 ]
[ 0 ]

a) None of the above.
Answer: b
3) Consider a 3-dimensional column vector u, defined as
u = [7, 6, 5]ᵀ
What is uᵀ?
[7  6  5]
[1  0  1]
[5  6  7]
Answer: a
4) A certain industrial process consists in manipulating an electric
signal to generate a certain output. According to the engineering
department, the parameter vector v used in the equation describing
such a process is
v = [6.4  3.2]
The equation representing the dynamics of this process, in matrix form,
is:
o = vᵀ ⋅ i
where o is the output vector, vᵀ is the transpose of the vector v, and i is the
input vector. Let i be the following set of input signals.
i =
[ 1  4  2  8  ]
[ 0  5  2  10 ]
What is o?
[ 4.2  10.3  21.9  12.5 ]
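A sketch of how the product can be computed with NumPy (illustrative, not from the book's listings):
import numpy as np

v = np.array([6.4, 3.2])
i = np.array([[1, 4, 2, 8],
              [0, 5, 2, 10]])
print(v.dot(i))  # the 1x4 output vector o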
7
Linear Regression With Multiple Inputs/Features
CONTENTS
7.1. Gradient Descent For Multiple Variables Linear Regression ............. 134
7.2. Normal Equation ............................................................................ 137
7.3. Programming Exercise: Linear Regression With Single
Input And Multiple Inputs............................................................. 140
The above table can help to understand some basic concepts in linear
regression modeling using multiple features. The vector x(i) is the input
vector (features) of the i-th training example. For instance, consider the
vector x(2):
x(2) = [2  1  300  40]ᵀ
Note that the value of the price in i = 2 is not added to the vector x ( 2) ,
since this variable is not an input, but an output. Also, note that the operator
T is used to denote transpose operation. This is because, by convention, the
features vector is represented as a column-vector.
One can also refer to single elements of the input vector by using a
subscript index, as in x1(2). In a 1-index based system (in contrast with a
0-index based one), this would indicate the 1st element of the vector x(2),
which is the value 2. In a 0-index based system, x1(2) would indicate the 2nd
element of the vector, while x0(2) would refer to the first one.
In linear regression with one input, the model was written as
y = a0 + a1x
Intuitively, in multiple input linear regression the model is written by
incorporating the other variables as multiplying each one by a parameter a j
, where j refers to each feature.
y = a0 + a1 x1 + a2 x2 + a3 x3 + ... + a j x j
Observe that the parameter a0 is not directly multiplied with any input
variable. This is because a0 is considered a bias of the model, i.e., even if
all the features have value 0 (zero), the output y may not be zero if a0 ≠ 0
. For consistency, add a dummy variable x0 = 1 so that each parameter aj is
systematically multiplied by a feature xj, where j = 0, 1, 2, ...:
y = a0x0 + a1x1 + ... + ajxj
Applying the equation to the example mentioned in the beginning of this
chapter (house pricing), at i = 2:
y = a0·1 + 2a1 + 1a2 + 300a3 + 40a4
Some observations can be stated from the equation above and generalized
to any linear regression model:
• The value of the five parameters is initially unknown. They
can be determined by coupling five linearly independent
equations, in which case there is a unique exact solution to
the model, with a perfect fit. When this happens, it is said that the
system is determined. With fewer than 5 records, the system is
underdetermined and there are infinite solutions, i.e., no single solution
can be stated to be correct. Generalizing, for a linear model with
j features, at least j + 1 records i are necessary to obtain a
solution, with the special case that if the number of records is
exactly equal to j + 1, then there is a single, perfect solution.
• In the case that there are more points than necessary to obtain
a determined system, the model can be adjusted to minimize
the errors, characterizing an overdetermined system. In this
case, the solution found is not perfect (mathematically there is no
exact solution) unless some of the points are collinear, leaving only
j + 1 linearly independent points.
In matrix form, the multiple linear regression model can be written as
follows.
Y = XA
where Y is a column-vector of all the output records, A is a column-vector
of parameters and X is a matrix of features, with the records stored in rows
and the features in columns:
X =
[ x1(1)  x2(1)  x3(1)  ...  xn(1) ]
[ x1(2)  x2(2)  x3(2)  ...  xn(2) ]
[ ...    ...    ...    ...  ...   ]
[ x1(j)  x2(j)  x3(j)  ...  xn(j) ]
The training of the multiple linear regression model follows the same
principle as in the single variable linear regression, with the additional
requirement of calibrating more parameters at each iteration.
As in the single variable case, the cost function J consists in the sum
of the squared errors, i.e., of the differences between predicted and observed
values. In matrix form,
J = Σᵢ₌₁ᵐ ((XA)ᵢ − Yᵢ)²
In min-max rescaling, each variable is transformed as
x' = (x − min(x)) / (max(x) − min(x)), where x' is the rescaled variable,
min(x) is the minimum value of the variable x in the training set and
max(x) is the maximum value of x in the training set.
For example, consider the house age feature (in years). If the maximum
age is 40 years and the minimum is 2 years, then in the rescaled house age 40
becomes 1, 2 becomes 0 and a house with age 20 becomes (20 − 2)/(40 − 2) =
18/38 = 9/19, or approximately 0.474.
In mean normalization, the variable is instead transformed as
x' = (x − mean(x)) / (max(x) − min(x)). Note that in this case the range will
not be exactly [0, 1] or [−1, 1]. It will depend on the value of the average and
the spread of the data (difference between maximum and minimum).
Example: Again, consider the house age feature (in years). The training data
is shown below.
x = [40, 30, 20, 22, 25, 32, 38, 21]
min(x) = 20
max(x) = 40
mean(x) = (40 + 30 + 20 + 22 + 25 + 32 + 38 + 21)/8 = 28.5
Thus the normalized value x' for 40 years becomes (40 − 28.5)/(40 − 20) =
0.575 (observe that, in min-max rescaling, it would be equal to 1).
7.1.1.3. Standardization
Consists in transforming the data so it has zero-mean and unit variance.
x' = (x − mean(x)) / std(x)
where mean(x) is the original mean of the data and std(x) is the original
standard deviation. A NumPy sketch of the three rescaling schemes is shown
below.
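A minimal NumPy sketch of the three schemes, using the house-age data from the example above (the code itself is illustrative, not from the book):
import numpy as np

x = np.array([40, 30, 20, 22, 25, 32, 38, 21], dtype=float)

x_minmax = (x - x.min()) / (x.max() - x.min())     # exactly in [0, 1]
x_meannorm = (x - x.mean()) / (x.max() - x.min())  # zero-mean, bounded
x_standard = (x - x.mean()) / x.std()              # zero-mean, unit variance

print(x_meannorm[0])  # 0.575, as computed by hand above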
The features matrix, in this case, consists of m rows, where each row is
a sample i (the sample x(i)). In this way, the cost function can be rewritten
using matrix multiplication.
J(A) = (XA − Y_obs)ᵀ(XA − Y_obs)
Rewriting,
J(A) = ((XA)ᵀ − Y_obsᵀ)(XA − Y_obs)
Both XA and Y_obs are vectors, so the order is not relevant as long as the
dimensions correspond. Performing a simplification of the equation
above, one obtains
J(A) = AᵀXᵀXA − 2(XA)ᵀY_obs + Y_obsᵀY_obs
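Setting the derivative of this quadratic form to zero leads to the well-known normal equation, A = (XᵀX)⁻¹XᵀY_obs. A NumPy sketch with hypothetical data (both the data and the variable names here are illustrative):
import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 3.0]])  # bias column plus one feature
Y = np.array([2.0, 2.9, 4.1])                 # hypothetical observations

A = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
print(A)  # parameters [a0, a1] minimizing J(A)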
7.3.1. Introduction
In this section, we develop a Python program to solve the linear regression
model for (a) one input and (b) multiple inputs. An example dataset illustrates
the performance and predictability of the model.
Before reading these examples, it is strongly recommended to go through
the basics of Python coding, so as to get a better understanding of the code
structure and of how the algorithm is translated into code.
The problem is structured in a series of files contained in Appendix A.3.
The heading of each file is the suggested file name as it is referred to along this
section.
Files included in this exercise (Appendix A.3)
• example1.py – A Python script that steps through the exercise for
single input;
• example1_multi.py – Python script that steps through the exercise
for multiple inputs.
In the first part (linear regression with single input), we step through
the file example1.py, which generates a hypothetical dataset, processes it and
uses Python functions to fit a linear regression model to the dataset. In a
similar way, example1_multi.py performs similar operations to fit a linear
regression model with multiple inputs.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

np.random.seed(999)
X = np.stack((np.ones((100,)), np.random.exponential(5, 100))).T
y = X.dot([-4, 1.2]) + np.random.rand(100)*10
with PdfPages('example1_fig1.pdf') as pdf:
    plt.figure()
    plt.plot(X[:,1], y, 'b.')
    plt.xlabel('population (Thousands)')
    plt.ylabel('profit (1000 USD)')
    plt.grid()
    pdf.savefig()
    plt.show()
def computeCost(X, y, a):
    # average squared-error cost over the whole dataset
    J = 1/(2*len(y))*np.sum((X.dot(a) - y)**2)
    return J
def gradientDescent(X, y, a, alpha, MAX_ITER):
    '''
    Returns:
        a – optimum parameters
    '''
    iter_ = 0 # initial iteration step
    J_history = [] # a list to store the cost at each iteration
    J_history.append(computeCost(X, y, a)) # store the initial cost
    # main loop (sketch; the original listing is cut here): compute the
    # partial derivatives of the cost and update the parameters
    while iter_ < MAX_ITER:
        dJda0 = 1/len(y)*np.sum(X.dot(a) - y)
        dJda1 = 1/len(y)*np.sum((X.dot(a) - y)*X[:,1])
        # update the parameters
        a[0] = a[0] - alpha*dJda0
        a[1] = a[1] - alpha*dJda1
        iter_ += 1
        J_history.append(computeCost(X, y, a)) # store the cost at each iteration
    return a, J_history
The next step consists in evaluating the algorithm implemented above
on the dataset provided. To do so, initialize the parameters a as a vector of
zeros (one value for a0 and other for a1 ). Then set the number of iterations
MAX_ITER and the learning rate ALPHA.
plt.grid()
pdf.savefig()
plt.show()
def computeCost(X, y, a):
    J = 1/(2*len(y))*np.sum((X.dot(a) - y)**2)
    return J
def gradientDescent(X, y, a, alpha, MAX_ITER):
    iter_ = 0
    J_history = [computeCost(X, y, a)]
    J_old = computeCost(X, y, a)*1000
    while iter_ < MAX_ITER and abs(J_old - J_history[-1]) > 1e-5:
        dJda = np.zeros((X.shape[1],))
        for i in range(X.shape[1]):
            dJda[i] = np.sum((X.dot(a) - y)*X[:,i])
        # update the parameters
        a -= alpha*dJda
        iter_ += 1
        J_old = J_history[-1]
        J_history.append(computeCost(X, y, a)) # store the cost at each iteration
    return a, J_history
The rest of the code closely resembles the linear regression with a single
input, with small differences. The complete code can be read in Appendix
A.3, under example1_multi.py. Figure 7.5 shows a comparison of the final
prediction with the observed values.
Figure 7.6: Plot of observed and predicted profit for multiple inputs example.
8
Classification Using Logistic Regression Model
CONTENTS
8.1. Logistic Regression Model Structure................................................ 156
8.2. Concept Exercise: Classifying Success In Exam According
Hours Of Study ............................................................................ 158
8.3. Programming Exercise: Implementation Of The Exam Results
Problem In Python From Scratch .................................................. 160
8.4. Bonus: Logistic Regression Using Keras........................................... 165
l = log_b(p / (1 − p)) = β0x0 + β1x1 + β2x2 + ... + βnxn
According to the above formula, once the values of the parameters β are
known, one can estimate the probability p of an event by knowing the
predictors x. The base b may be any value, though it is usually
e, 10 or 2.
Hours 0.45 0.75 1.00 1.25 1.75 1.50 1.30 2.00 2.30 2.50 2.80 3.00 3.30 3.50 4.00 4.30 4.50 4.80 5.00 6.00
Pass 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1
Suppose that, after running a program, the teacher obtained the following
results for the logistic regression model (not exact).
Table 8.2: Logistic Regression Results for Hours of Study and Exam
where h is hours of study (in decimals, i.e., 1.8 is 1:48). Because β1 = 1.5,
each hour of study adds 1.5 to the log-odds of passing, i.e., it multiplies the
odds by e^1.5 ≈ 4.5.
Since the x-intercept is about 2.7, the logistic model estimates
even odds (50% success probability) for a student who studied 2.7 hours. 2
hours of study would give the student the following probability of passing
the exam:
p = 1 / (1 + e^−(1.5×2 − 4)) = 0.27
On the other hand, for a student who dedicated 5 hours to study,
p = 1 / (1 + e^−(1.5×5 − 4)) = 0.97
Table 8.3 summarizes the probabilities of passing the exam for some
hours of study according to the logistic regression model. Notice that the
model result is significant at the 5% level.
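A short sketch reproducing these probabilities from the fitted model l = 1.5h − 4 (the helper function name is ours, not the book's):
import numpy as np

def pass_probability(hours):
    # logistic transformation of the fitted log-odds
    return 1 / (1 + np.exp(-(1.5*hours - 4)))

for h in [2, 2.7, 5]:
    print(h, round(pass_probability(h), 2))  # ≈0.27, ≈0.51 (even odds), ≈0.97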
8.3.1. Dataset
For this exercise we will use the already mentioned dataset comprising
the hours of study and the results (pass or fail) of 20 students.
import numpy as np

# hours of study
X = np.array([0.45, 0.75, 1.00, 1.25, 1.75, 1.50, 1.30, 2.00, 2.30, 2.50, 2.80,
              3.00, 3.30, 3.50, 4.00, 4.30, 4.50, 4.80, 5.00, 6.00]).reshape(-1, 1)
# exam results (pass = 1 / fail = 0), taken from Table 8.1
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1])
8.3.2. Algorithm
We will implement the logistic regression equation (also known as sigmoid)
using matrix notation for better performance and readability. Consider x
the input vector containing the feature(s) and bias as a 1x2 vector. The
component z can be defined as the matrix multiplication of the parameters
θ and the input vector x:
z = θᵀx
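The helper g used in the prediction code below does not appear in the surviving listing; a minimal sketch consistent with z = θᵀx followed by the sigmoid would be:
import numpy as np

def g(theta, X):
    # assumed shape: apply the logistic function to z = X·theta per sample
    z = np.dot(X, theta)
    return 1 / (1 + np.exp(-z))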
8.3.5. Prediction
When the logistic regression is already trained, it can be used to perform
predictions on the data, i.e., answer the question: What is the probability of
the output given input(s) …? Similarly, it can be assumed that probabilities
higher than a threshold, say 0.5 (50%), can be considered as success (100%)
while probabilities lower than 0.5 (50%) are considered failure (0%). The
threshold is not fixed and will depend on the problem.
def predict_probability(X, theta):
    return g(theta, X)

def predict(X, theta, threshold):
    return predict_probability(X, theta) >= threshold
if self.fit_intercept:
    X = self.__add_intercept(X)
# weights initialization
self.theta = np.zeros(X.shape[1])
i = 0
while i < self.max_iterations:
    z = np.dot(X, self.theta)
    h = self.__sigmoid(z)
    gradient = np.dot(X.T, (h - y)) / y.size
    self.theta -= self.learning_rate * gradient
    i += 1  # advance the iteration counter (implied by the loop condition)
Figure 8.2: Dataset (points) and prediction (curve) showing hours of study and exam results.
ypred = model.predict(X).reshape(-1, 1)
9
Regularization
CONTENTS
9.1. Regularized Linear Regression ........................................................ 171
[Figure: four plots of Stock Price versus x over the range 0 to 2.5, showing (i) the observed points alone, (ii) a linear fit labeled “Linear (Observed),” and (iii)–(iv) polynomial fits labeled “Poly. (Observed)”.]
The choice of λ is vital for the correct fitting of the model. For instance,
choosing a very large value for λ may cause the gradient descent algorithm
to fail to converge. Additionally, it may cause underfitting, where the model
fails to fit even the training data well. This occurs because the cost function
treats the parameter-reduction task as the highest priority, while the data
fitting becomes unimportant. It may even drive the parameters almost to
zero, in which case the model becomes a flat line. A sketch of such a cost
function is shown below.
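A minimal sketch of a regularized linear regression cost, assuming the usual form J = SSE + λ Σ aⱼ² with the bias a0 left unpenalized (the function and variable names here are ours, not the book's):
import numpy as np

def J_reg(X, y, a, lam):
    sse = np.sum((X.dot(a) - y)**2)
    penalty = lam * np.sum(a[1:]**2)  # shrink every parameter except a0
    return sse + penalty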
9.1.1. Exercises
1) When using a logistic regression model to predict certain data,
which statements are true?
a) One can decrease the overfitting possibility by adding more
features to the training set.
b) Regularization always improves model performance.
c) A new feature always results in equal or better model performance.
2) A logistic regression model is trained under two different
circumstances, one using λ = 10 and the other using λ = 0. The
results obtained for the fitted parameters (not necessarily in
the same order as the λ values were presented) are:
Parameters in training 1:
[93.21 33.42]
Parameters in training 2:
[ 2.31 0.79]
10
Introduction to Neural Networks
CONTENTS
10.1. The Essential Block: Neurons ........................................................ 174
10.2. How To Implement A Neuron Using Python And Numpy.............. 176
10.3. Combining Neurons to Build a Neural Network ........................... 177
10.4. Example Of Feedforward Neural Network..................................... 178
10.5. How To Implement A Neural Network Using Python
and Numpy .................................................................................. 179
10.6. How To Train A Neural Network ................................................... 183
10.7. Example Of Calculating Partial Derivatives ................................... 187
10.8. Implementation Of A Complete Neural Network With Training
Method ........................................................................................ 189
where x1 and x2 are the input signals and W is a weight matrix which
multiplies each input by a weight:
x1 → x1 × w1
x2 → x2 × w2
y = f(x1 × w1 + x2 × w2 + b)
10.1.1. Example
Let the 2-input neuron shown above have the following weight and bias
values.
W = [0, 1]
b = 2
WXᵀ + b = 0×3 + 1×4 + 2
WXᵀ + b = 6
y = f(WXᵀ + b) = f(6) = 0.997
The above shows that, given the input vector X = [3, 4] , the neuron
outputs a signal of 0.997.
The neural network built with neurons, which receives input and gives
output(s) is also called a feedforward neural network, or FFNN.
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))
Next, let’s implement the Neuron as a Python class, thus using an
object-oriented approach to construct this basic unit of the neural network.
The Neuron class has an initialization function, which receives the initial
weights and bias, and a process function, which produces the neuron output
after assembling the inputs with the weights and bias with the sigmoid
transformation.
class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def process(self, inputs):  # sketch: weighted sum plus bias, through the sigmoid
        return sigmoid(np.dot(self.weights, inputs) + self.bias)
[Diagram: inputs x1 and x2 feed the hidden neurons h1 and h2, whose outputs feed the output neuron o1.]
h1 = h2 = f(3) = 0.952
o1 = f(W·[h1, h2]ᵀ + b)
o1 = f(0.952×0 + 0.952×1 + (−1))
o1 = f(−0.048) = 0.488
The above calculation shows that, for a given input of X = [3, 4], the
above neural network outputs the value 0.488. It is important to notice that
a neural network is not restricted to the architecture shown above. Rather,
it can theoretically have any number of inputs, hidden layers and output
neurons. Still, the basic idea is always the same: feed the input to the network
and receive the outputs at the other end of it.
The NeuralNetwork class that wires these neurons together is sketched
below; since the listing was cut in the original, the shared weights ([0, 1])
and bias (−1) are assumed from the worked example above.
class NeuralNetwork:
    def __init__(self):
        # assumed values, consistent with h1 = h2 = f(3) and o1 = 0.488 above
        weights = np.array([0, 1])
        bias = -1
        self.h1 = Neuron(weights, bias)
        self.h2 = Neuron(weights, bias)
        self.o1 = Neuron(weights, bias)

    def process(self, x):
        out_h1 = self.h1.process(x)
        out_h2 = self.h2.process(x)
        # the hidden outputs are the inputs of the output neuron
        out = self.o1.process(np.array([out_h1, out_h2]))
        return out
Now we can test the implementation above using the input already evaluated,
X = [3, 4]. The expected result is 0.488.
mynetwork = NeuralNetwork()
X = [3, 4]
print(mynetwork.process(X)) # [0.4881...]
def predict(self, x): # equivalent to the old process function, with a better name
    # create an empty list to store the outputs of the network
    output = []
mynetwork = NeuralNetworkv2()
# add 2 neurons in the hidden layer
mynetwork.add(2, input_shape = 2)
# add 1 output neuron
mynetwork.add(1)
X = [[3, 4]]
print(mynetwork.predict(X)) # [0.4881...]
# Testing for multiple inputs...
Each input is normalized as x' = (x − min(x)) / (max(x) − min(x)), where
x' is the normalized value, x is the regular value, min(x) is the minimum
value and max(x) is the maximum value. Doing the normalization and
encoding the gender column as a 0/1 value, one obtains the following table.
MSE = (1/4) × [(1 − 0)² + (0 − 0)² + (0 − 0)² + (1 − 0)²] = 0.5
The following code shows the implementation of the MSE loss function
using Python and Numpy. Notice that this function receives both the
observed and predicted outputs in order to calculate the loss.
def loss(y_obs, y_pred):
    '''
    loss(y_obs, y_pred) calculates the MSE between observed (y_obs)
    and predicted (y_pred) values.
    '''
    return np.mean((y_obs - y_pred)**2)
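A quick check against the hand calculation above:
print(loss(np.array([1, 0, 0, 1]), np.array([0, 0, 0, 0])))  # 0.5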
To obtain the partial derivative of the MSE with respect to one weight, say
w1, it is necessary to make use of the chain rule so as to break this
derivative into two parts:
∂MSE/∂w1 = (∂MSE/∂y_pred) × (∂y_pred/∂w1)
To obtain ∂y_pred/∂w1, apply the chain rule again, knowing that the weight w1 only
affects h1; thus,
∂y_pred/∂w1 = (∂y_pred/∂h1) × (∂h1/∂w1)
∂y_pred/∂h1 = w5 × f′(w5h1 + w6h2 + b3)
The above comes from the fact that
y_pred = f(w5h1 + w6h2 + b3)
h2 = f(w3x1 + w4x2 + b2) = 0.77
o1 = f(w5h1 + w6h2 + b3) = f(0.77 + 0.77) = 0.823
The neural network outputs y_pred = 0.823, which indicates a strong
prediction towards the female gender (with 82.3% of confidence), while the
subject is actually a male. Calculate the derivative of the loss function with
respect to w1 as follows.
∂MSE/∂w1 = (∂MSE/∂y_pred) × (∂y_pred/∂h1) × (∂h1/∂w1)
∂MSE/∂y_pred = −2(y_obs − y_pred) = −2(0 − 0.823) = 1.646
∂y_pred/∂h1 = w5 × f′(w5h1 + w6h2 + b3) = f′(0.77 + 0.77) = f(1.54) × (1 − f(1.54)) = 0.1453
∂h1/∂w1 = x1 × f′(w1x1 + w2x2 + b1) = 0.5161 × f′(0.5161 + 0.6907) = 0.09147
Assembling all the partial derivatives to obtain the gradient of the loss with
respect to the weight w1:
∂MSE/∂w1 = 1.646 × 0.1453 × 0.09147 = 0.02187
The above shows that if w1 increases, the loss value also increases, i.e.,
the prediction gets worse. Therefore, it is an indication that w1 should decrease,
so as to reduce the error. Let's verify that by using a smaller value of w1, say −1:
h1 = f(w1x1 + w2x2 + b1) = f(−0.5161 + 0.6907 + 0) = 0.54
h2 = f(w3x1 + w4x2 + b2) = 0.77
o1 = f(w5h1 + w6h2 + b3) = f(0.54 + 0.77) = 0.79
Notice how the probability output at o1 has been reduced (from 82% to 79%),
which confirms that reducing the weight also reduced the loss. However,
it is more interesting to implement an algorithm that systematically updates
the weights so as to find the minimum loss, rather than using trial and error.
One iterative method that may be implemented is the Stochastic Gradient
Descent (SGD) method, an algorithm which iteratively
updates the neural network parameters and converges to an optimum
point (as long as it is correctly configured). This algorithm makes use of the
following update equation:
w ← w − η ∂MSE/∂w
where η > 0 is a learning rate which controls the size of the modification
of the parameter. From the above equation it can be seen that:
• If the derivative is positive, then the parameter decreases;
• If the derivative is negative, then the parameter increases.
The above equation is applied at each iteration for all the weights
and bias of the neural network. It should reduce the loss at each iteration
(monotonically or not, i.e., oscillating) until it reaches a minimum, improving
the network prediction capability.
The SGD algorithm is Stochastic since the optimization will be done
by selecting one sample at a time, randomly. The complete algorithm for
optimization consists in:
• Selecting one sample from the dataset at a time;
class TrainableNet:
    '''
    TrainableNet is a neural network class limited to:
    - 2 inputs
    - 1 hidden layer with 2 neurons (h1, h2)
    - 1 output layer with 1 neuron (o1)
    '''
    def __init__(self):
        # Weights
        self.w = np.ones((6,))
        # Biases
        self.b = np.ones((3,))
# Neuron o1
dypred_dw5 = h1 * deriv_sigmoid(sum_o1)
dypred_dw6 = h2 * deriv_sigmoid(sum_o1)
dypred_db3 = deriv_sigmoid(sum_o1)
# Neuron h1
dh1_dw1 = x[0] * deriv_sigmoid(sum_h1)
dh1_dw2 = x[1] * deriv_sigmoid(sum_h1)
dh1_db1 = deriv_sigmoid(sum_h1)
# Neuron h2
dh2_dw3 = x[0] * deriv_sigmoid(sum_h2)
dh2_dw4 = x[1] * deriv_sigmoid(sum_h2)
dh2_db2 = deriv_sigmoid(sum_h2)
# Neuron h2
self.w[2] -= learn_rate * dL_dypred * dypred_dh2 * dh2_dw3
self.w[3] -= learn_rate * dL_dypred * dypred_dh2 * dh2_dw4
self.b[1] -= learn_rate * dL_dypred * dypred_dh2 * dh2_db2
# Neuron o1
self.w[4] -= learn_rate * dL_dypred * dypred_dw5
self.w[5] -= learn_rate * dL_dypred * dypred_dw6
self.b[2] -= learn_rate * dL_dypred * dypred_db3
y_true_all = np.array([
1,
0,
0,
1,
])
in the dataset. The next code is used to train the network and check the
predictions again.
# Train the neural network
mynetwork.train(X, y_true_all)
print([mynetwork.predict(x) for x in X]) # [0.96, 0.06, 0.013, 0.94]
Notice how the network performance improved. It predicted with more
than 90% confidence the two records which are really female and the two
which are male.
In this chapter, the following concepts were presented:
• What a neuron is and how it works;
• The architecture of neural networks;
• The sigmoid function;
• The loss function, specifically the mean squared error (MSE);
• How training works, and that training is loss minimization;
• The use of backpropagation to calculate partial derivatives;
• How to use the stochastic gradient descent (SGD) method to train a
neural network.
11
Introduction to Decision Trees and
Random Forest
CONTENTS
11.1. Decision Trees .............................................................................. 196
11.2. Random Forest .............................................................................. 205
11.3. Programming Exercise – Decision Tree From Scratch
With Python ................................................................................. 207
[Figure: a scatter plot of labeled points in the (x, y) plane and the corresponding decision tree, which splits the dataset at x < 2 (circles) versus x ≥ 2 (crosses).]
Therefore, the task given to a decision tree is: given sample data with
known labels, how should new, unlabeled samples be classified? Consider now
another example, where the data contains three labels.
[Figure: a scatter plot with three labels and the corresponding two-level decision tree: the dataset is first split at x < 2 / x ≥ 2, and one branch is further split at y < 2 / y ≥ 2.]
Event                          Probability
Pick circle, classify circle   25%
Pick circle, classify cross    25%
Pick cross, classify circle    25%
Pick cross, classify cross     25%
One can notice that, from the fours events mentioned above, two of
them consists in a wrong classification, summing to 50% of probability of
misclassifying a datapoint. Therefore, Gini Impurity is 0.5.
Consider there are C total classes and P(i) is the probability of selecting
one datapoint with class i . Gini Impurity formula then consists in,
G = Σ_{i=1}^{C} p(i) × (1 − p(i))
In the example above mentioned, there are two classes (C = 2), and the
probability of picking a cross is the same as picking a circle thus p(i) = 0.5.
The Gini Impurity is therefore,
G = p(1) × (1 − p(1)) + p(2) × (1 − p(2)) = 0.5 × 0.5 + 0.5 × 0.5 = 0.5
When the data is perfectly split, on the other hand, each branch contains a
single class, so p(i) = 1 for that class and the Gini Impurity is 0 (zero), the
lowest and best possible value. This can only be achieved when every data
point is correctly labeled in its class.
Consider now an example of inaccurate data classification. Suppose the
threshold at x is slightly displaced to the left, thus leaving one data point
which is supposed to be a circle together with the crosses. In such a case,
the left side still contains only circle points, thus G_l = 0, where the index l
indicates the left side. For the right (r) side,
G_r = (1/6) × (1 − 1/6) + (5/6) × (1 − 5/6) = 0.278
The splitting quality is measured by weighting the Gini Impurity of each
side (branch) by the fraction of points in it. In the above example, the left
branch contains 4 points while the right contains 6. Thus,
G = 0.4 × 0 + 0.6 × 0.278 = 0.167
The total amount of impurity removed can be calculated by the difference
between the original impurity and the new one:
0.5 − 0.167 = 0.333
The value above, referred to as Gini Gain, defines the best split to be
chosen. In summary, the higher the Gini Gain, the better the split. For
instance, the Gini Gain for the perfect split is 0.5, which is higher than
0.333, showing that the perfect split is better than this inaccurate one.
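These computations can be verified with a few lines of Python (a minimal sketch; the class probabilities are taken directly from the example above):

def gini_impurity(probabilities):
    # G = sum over classes of p * (1 - p)
    return sum(p * (1 - p) for p in probabilities)

g_left = gini_impurity([1.0])          # only circles on the left -> 0.0
g_right = gini_impurity([1/6, 5/6])    # one circle among five crosses -> 0.278
g_split = 0.4 * g_left + 0.6 * g_right
print(round(g_split, 3))               # 0.167
print(round(0.5 - g_split, 3))         # Gini Gain: 0.333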
[Figure: a 13-point, three-class dataset plotted in the x–y plane.]
First obtain the Gini Impurity for the initial arrangement, i.e., with no
split.
G_init = Σ p(i) × (1 − p(i)) = (4/13)(9/13) + (4/13)(9/13) + (5/13)(8/13)
Ginit = 0.663
Now obtain the Gini Impurity for the split shown above, located at x = 0.55.
G_l = (2/3)(1/3) + (1/3)(2/3) = 0.44
Table 11.3: Summary of All Possible Splits and Their Gini Impurity
At each split, a random subset of the p features is selected, in most cases
with size √p or p/3. In this way, randomness is added to the problem,
avoiding correlation between the trees, which tends to improve forecast
performance. This selection of features is also called Feature Bagging; a
minimal sketch is shown below.
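The following lines illustrate the feature selection step with NumPy (a sketch; the value of p is assumed for illustration):

import numpy as np

p = 9  # total number of features (value assumed for illustration)
# each tree considers a random subset of the features, here of size sqrt(p)
n_selected = int(np.sqrt(p))
selected_features = np.random.choice(p, size=n_selected, replace=False)
print(selected_features)  # e.g., [2 7 0]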
S.No. Sepal length Sepal width Petal length Petal Width Species
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0
# for better understanding, change the labels from numeric to strings
# representing real names
translation_dict = {0.0: 'iris setosa', 1.0: 'iris versicolor', 2.0: 'iris virginica'}
# apply the translation to the Species column (reconstructed step)
iris_df['Species'] = iris_df['Species'].map(translation_dict)
iris_df.head()
S.No. Sepal length Sepal width Petal length Petal Width Species
0 5.1 3.5 1.4 0.2 iris setosa
1 4.9 3.0 1.4 0.2 iris setosa
class Question:
    def __init__(self, column, value):
        '''
        initialization of the column and value variables.
        Example: for the question (sepal length >= 2cm),
        sepal_length is the column and 2cm is the value
        '''
        self.column = column
        self.value = value
    def match(self, data):
        '''
        check if the data satisfies the criterion;
        returns True or False
        '''
        value = data[self.column]
        return value >= self.value
    def __repr__(self):
        '''
        Auxiliary method to print the formatted question.
        '''
        condition = ">="
        return f"Is {cols[self.column]} {condition} {str(self.value)}?"

print('Testing the class Question...')
def count_values(rows):
    '''
    create a dictionary with the unique labels;
    the dictionary key is the species and the values are the frequencies
    '''
    count = dict()
    for row in rows:
        label = row[-1]
        if label not in count:
            count[label] = 0
        count[label] += 1
    return count

print('Testing count function...')
count_values(data)
Testing count function...
def partition(rows, question):
    '''
    split the data based on the Boolean answer to the question (true or false)
    '''
    # lists initialization
    true_row, false_row = [], []
    for row in rows:
        if question.match(row):
            true_row.append(row)
        else:
            false_row.append(row)
    return true_row, false_row

print('Testing partition function...')
# the question below is assumed from the comment: column 0 is sepal length
true_rows, false_rows = partition(data, Question(0, 5))
# thus true_rows contains only sepal length values greater than 5cm
len(true_rows)
Testing partition function...
128
def gini(rows):
    '''
    compute the Gini Impurity of a list of rows
    '''
    count = count_values(rows)
    # impurity initialization
    impurity = 1
    for label in count:
        # probability of a unique label
        probab_of_label = count[label]/float(len(rows))
        impurity -= probab_of_label**2
    return impurity

print('Testing Gini Impurity...')
gini(data)
0.6666666666666665
import numpy as np  # if not already imported earlier in the listing

def entropy(rows):
    '''
    compute the entropy of a list of rows
    '''
    entropy = 0
    count = count_values(rows)
    for label in count:
        p = count[label]/float(len(rows))
        entropy -= p*np.log2(p)
    return entropy
print('Testing Entropy of whole dataset....')
entropy(data)
Testing Entropy of whole dataset....
1.584962500721156
Define a function to obtain the information gain. This is a way of
quantifying the accuracy gained by splitting the data, measuring the amount
of information obtained from a split.
def info_gain_gini(current, left_branch, right_branch):
    size_left = len(left_branch)
    size_right = len(right_branch)
    p = size_left/(size_left + size_right)
    return current - p*gini(left_branch) - (1 - p)*gini(right_branch)

def info_gain_entropy(current, left_branch, right_branch):
    # entropy-based variant (reconstructed; the original listing is garbled here)
    size_left = len(left_branch)
    size_right = len(right_branch)
    p = size_left/(size_left + size_right)
    return current - p*entropy(left_branch) - (1 - p)*entropy(right_branch)
def best_split(rows):
    '''
    find the best question to ask, iterating through each unique value of
    each feature and returning the best gain and best question
    '''
    best_gain = 0
    best_question = None
    current = gini(rows)
    # number of features
    features = len(rows[0]) - 1
    for col in range(features):
        values = set([row[col] for row in rows])
        for val in values:
            question = Question(col, val)
            # divide data based on the question
            true_rows, false_rows = partition(rows, question)
            # check if the splitting produces no separation and ignore it
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            # Gini Gain
            gain = info_gain_gini(current, true_rows, false_rows)
            # if this gain is better than the current best one, replace it
            if gain >= best_gain:
                best_gain, best_question = gain, question
    return best_gain, best_question
print('Testing best split...')
a, b = best_split(data)
print(b)
print(a)
Testing best split...
Is Petal length >= 6.9?
99.99552572706934
Up to this point, a top-down approach was adopted to solve each part
of the decision tree algorithm separately. The next step consists in gathering
all these small parts and compiling them into a Decision Tree class which
will be used to perform the classification on the data.
class DecisionNode:
    # Class to contain the internal nodes of the tree
    def __init__(self, question, true_branch, false_branch):
        # question contains the column and value of the question attributed to the node
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch

class Leaf:
    # Each leaf is a terminal node of the tree, holding the class counts
    def __init__(self, rows):
        self.predictions = count_values(rows)
The following function is used to recursively build a tree.
def build_tree(rows):
    # minimal recursive reconstruction: stop when no split improves the gain
    gain, question = best_split(rows)
    if gain == 0:
        return Leaf(rows)
    true_rows, false_rows = partition(rows, question)
    return DecisionNode(question, build_tree(true_rows), build_tree(false_rows))

print('Testing to build the tree...')
tree = build_tree(data)
Testing to build the tree...
12
Principal Component Analysis
CONTENTS
12.1. Introduction .................................................................................. 220
12.2. Mathematical Concepts ................................................................ 220
12.3. Principal Component Analysis Using Python ................................ 227
12.1. INTRODUCTION
This chapter presents the very basic concepts behind Principal Component
Analysis (PCA). This statistical tool is widely used in problems that require
data compression due to high dimensionality, such as image processing.
It is also used when it is necessary to find patterns in high-dimensional data.
The first part of this chapter presents some basic mathematical concepts
that are important to understand before using PCA. The following concepts
will be presented:
• standard deviation;
• covariance;
• eigenvectors;
• eigenvalues.
Some examples will be used to illustrate how PCA works. The examples
are a modification of the one presented by Smith (2002).
x_m = (Σ_{i=1}^{n} x_i) / n
Where n is the number of samples. Unfortunately, the mean alone does not
provide enough information to understand the dataset. To understand
this, consider the following two samples y and z, both having the same
mean (2.5) and exactly the same number of records, but being clearly
different.
y = [1, 2, 3, 4]
z = [0.5, 1, 1.5, 7]
The sample z is clearly more spread (going from 0.5 to 7) than the
sample y , though both show the same mean. The standard deviation
provides a way of measuring such spread, thus helping to better understand
a sample.
The standard deviation can be formally defined as "the average distance
from the data mean to any point."
The formula to compute the standard deviation consists in subtracting the
mean from each point (the distance), squaring it, summing all of them and
dividing by n − 1, then taking the positive square root. Thus,
σ = sqrt[ Σ_{i=1}^{n} (x_i − x_m)² / (n − 1) ]
Applying the above formula on the samples y and z , one can obtain
the standard deviation of such samples.
y      (y − y_m)    (y − y_m)²
1        −1.5         2.25
2        −0.5         0.25
3         0.5         0.25
4         1.5         2.25
Total                 5.00
Total/(n−1)           1.67
Square root           1.29
z      (z − z_m)    (z − z_m)²
0.5      −2           4.00
1        −1.5         2.25
1.5      −1           1.00
7         4.5        20.25
Total                27.50
Total/(n−1)           9.17
Square root           3.03
12.2.2. Variance
Another common measure of sample spread is the variance, which is
mathematically the square of the standard deviation.
σ² = Σ_{i=1}^{n} (x_i − x_m)² / (n − 1)
Exercises
Determine the mean, standard deviation and variance of the following
samples.
[12 21 32 45 54 66 78 86 91]
[12 14 16 19 81 86 88 90 92]
[12 15 16 17 21 23 90 91 92]
12.2.3. Covariance
The statistical measurements mentioned above (mean, standard deviation and
variance) are related to a single sample, and can be calculated independently.
They are 1-dimensional values which can describe datasets such as house
sizes, product prices, exam scores, etc.
However, many datasets have more than one dimension, and one may
want to determine if there is any relation between the dimensions of the
dataset. As an example, consider a dataset is composed by the weight of
students and their exam scores. One could aim at investigating if the weight
of a student is somehow related with his exam scores.
Covariance metrics serves such purpose. It is always used between two
dimensions, or samples, to quantify the relationship between two samples.
If the covariance is applied on a dataset against itself, then it produces the
variance. As an example, for a dataset composed by three samples ( x , y
and z ), then covariance can be determined between x and y , x and z , and
between y and z . Measuring the covariance of x and x would produce
the variance of x .
Using the variance formula, expand the squared difference between
samples and mean to obtain,
σ² = Σ_{i=1}^{n} (x_i − x_m)(x_i − x_m) / (n − 1)
Covariance can be mathematically expressed in a similar formula, where
one of the differences of mean and sample comes from the second dataset,
thus,
cov(x, y) = Σ_{i=1}^{n} (x_i − x_m)(y_i − y_m) / (n − 1)
Putting up in words, covariance can be expressed as, “For each point,
multiply the difference between point x and the average of x with the
difference of y and the average of y. Add all these products and divide by
the dataset size minus 1.”
Suppose the following example. Students were interviewed regarding
their weight and their last exam scores. Therefore, the dataset consists in 2
dimensions, the weight w of each student and the exam scores m .
Sample   w       m     (w − w_m)   (m − m_m)   (m − m_m)(w − w_m)
1 81.83 0.80 5.42 0.26 1.41
2 88.11 1.00 11.70 0.46 5.38
3 63.57 0.00 –12.84 –0.54 6.93
4 86.88 0.90 10.47 0.36 3.77
5 73.64 0.50 –2.77 –0.04 0.11
6 73.18 0.40 –3.22 –0.14 0.45
7 75.96 0.50 –0.45 –0.04 0.02
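The covariance implied by the last column can be verified with NumPy (a sketch; np.cov(w, m) returns the full 2×2 covariance matrix, whose off-diagonal entry is cov(w, m)):

import numpy as np

w = np.array([81.83, 88.11, 63.57, 86.88, 73.64, 73.18, 75.96])
m = np.array([0.80, 1.00, 0.00, 0.90, 0.50, 0.40, 0.50])
print(round(np.cov(w, m)[0, 1], 2))  # approximately 3.01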
where c_{i,j} = cov(dim_i, dim_j). The covariance matrix contains n rows and
n columns. For example, building the covariance matrix for the dataset ( x
, y , z ) gives.
    | cov(x, x)  cov(x, y)  cov(x, z) |
C = | cov(y, x)  cov(y, y)  cov(y, z) |
    | cov(z, x)  cov(z, y)  cov(z, z) |
Exercises
Obtain the covariance between the sample x and y below. Mention what
this value indicates about the data.
Record x y
1 –2.85 10.18
2 –0.65 3.84
3 –0.11 11.63
4 –0.74 16.78
5 –2.89 4.03
6 –0.76 5.92
7 –1.62 13.81
8 0.53 8.67
9 –0.72 4.26
Obtain the covariance matrix for the following dataset composed by
three samples x , y and z .
Record   x    y    z
1 9 2 7
2 –1 0 0
3 2 15 –1
4 –11 16 –2
5 1 12 1
6 2 –4 –1
7 –12 14 –3
8 –5 2 6
9 7 17 1
The following sections will guide the reader on how to apply PCA on a
simple dataset.
print('\n'*2)
x y
–2.16 –3.64
4.28 0.37
6.91 5.02
4.76 0.95
6.44 3.02
–0.20 –0.80
2.95 4.19
0.46 1.20
3.45 5.61
2.62 2.89
Figure 12.1 shows a plot of the original data.
print('| x | y |')
for i in range(len(x)):
    print(f'| {xs[i]:.2f} | {ys[i]:.2f} |')
print('\n'*2)
x y
–5.11 –5.52
1.33 –1.51
3.96 3.14
1.81 –0.93
3.49 1.13
–3.15 –2.68
–0.00 2.31
–2.49 –0.68
0.50 3.73
–0.33 1.01
Following, calculate the covariance matrix. Its dimension is 2×2 because
the dataset has 2 dimensions.
cov_mat = np.cov(x,y)
print(cov_mat)
[[8.39489351 5.84121688]
 [5.84121688 8.07036548]]
As expected, the values outside of the main diagonal are positive, which
indicates that when x increases, y also increases, as can be seen in Figure
12.1.
Obtain the eigenvectors and eigenvalues of the covariance matrix. This
is one of the most important steps, since they provide relevant information
about the data.
eigenval, eigvec = np.linalg.eig(cov_mat)
print('eigenvec')
print(eigvec)
print('eigenval')
print(eigenval)
eigenvec
[[ 0.71685718 -0.69722004]
 [ 0.69722004  0.71685718]]
eigenval
[14.07609971  2.38915927]
Since the dataset is two-dimensional, there are two eigenvalues and
two eigenvectors, each one two-dimensional. To choose the principal (most
important) vector, one has to look at the eigenvalues. The highest eigenvalue
is the most important one. If the first eigenvalue is the highest, then the first
eigenvector is also the most relevant.
where X is the original 2-dimensional dataset and X_m are the mean values.
Naturally, if not all eigenvectors are used to recover the original data, then
the result is not exactly the same as the original data.
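A minimal sketch of this compression and recovery step, assuming the mean-centered data is stored in an (n, 2) array named centered, and that eigenval and eigvec come from the np.linalg.eig call above:

import numpy as np

# keep only the principal eigenvector (the one with the largest eigenvalue)
principal = eigvec[:, [np.argmax(eigenval)]]   # shape (2, 1)
compressed = centered.dot(principal)           # (n, 1): the 1D representation
recovered = compressed.dot(principal.T)        # back to 2D, an approximation
# adding the mean values back undoes the centering; the recovered points lie
# exactly on the principal axis, so they only approximate the original data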
Exercises
1. What is a covariance matrix?
2. What is the main step to compress the data in PCA?
13
Classification Using k-Nearest Neighbor
Algorithm
CONTENTS
13.1. Introduction .................................................................................. 234
13.2. Principles and Definition of KNN ................................................. 234
13.3. Algorithm ..................................................................................... 235
13.4. Example and Python Solution ....................................................... 237
13.1. INTRODUCTION
k-Nearest Neighbor (kNN) is a relatively simple algorithm, easy to implement
and still a powerful tool to understand and solve classification problems. In
this chapter the reader will develop an understanding of the principles used
to develop the kNN algorithm, when it can be used. The latter part of the
chapter shows a simple example and how to implement it with Python.
13.3. ALGORITHM
Initially, a set with known labels is presented in the kNN model, such as the
one shown in the Figure 13.1.
[Figure 13.1: dataset with known labels (dog, cat, pig), plotted as Width versus Height.]
Figure 13.2: Point with unknown class plotted with the kNN dataset.
13.4.1. Example
In the example above, it was illustrated graphically, without direct
calculation, how the kNN algorithm works. In this part, let's use the same
example, but applying the relatively simple mathematics behind kNN to
perform the classification. The dataset, as illustrated in Figure 13.1, is the
one shown in Table 13.1. This is an illustrative example, and the given height
and width values are to be considered processed values, i.e., not real-world
ones.
The distance to each known point is computed using the Euclidean
distance,
dist(i, j) = sqrt[ (x_i − x_j)² + (y_i − y_j)² ]
where i is the point already known and j is the point with unknown class.
Table 13.3 summarizes the calculated distances for all the dataset.
Table 13.3: classes of the known points, sorted by increasing distance to the unknown point: cat, dog, dog, pig, dog, dog, cat, cat, pig, pig, cat, pig, cat.
The above steps are the same regardless of the chosen k value. From here
on, however, one needs to set the value of k so the algorithm chooses the
number of neighbors that can vote. The class is attributed according to the
vote majority. As an example, consider that we choose k = 3; then only the
three neighbors with the least distance to the point can vote.
Class of i (the k = 3 nearest): cat, dog, dog.
Since the above set is constituted by one cat and two dogs, the class
attributed to the unknown point is "dog." Notice that one should be able to
validate the experimental data and double-check if the algorithm performs
reasonable classifications. This is because the point could easily be a cat,
since this class has the least distance (it would be classified as a cat if k = 1).
Another important point: though we have selected k = 3, it is usually
best practice not to choose even values, or multiples of the number of
classes, for k. That helps to avoid ties.
given in Table 13.1 as our training data, i.e., the dataset which will guide the
kNN to perform the classification.
Define the dataset as a list of tuples in Python. Each tuple contains the
height, width and class of each animal.
data = [(0.1, 0.1, 'cat'),
        (0.2, 0.5, 'cat'),
        (0.3, 0.2, 'cat'),
        (0.4, 0.3, 'cat'),
        (0.5, 0.6, 'cat'),
        (0.6, 0.1, 'pig'),
        (0.7, 0.2, 'pig'),
        (0.8, 0.5, 'pig'),
        (0.9, 0.3, 'pig'),
        (0.6, 0.9, 'dog'),
        (0.7, 0.5, 'dog'),
        (0.8, 0.6, 'dog'),
        (0.9, 0.7, 'dog')]
Separate the data into the input features (X_train) and the labels (y_train).
The input features are the first two elements of each tuple, and the label is
the last element.
X_train = [(x[0], x[1]) for x in data]
y_train = [x[2] for x in data]
Define the testing set, i.e., the point with the unknown label, as shown in
Table 13.2.
x_test = (0.6, 0.6)
import math
# Euclidean distances from the test point to every training point
distances = [math.sqrt((xi[0] - x_test[0])**2 + (xi[1] - x_test[1])**2)
             for xi in X_train]
# retrieve the order of the indexes that would sort the distances
sort_idx = sorted(range(len(distances)), key=lambda k: distances[k])
# labels of the k nearest neighbors
k = 3
votes = [y_train[i] for i in sort_idx[:k]]
count_votes = dict()
# initialize the majority votes to zero
best_vote = 0
winner = ''
for vote in votes:
    if vote not in count_votes:
        count_votes[vote] = 0
    count_votes[vote] += 1
    if count_votes[vote] > best_vote:  # check if the label has the majority and attribute it
        best_vote = count_votes[vote]
        winner = vote
ypred = winner  # set the final winner according to the majority of votes
print(ypred)  # dog
The final value of ypred for this example is "dog," which confirms that
the code is correct and the algorithm works properly. The following code
shows a compilation of the complete exercise above (except for the data)
using a functional programming approach.
import math

def sorted_distances(X_train, y_train, x_test):
    # the labels must be sorted together with the distances, hence the zip
    distances = [math.sqrt((xi[0] - x_test[0])**2 + (xi[1] - x_test[1])**2)
                 for xi in X_train]
    return [label for _, label in sorted(zip(distances, y_train))]

def predict_class(y_sorted, k):
    votes = y_sorted[:k]
    count_votes = dict()
    best_vote = 0
    winner = ''
    for vote in votes:
        if vote not in count_votes:
            count_votes[vote] = 0
        count_votes[vote] += 1
        if count_votes[vote] > best_vote:
            best_vote = count_votes[vote]
            winner = vote
    return winner

def kNN(X_train, y_train, x_test, k=3):
    y_sorted = sorted_distances(X_train, y_train, x_test)
    ypred = predict_class(y_sorted, k)
    return ypred

print('Function test')
print(kNN(X_train, y_train, x_test, k = 3))
# the classifier creation below is an assumed reconstruction; only the
# fit/predict lines survive in the original listing
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(features, labels)
print("How to do predictions")
print(knn.predict(features_test))
print("Target")
print(labels_test)
14
Introduction to KMeans Clustering
CONTENTS
14.1. How Kmeans Works? .................................................................... 246
14.2. Kmeans Algorithm ........................................................................ 247
In most of the examples shown along this book, the discussion focused
on supervised learning algorithms. In such cases, the model knows the labels
(correct values) that it should predict, and it tries to reproduce such values
to its best.
On the other side, there are unsupervised learning algorithms. They do
not use a set of known labels to "classify" the data. Rather, they use different
assumptions about the data to find patterns in it and create clusters, i.e.,
groups of data. KMeans is one such algorithm.
Step 1:
Initially, a number K of centroids c is placed in the n-dimensional space
containing the dataset. Ideally, the centroids should be widely spread over
the space, not dislocated or grouped together.
Then each data point is assigned to the centroid nearest to it. One way
of defining the distance is by using the Euclidean distance metric. Other
options are the Cosine and the Manhattan distances. Mathematically, each
point x is assigned to a cluster based on,
argmin_i dist(c_i, x)
where dist() is the Euclidean distance between the centroid c_i and the point
x. Let the set of samples attributed to cluster c_i be S_i.
Step 2:
The position of each centroid is updated by dislocating it to the central
position of S_i. This is done by taking the average value along each
dimension and attributing it to the centroid c_i as follows.
c_i = (1/|S_i|) Σ_{x ∈ S_i} x
Then go back to step 1.
The iteration is performed until a stopping criterion is met. This may be:
• data points no longer change clusters;
• a maximum number of iterations is reached.
A minimal sketch of the two-step loop is shown below.
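The following NumPy sketch implements the algorithm just described (a minimal reconstruction, not a library implementation; empty clusters are not handled):

import numpy as np

def kmeans(X, K, max_iter=100):
    rng = np.random.default_rng(0)
    # initialization: pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # stopping criterion: the centroids no longer move
        centroids = new_centroids
    return centroids, labels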
14.2.1. Choice of K
In many situations, the best way of choosing the number of centroids K is
by trial and error. Using this methodology, one evaluates the algorithm with
a variety of K values and compares the results to see which one makes the
most sense. Still, there are some techniques that may be used to estimate the
value of K.
A popular metric to choose the value of K is the mean distance between
data points and their cluster centroids. With 1 cluster the mean distance is the
greatest, and it decreases continuously until it reaches zero when the number
of centroids equals the number of points. This shows that this method
cannot be used in isolation. Rather, it is coupled with other heuristics such
as the elbow point, which marks where the rate of decrease of the mean
distance changes sharply. This point can be used to roughly estimate K.
Besides this one, other techniques for validating K exist, such as cross-
validation, the silhouette method, information criteria, the information
theoretic jump method and the G-means algorithm.
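A rough sketch of the elbow inspection using scikit-learn, assuming the 2D dataset is stored in an array X (inertia_ is the sum of squared distances of the samples to their closest centroid):

from sklearn.cluster import KMeans

inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(model.inertia_)
# plotting k against inertias, the "elbow" is the point where the
# decrease flattens out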
n_samples = 1500
random_state = 999
[Figure: scatter plots (x–y plane) of the generated dataset and the corresponding KMeans clustering results.]
15
Computing with TensorFlow: Introduction and Basics
CONTENTS
15.1. Installing Tensorflow Library ......................................................... 254
15.2. Tensors.......................................................................................... 255
15.3. Computational Graph and Session ................................................ 255
15.4. Operating With Matrices............................................................... 256
15.5. Variables ....................................................................................... 260
15.6. Placeholders ................................................................................. 260
15.7. Ways Of Creating Tensors ............................................................. 262
15.8. Summary ...................................................................................... 265
15.2. TENSORS
A tensor is the basic unit defined in TensorFlow. It is a generalization
of scalars, vectors and matrices. In this sense, a scalar is a tensor of rank zero
(0), while a vector is a tensor of rank 1. By analogy, it can be deduced
that a matrix is a tensor of rank 2. In general, a tensor is by definition an
n-dimensional array, where n is any integer value starting at 0.
Consider the following examples of tensors:
• 10 – a tensor of rank 0; its shape is [];
• [10, 20] – a tensor of rank 1; its shape is [2];
• [[10, 20, 30], [100, 200, 300]] – a tensor of rank 2; its shape is [2, 3];
• [[[10, 20, 30]], [[1, 2, 3]]] – a tensor of rank 3; its shape is [2, 1, 3].
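These ranks and shapes can be checked directly (a small sketch):

import tensorflow as tf

t0 = tf.constant(10)                               # rank 0, shape []
t1 = tf.constant([10, 20])                         # rank 1, shape [2]
t2 = tf.constant([[10, 20, 30], [100, 200, 300]])  # rank 2, shape [2, 3]
t3 = tf.constant([[[10, 20, 30]], [[1, 2, 3]]])    # rank 3, shape [2, 1, 3]
print(t0.shape, t1.shape, t2.shape, t3.shape)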
Consider, for instance, the following matrix multiplication:
[1 2 3]   [1 4 7]
[4 5 6] · [2 5 8]
[7 8 9]   [3 6 9]
It can be performed using TensorFlow module through the following
steps.
• Import TensorFlow module
import tensorflow as tf
• Define two edges of the graph by using tf.constant() instances.
matrix1 = tf.constant([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix2 = tf.constant([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
• Define the node tf.matmul() which will perform the matrix
multiplication.
matrix_prod = tf.matmul(matrix1, matrix2)
• Obtain the result by running the node in a TensorFlow session
with tf.Session() as sess:
result1 = sess.run(matrix_prod)
print(result1)
Gathering all together,
import tensorflow as tf
matrix1 = tf.constant([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix2 = tf.constant([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
matrix_prod = tf.matmul(matrix1, matrix2)
with tf.Session() as sess:
    result1 = sess.run(matrix_prod)
print(result1)
[[ 14 32 50]
[ 32 77 122]
[ 50 122 194]]
In the following example, the tf.add() node is used to perform the
addition of the matrices defined above.
import tensorflow as tf
matrix_sum = tf.add(matrix1, matrix2)
with tf.Session() as sess:
    result1 = sess.run(matrix_sum)
print(result1)
[[ 2 6 10]
[ 6 10 14]
[10 14 18]]
Nodes can also be connected sequentially. For instance, consider the
following problem, which consists in obtaining the solution of,
[1 2 3]   [1 4 7]   [ 10  20  30]
[4 5 6] · [2 5 8] + [-10 -20 -30]
[7 8 9]   [3 6 9]   [ 10  20  30]
matrix3 = tf.constant([[10, 20, 30], [-10, -20, -30], [10, 20, 30]])
matrix_prod = tf.matmul(matrix1, matrix2)
matrix_add = tf.add(matrix_prod, matrix3)
with tf.Session() as sess:
result1 = sess.run(matrix_add)
print(result1)
[[24 52 80]
[22 57 92]
[60 142 224]]
The dimension of the result of a TensorFlow operation may be different
from the dimension of the inputs. For example, the node
tf.matrix_determinant() can be used to calculate the determinant of a matrix
(a tensor of rank 2), generating a scalar (a tensor of rank 0), as shown in the
following example.
import tensorflow as tf
matrix1 = tf.constant([[1., 2., 3.], [-2., -5., -6.], [3., 2., 1.]])
matrix_det = tf.matrix_determinant(matrix1)
with tf.Session() as sess:
result1 = sess.run(matrix_det)
print(result1)
8.000002
TensorFlow contains all basic math operators. The following list shows
some of the most common operators, which may require one or more
arguments.
• tf.add(a, b)
• tf.subtract(a, b)
• tf.multiply(a, b)
• tf.div(a, b)
• tf.pow(a, b)
• tf.exp(a)
• tf.sqrt(a)
15.5. VARIABLES
A Variable in TensorFlow contains data which may change along processes.
Instead of being an edge (as constants are), variables are stored as nodes.
A way to create a variable is to use the get_variable method.
tf.get_variable(name, values, dtype, initializer)
Inputs:
- name (str) – name of the variable
- values (list/array) – dimensions of the tensor
- dtype – type of data (optional)
- initializer – how to initialize the tensor (optional)
When an initializer is set, there is no need of setting the values since
the dimensions are directly set from the initializer. In the following code, a
two dimensional variable is created. The default value of a variable is set at
random by TensorFlow.
# Create a variable
var = tf.get_variable("var", [1, 3])
print(var)
<tf.Variable 'var:0' shape=(1, 3) dtype=float32_ref>
In the example above, the variable var is a vector of one row and 3
columns. To fix the initial values of the variable to zero, one can use the
zeros_initializer function.
# Create an initializer to fix the variable values to zero
init = tf.zeros_initializer
# Create a variable
var = tf.get_variable("var", [1, 3], initializer = init)
print(var)
<tf.Variable 'var:0' shape=(1, 3) dtype=float32_ref>
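A variable only receives its value inside a session, after running an initializer; a short sketch (the variable name here is arbitrary):

import tensorflow as tf

init = tf.zeros_initializer
var_zero = tf.get_variable("var_zero", [1, 3], initializer = init)
with tf.Session() as sess:
    # variables must be explicitly initialized before they can be evaluated
    sess.run(tf.global_variables_initializer())
    print(sess.run(var_zero))  # [[0. 0. 0.]]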
15.6. PLACEHOLDERS
A placeholder is a node used to feed a tensor. Its purpose consists in
initializing data which will be fed into the tensor. The placeholder is fed
using the feed_dict argument when running a session. Therefore, a
placeholder can only be fed within a session.
The following syntax is used to create a placeholder.
tf.placeholder(dtype, shape = None, name = None)
Inputs:
- dtype – data type
- shape – dimension of the placeholder (optional)
- name – name of the placeholder (optional)
The following example shows the use of a placeholder to feed input data
into a linear equation.
x = tf.placeholder(float)
y = 2*x + 1
sess = tf.Session()
result = sess.run(y, feed_dict = {x: [0, 5, 10, 15, 20]})
print(result)
array([ 1., 11., 21., 31., 41.], dtype=float32)
A placeholder can hold tensors of any shape. For instance, a 2D tensor
can be fed in place of the 1D array shown previously.
x = tf.placeholder(float,[None,3])
y = 2*x + 1
data = [[1, 2, 3], [10, 20, 30]]
sess = tf.Session()
result = sess.run(y, feed_dict = {x:data})
array([[ 3., 5., 7.],
[21., 41., 61.]], dtype=float32)
The output is a 2x3 matrix, so the None used in the definition of the
placeholder, could be safely replaced by 2 in the code above, yielding the
same result.
x = tf.placeholder(float, [2, 3])
y = 2*x + 1
data = [[1, 2, 3], [10, 20, 30]]
sess = tf.Session()
result = sess.run(y, feed_dict = {x: data})
A tensor filled with zeros may be created using the tf.zeros(size) function.
x = tf.zeros([3, 2])
sess = tf.Session()
sess.run(x)
[[0. 0.]
[0. 0.]
[0. 0.]]
It may be a set of ones by using the tf.ones(size) function.
x = tf.ones([3, 2])
sess = tf.Session()
sess.run(x)
[[1. 1.]
[1. 1.]
[1. 1.]]
A tensor may be filled with a unique number through the tf.fill(size,value)
function, where value is the number (or object) used to fill each element of
the tensor.
x = tf.fill([3, 2], 5)
sess = tf.Session()
sess.run(x)
[[5 5]
[5 5]
[5 5]]
The method tf.diag(elements) creates a matrix with diagonal elements equal
to elements.
x = tf.diag([1, 2, 3])
sess = tf.Session()
sess.run(x)
[[1 0 0]
[0 2 0]
[0 0 3]]
To create a tensor consisting of a sequence of numbers, use the
tf.range(start, limit, delta) function, where start is the initial value, limit is
the last value (exclusive) and delta is the step size, i.e., the difference
between consecutive values.
x = tf.range(start = 3, limit = 30, delta = 3)
sess = tf.Session()
sess.run(x)
array([ 3, 6, 9, 12, 15, 18, 21, 24, 27], dtype=int32)
For a sequence of evenly spaced values, use tf.linspace(start,stop,n) where n
is the number of points to be added in the sequence.
h = tf.linspace(3., 30., 5)  # start and stop must be floats
sess = tf.Session()
sess.run(h)
array([ 3. , 9.75, 16.5, 23.25, 30. ], dtype=float32)
A random sequence of values drawn from the uniform distribution can
be constructed using tf.random_uniform(value, minval, maxval), where
minval is the minimum value of the distribution and maxval is the maximum
value. The dimension is given in value.
r = tf.random_uniform([3,2], minval = 0, maxval = 3)
sess = tf.Session()
sess.run(r)
array([[1.0970614, 1.5497539 ],
[2.7323077, 0.7938584 ],
[0.24447012, 0.46585643]], dtype=float32)
To obtain a sequence of normally distributed values, use
tf.random_normal(value, mean, stddev), where the dimension is given by
value, and the mean and the standard deviation are given by mean and
stddev, respectively.
r = tf.random_normal([3,2], mean = 2, stddev = 1)
sess = tf.Session()
sess.run(r)
array([[1.1963735, 1.0302888],
[1.3054788, 1.1521074],
[1.1329939, 2.3905213]], dtype=float32)
15.8. SUMMARY
The structure of TensorFlow library is based on three entities:
• Graph: Environment encapsulating operations (nodes) and
tensors (edges)
• Tensors: Data (or value) stored in the edge of the graph. It flows
along the graph passing through operations.
• Sessions: Environment that executes the operations.
Table 15.1 shows examples of how to create constant tensors.
Dimension Example
0 tf.constant(1, tf.int16)
1 tf.constant([1, 2, 3], tf.int16)
2 tf.constant([[1, 2, 3], [4, 5, 6]], tf.int16)
3 tf.constant([[[1, 2], [3, 4], [4, 5]]], tf.int16)
Operation   Example
a + b       tf.add(a, b)
a − b       tf.subtract(a, b)
a × b       tf.multiply(a, b)
a / b       tf.div(a, b)
a^b         tf.pow(a, b)
e^a         tf.exp(a)
√a          tf.sqrt(a)
Description Code
Create a session tf.Session()
Run session tf.Session().run()
Evaluate variable_name.eval()
Close session tf.Session().close()
Session in block with tf.Session() as sess:
16
TensorFlow: Activation Functions and
Optimization
CONTENTS
16.1. Activation Functions ..................................................................... 268
16.2. Loss Functions .............................................................................. 273
16.3. Optimizers.................................................................................... 274
16.4. Metrics ......................................................................................... 277
268 Fundamentals of Machine Learning using Python
In the last chapter, the basics of TensorFlow elements were presented, and
many examples showed how TensorFlow works. In this chapter, we will
discuss Activation Functions, what they are, and which types are already
implemented in TensorFlow. We will also discuss Loss Functions, Optimizers,
and Metrics. By the end of this chapter, all these concepts should be clear
to the reader.
In summary, the following concepts are discussed along this chapter:
• Activation functions;
• Loss functions;
• Optimizers;
• Metrics.
Otherwise,
f(x) = a · (e^x − 1)
where a is a tunable hyperparameter, constrained to a ≥ 0.
Leaky ReLU
A small modification of ReLU, allowing a small gradient when x < 0 , i.e.,
when the neuron is inactive.
f(x) = x       if x > 0
f(x) = 0.01x   otherwise
vec1 = tf.constant([-3., -1., 0., 1., 3.])
# alpha = 0.01 matches the 0.01x slope in the formula above
leaky = tf.nn.leaky_relu(vec1, alpha=0.01)
sess = tf.Session()
sess.run(leaky)
array([-0.03, -0.01, 0. , 1. , 3. ], dtype=float32)
Some of the loss functions available in TensorFlow (under tf.contrib.losses) are:
• tf.contrib.losses.compute_weighted_loss
• tf.contrib.losses.cosine_distance
• tf.contrib.losses.get_losses
• tf.contrib.losses.get_regularization_losses
• tf.contrib.losses.get_total_loss
• tf.contrib.losses.log_loss
• tf.contrib.losses.mean_pairwise_squared_error
• tf.contrib.losses.mean_squared_error
• tf.contrib.losses.sigmoid_cross_entropy
• tf.contrib.losses.softmax_cross_entropy
• tf.contrib.losses.sparse_softmax_cross_entropy
• tf.contrib.losses.log
16.3. OPTIMIZERS
Once the loss function is known, it is necessary to use an algorithm to find
the values of the model parameters which minimize it. This algorithm
is the optimizer. In most algorithms, an initial guess of the parameters is
given, and at each iteration the optimizer evaluates the direction in which
to change each parameter (increasing or decreasing) so as to reduce the loss
function. After many iterations, the values of the parameters stabilize around
the optimum values.
As with other deep learning frameworks, TensorFlow provides
a variety of optimizers that change the values of the parameters in a model
to minimize the loss function. The purpose of the optimizer is to
find the direction in which to change the parameters so as to get nearer to
the minimum. Choosing the best optimizer and the amount of change in the
variables is not straightforward and may rely on heuristic rules.
According to Manaswi (2018), adaptive techniques such as adadelta and
adagrad achieve faster convergence on complex neural networks.
In general, Adam is the best optimizer, which motivates it being the default
one in certain frameworks, such as Keras. This optimizer outperforms other
adaptive techniques, though it is computationally costly. When the dataset
is big and contains many sparse matrices, SGD, NAG and momentum may
not be the best options; in such cases, the adaptive learning rate methods are
the best choice.
Consider, as an example, the following system of linear equations:
3x1 + 2x2 − x3 = 1
2x1 − 2x2 + 4x3 = −2
−x1 + (1/2)x2 − x3 = 0
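As a sanity check before the TensorFlow approach, the system can be solved directly with NumPy (a sketch):

import numpy as np

A = np.array([[3, 2, -1], [2, -2, 4], [-1, 0.5, -1]])
b = np.array([1, -2, 0])
print(np.linalg.solve(A, b))  # [ 1. -2. -2.]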
The following code demonstrates how to solve this system using
TensorFlow and Adam optimizer.
First define the matrix A containing all the coefficients of the system of
equations above.
A = tf.constant([[3, 2, -1], [2, -2, 4], [-1, 0.5, -1]])
Then create the TensorFlow variable x which will be the one to be
determined through optimization.
x = tf.Variable(initial_value = tf.zeros([3, 1]), name = "x", dtype = tf.float32)
Define the matrix multiplication Ax by creating a node on the
TensorFlow graph which perform such operation.
model = tf.matmul(A, x)
Create the result vector, consisting of the values in the right-hand side of
the system of equations.
result = tf.constant([[1.], [-2.], [0.]])
Once the result is defined, implement the loss function, which calculates
the mean squared difference between the expected result and the one
obtained through the matrix multiplication.
loss = tf.losses.mean_squared_error(result, model)
Create an instance of the optimizer using Adam algorithm
(AdamOptimizer) with a learning rate of 1.5.
optimizer = tf.train.AdamOptimizer(1.5)
Tell the optimizer that the task is to minimize the loss variable.
train = optimizer.minimize(loss)
In the rest of the code, create the session and run it a couple of iterations
until convergence of x is observed. The complete code of the above example
is shown below.
import tensorflow as tf
A = tf.constant([[3, 2, -1], [2, -2, 4], [-1, 0.5, -1]])
x = tf.Variable(initial_value = tf.zeros([3, 1]), name = "x", dtype = tf.float32)
model = tf.matmul(A, x)
result = tf.constant([[1.], [-2.], [0.]])
loss = tf.losses.mean_squared_error(result, model)
optimizer = tf.train.AdamOptimizer(1.5)
train = optimizer.minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print('Initial value of x:', sess.run(x), ' loss: ', sess.run(loss))
for step in range(1000):
    sess.run(train)
    if step % 100 == 0:
        xval = sess.run(x).flatten()
        print('step ', step, 'x1: ', round(xval[0], 3), 'x2: ', round(xval[1], 3),
              'x3: ', round(xval[2], 3), ' loss: ', sess.run(loss))
step 0 x1: -1.5 x2: 1.5 x3: -1.5 loss: 38.354153
step 100 x1: 0.375 x2: -0.492 x3: -0.915 loss: 0.03166012
step 200 x1: 0.741 x2: -1.38 x3: -1.551 loss: 0.0053522396
step 300 x1: 0.923 x2: -1.816 x3: -1.867 loss: 0.00047268326
step 400 x1: 0.983 x2: -1.96 x3: -1.971 loss: 2.2427732e-05
step 500 x1: 0.997 x2: -1.994 x3: -1.995 loss: 5.682331e-07
step 600 x1: 1.0 x2: -1.999 x3: -1.999 loss: 7.4286013e-09
step 700 x1: 1.0 x2: -2.0 x3: -2.0 loss: 4.7748472e-11
step 800 x1: 1.0 x2: -2.0 x3: -2.0 loss: 7.058058e-13
step 900 x1: 1.0 x2: -2.0 x3: -2.0 loss: 1.3310834e-12
Some common optimizers in TensorFlow are (all from tf.train):
• GradientDescentOptimizer
• AdadeltaOptimizer
• AdagradOptimizer
• AdagradDAOptimizer
• MomentumOptimizer
• AdamOptimizer
• FtrlOptimizer
• ProximalGradientDescentOptimizer
• ProximalAdagradOptimizer
• RMSPropOptimizer
16.4. METRICS
Metrics provide a mathematical value to measure model accuracy.
TensorFlow contains different metrics, such as accuracy, logarithmic loss,
area under the curve (AUC), ROC, among others.
Some common metrics available in TensorFlow are (from tf.contrib.metrics):
• streaming_root_mean_squared_error
• streaming_covariance
• streaming_pearson_correlation
• streaming_mean_cosine_distance
• streaming_percentage_less
• streaming_false_negatives
17
Introduction to Natural Language
Processing
CONTENTS
17.1. Definition Of Natural Language Processing .................................. 280
17.2. Usage Of NLP............................................................................... 280
17.3. Obstacles In NLP .......................................................................... 281
17.4. Techniques Used In NLP ............................................................... 281
17.5. NLP Libraries ................................................................................ 282
17.6. Programming Exercise: Subject/Topic Extraction Using NLP .......... 282
17.7. Text Tokenize Using NLTK ............................................................. 286
17.8. Synonyms From Wordnet .............................................................. 288
17.9. Stemming Words With NLTK......................................................... 290
17.10. Lemmatization Using NLTK ........................................................ 291
import nltk
Step 1: Tokenize text
The topic of a Wikipedia article will be used to show how this algorithm
works. The one chosen is the article on Linux (https://en.wikipedia.org/wiki/Linux),
so it is expected that the algorithm will return as topic the word Linux, or at
least something relatable.
The module urllib.request is used to read the web page and obtain the pure
HTML content.
import urllib.request
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Linux')
html = response.read()
print(html)
print(html)
b’<!DOCTYPE html>\n<html class=”client-nojs” lang=”en” dir=”ltr”>\
n<head>\n<meta charset=”UTF–8”/>\n<title>Linux – Wikipedia</
title>\n...
Notice the b character at the beginning of the html output. It indicates
that this is a binary (bytes) string. To strip the HTML tags and perform a
general cleaning of the raw text, we use the BeautifulSoup library, as shown
in the code below.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip = True)
print(text[:100])
Linux – Wikipediadocument.documentElement.className=document.
documentElement.className.replace(/(^|\
Convert the text into tokens by using the split() method in Python, which
splits a string at whitespace. Notice that up to this point nltk is not being
used.
tokens = [t for t in text.split()]
print(tokens)
the 465
of 269
and 219
Linux 197
on 193
to 166
a 159
for 113
original 110
in 100
is 99
with 71
as 66
operating 57
from 57
system 52
software 51
that 51
distributions 50
also 48
Notice that, understandably, words which are not very meaningful but
important to create links in sentences are the ones with the highest frequencies.
It would be hard to extract the topic from the above table due to the presence
of many so-called "stop words," i.e., words that are important to create links
and bring sense, but not relevant to extracting the topic of the text. Fortunately,
the nltk library has a functionality to remove such stop words. To use it,
first it is necessary to download the set of stop words for the language of
interest (English in this case).
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords.words('english')[:10]
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
The list above shows only the first 10 stop words stored in nltk. We use this
functionality to remove such words from the original text, so they won't
appear in the frequency count table.
clean_tokens = tokens[:]
for token in tokens:
    if token in stopwords.words('english'):
        clean_tokens.remove(token)
With the new list of tokens without stop words (clean_tokens), build the
frequency table again and visualize the words which appear with the highest
frequency in the text.
import pandas as pd
freq = nltk.FreqDist(clean_tokens)
df = pd.DataFrame.from_dict(freq, orient='index')
df.sort_values(by=0, ascending=False).head(20)
Linux 197
original 110
operating 57
system 52
software 51
distributions 50
also 48
fromthe 42
Archived 41
originalon 40
use 32
used 31
kernel 29
July 27
RetrievedJune 26
desktop 25
GNU 24
December 24
distribution 24
September 23
Notice now how the word "Linux" appears in first place. Also, the
next words are "original," "operating" and "system," which tells us
something about what the article is about. That is a very simple, if not the
simplest, NLP algorithm one can use.
from nltk.tokenize import sent_tokenize
mytext = "Good morning, Alex, how are you? I hope everything is well. Tomorrow will be a nice day, see you buddy."
print(sent_tokenize(mytext))
['Good morning, Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day, see you buddy.']
To use this algorithm, it is necessary to download the
PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module.
This is done through the command nltk.download('punkt').
The whole text, as a single string, is divided into sentences by nltk's
recognition of the dots. Notice how the following text, which incorporates
the title "Mr." (with a dot), is still correctly divided.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."
print(sent_tokenize(mytext))
['Good morning, Mr. Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day.']
Similarly, words can be tokenized by using the word_tokenize
functionality from nltk library. Notice the result as this is applied to the
sentence presented above.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."
print(word_tokenize(mytext))
['Good', 'morning', ',', 'Mr.', 'Alex', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'well', '.', 'Tomorrow', 'will', 'be', 'a', 'nice', 'day', '.']
Notice how this algorithm recognizes that the word “Mr.” contains the
dot at the end, thus not removing it.
NLTK library does not work only with English natural language. In
the following example, sentence tokenizer is used to split a sentence in
Portuguese.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
mytext = "Bom dia, Sr. Alex, como o senhor está? Espero que esteja bem. Amanhã será um ótimo dia."
print(sent_tokenize(mytext))
['Bom dia, Sr. Alex, como o senhor está?', 'Espero que esteja bem.', 'Amanhã será um ótimo dia.']
NLTK is able to automatically recognize the language used in the
example above. This is evident since it does not split at the word "Sr."
(the Portuguese equivalent of Mr.).
from nltk.corpus import wordnet
syn = wordnet.synsets("patient")
print(syn[0].definition())
print(syn[0].examples())
The output of the above code is,
a person who requires medical care
[‘the number of emergency patients has grown rapidly’]
To obtain the synonyms of a word using WordNet, one can use the
synsets(word) function from the wordnet module and collect each
lemma, as shown in the code below.
from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('car'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
['car', 'auto', 'automobile', 'machine', 'motorcar', 'car', 'railcar', 'railway_car', 'railroad_car', 'car', 'gondola', 'car', 'elevator_car', 'cable_car', 'car']
In a similar way, antonyms can be retrieved with a slight modification
of the code above.
from nltk.corpus import wordnet
antonyms = []
# the queried word is assumed to be 'beautiful', consistent with the output below
for syn in wordnet.synsets('beautiful'):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
['ugly']
from nltk.stem import PorterStemmer
print(PorterStemmer().stem('building'))
The result is,
build
Another stemming algorithm worth mentioning is the Lancaster stemming
algorithm. The two algorithms produce slightly different results for some
words.
NLTK also supports stemming of languages other than English, using
the SnowballStemmer class. The supported languages can be visualized by
checking the SnowballStemmer.languages property.
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
(‘arabic’, ‘danish’, ‘dutch’, ‘english’, ‘finnish’, ‘french’, ‘german’,
‘hungarian’, ‘italian’, ‘norwegian’, ‘porter’, ‘portuguese’, ‘romanian’,
‘russian’, ‘spanish’, ‘swedish’)
The code below shows an example usage of SnowballStemmer to stem a word
from a non-English language, in this case Portuguese.
from nltk.stem import SnowballStemmer
portuguese_stemmer = SnowballStemmer('portuguese')
print(portuguese_stemmer.stem("trabalhando"))
trabalh
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
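A short usage sketch (the words below are illustrative choices, not from the original text):

print(lemmatizer.lemmatize('cats'))              # cat (default part of speech: noun)
print(lemmatizer.lemmatize('playing', pos='v'))  # play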
18
Project: Recognize Handwritten Digits Using Neural Networks
CONTENTS
18.1. Introduction .................................................................................. 294
18.2. Project Setup ................................................................................ 294
18.3. The Data ....................................................................................... 294
18.4. The Algorithm ............................................................................... 294
18.1. INTRODUCTION
Developments in the field of machine learning have allowed models to
reach capabilities that are commonly attributed only to living beings or
even only to humans. One such capability is object recognition. Though still
inferior to human vision, machines are able to recognize objects, faces, text,
and even emotions through neural network frameworks.
In this chapter, we implement a simple digit recognition algorithm using
the TensorFlow library and the Python programming language. The algorithm
is able to recognize hand-drawn digits from 0 (zero) to 9 (nine) through
classification of the hand-written digit.
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Also, the values stored in the input matrices range from 0 to 255, representing
color intensity. For better performance of the neural network, it is necessary
to normalize these values so they fit in the range 0–1.
The following code performs the matrix flattening (matrix to vector) and
the normalization of the values in the input matrices.
# Matrix -> vector transformation
train_images = train_images.reshape(train_images.shape[0], -1)
# Matrix normalization
train_images = (train_images - np.min(train_images))/(np.max(train_images) - np.min(train_images))
Figure 18.1: General neural network structure for MNIST dataset problem.
To transform the labels stored in train_labels and test_labels into
probabilities, one can one-hot encode them, as is done in the following code.
import numpy as np
train_labels_encode = np.zeros((train_labels.shape[0], len(set(train_labels))))
train_labels_encode[np.arange(train_labels.shape[0]), train_labels] = 1
test_labels_encode = np.zeros((test_labels.shape[0], len(set(test_labels))))
test_labels_encode[np.arange(test_labels.shape[0]),test_labels] = 1
Define some constant to build the neural network.
n_input = 784 # input layer (28x28 pixels)
n_hidden1 = 128 # 1st hidden layer
n_hidden2 = 64 # 2nd hidden layer
n_hidden3 = 32 # 3rd hidden layer
n_output = 10 # output layer (0–9 digits)
The first hidden layer contains 128 neurons, with subsequent layers
containing 64, 32, and 10 neurons respectively. Notice that, from the above
values, only n_input and n_output are fixed for this problem. The others can
be “freely” modified, though excessive neurons may cause overfitting and
too few neurons may lead to underfitting.
Besides the structure of the network, define the parameters for the
training process. These are the learning rate, the number of iterations, the
batch size, and dropout value.
learning_rate = 1e-3
n_iterations = 1000
batch_size = 128
dropout = 0.5
The structure of the neural network is built using placeholders for the input
and output vectors, and variables for the weights and biases.
X = tf.placeholder("float", [None, n_input])
Y = tf.placeholder("float", [None, n_output])
keep_prob = tf.placeholder(tf.float32)
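The construction of the layers themselves is not shown in this listing. The sketch below is one plausible reconstruction (all names and initialization choices here are assumptions) producing the output_layer, cross_entropy and train_step nodes that the code further below relies on:

w1 = tf.Variable(tf.truncated_normal([n_input, n_hidden1], stddev=0.1))
b1 = tf.Variable(tf.constant(0.1, shape=[n_hidden1]))
w2 = tf.Variable(tf.truncated_normal([n_hidden1, n_hidden2], stddev=0.1))
b2 = tf.Variable(tf.constant(0.1, shape=[n_hidden2]))
w3 = tf.Variable(tf.truncated_normal([n_hidden2, n_hidden3], stddev=0.1))
b3 = tf.Variable(tf.constant(0.1, shape=[n_hidden3]))
w_out = tf.Variable(tf.truncated_normal([n_hidden3, n_output], stddev=0.1))
b_out = tf.Variable(tf.constant(0.1, shape=[n_output]))
layer1 = tf.nn.relu(tf.add(tf.matmul(X, w1), b1))
layer2 = tf.nn.relu(tf.add(tf.matmul(layer1, w2), b2))
layer3 = tf.nn.relu(tf.add(tf.matmul(layer2, w3), b3))
layer3_drop = tf.nn.dropout(layer3, keep_prob)  # dropout applied via keep_prob
output_layer = tf.add(tf.matmul(layer3_drop, w_out), b_out)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=output_layer))
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy)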
At each training step, the network should evaluate how "good" it is. This
value also helps to decide when the training should be stopped due to
convergence of the performance. This is done by evaluating the accuracy, as
shown in the following code.
shown in the following code.
correct_pred = tf.equal(tf.argmax(output_layer, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
With everything configured, initialize the session and perform the training
of the neural network. Due to the number of data points, at each iteration a
subset of the data is chosen and one optimization step is performed. After the
modification of the parameters, check the accuracy and print it to the screen
to observe how the training process evolves.
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
complete_idx = np.arange(train_images.shape[0])
# train on mini batches
for i in range(n_iterations):
    # mini-batch sampling (reconstructed step)
    batch_idx = np.random.choice(complete_idx, size=batch_size, replace=False)
    batch_x = train_images[batch_idx]
    batch_y = train_labels_encode[batch_idx]
    sess.run(train_step, feed_dict={
        X: batch_x, Y: batch_y, keep_prob: dropout
    })
    if i % 100 == 0:
        minibatch_loss, minibatch_accuracy = sess.run(
            [cross_entropy, accuracy],
            feed_dict={X: batch_x, Y: batch_y, keep_prob: 1.0})
        print(
            'Iteration',
            str(i),
            '\t| Loss =',
            str(minibatch_loss),
            '\t| Accuracy =',
            str(minibatch_accuracy)
        )
Iteration 0 | Loss = 2.3467183 | Accuracy = 0.109375
Iteration 100 | Loss = 1.8820465 | Accuracy = 0.375
Iteration 200 | Loss = 1.2935321 | Accuracy = 0.671875
Iteration 300 | Loss = 1.0126756 | Accuracy = 0.6953125
Iteration 400 | Loss = 0.7320422 | Accuracy = 0.84375
Iteration 500 | Loss = 0.69052655 | Accuracy = 0.859375
Iteration 600 | Loss = 0.5260591 | Accuracy = 0.8671875
Iteration 700 | Loss = 0.3844887 | Accuracy = 0.9296875
Iteration 800 | Loss = 0.4345752 | Accuracy = 0.90625
Iteration 900 | Loss = 0.46503332 | Accuracy = 0.890625
The accuracy should increase with optimization progress, as it can be
seen above. A bad choice of the learning rate may cause it to oscillate and
even diverge. The loss decreases, with perfect loss being equal to 0, while
perfect accuracy is 1.
With the trained model, one can evaluate its accuracy on the test
dataset.
test_accuracy = sess.run(accuracy, feed_dict={X: test_images.reshape(-1, n_input), Y: test_labels_encode, keep_prob: 1.0})
print("\nAccuracy on test set:", test_accuracy)
0.9284
Notice that an accuracy of almost 93% is achieved using the
configuration above. State-of-the-art algorithms have already
achieved approximately 98% accuracy for this dataset. Still, being able to
correctly predict 9 out of 10 images can be considered a high performance,
especially given the relatively simple neural network structure used in this
example.
Appendix A
428 316
962 730
686 617
506 492
518 518
527 423
1070 746
1113 777
902 737
685 540
315 290
558 478
755 512
744 538
817 582
844 541
725 625
499 361
284 374
272 365
1199 705
950 740
483 480
1146 752
223 324
636 420
284 308
845 605
499 437
1049 704
1191 871
964 671
1150 776
1138 813
416 331
513 372
889 613
243 410
457 426
434 411
406 423
654 433
971 721
571 488
1168 709
1076 739
721 638
278 386
962 586
563 395
216 275
647 576
839 608
1145 809
1026 754
743 502
502 357
252 287
916 696
214 350
692 518
248 332
462 372
972 630
799 631
1035 738
636 482
1159 748
481 369
435 –
319 450
1139 821
1190 707
867 682
222 387
572 562
278 290
751 608
580 519
# lists for the x and y vectors ('file' is the data file object opened
# earlier in the full listing)
x, y = [], []
# loop over the file, filling the lists with x and y values
for line in file:
    row = line.split(',')
    x.append(float(row[0]))
    y.append(float(row[1]))
def average(x):
    return sum(x) / len(x)

avg_x = average(x)
avg_y = average(y)

# least-squares estimates (numerator and denominator of the slope)
a1_num = sum((xi - avg_x)*(yi - avg_y) for xi, yi in zip(x, y))
a1_den = sum((xi - avg_x)**2 for xi in x)
a1 = a1_num/a1_den
a0 = avg_y - a1*avg_x
print('Parameter a0 = ', a0)
print('Parameter a1 = ', a1)
'''
Determine R2 (coefficient of determination).
A value between 0 and 1, where 1 means a perfect model, and
0 is a model with low accuracy.
'''
R2 = 1 - sqres/sqtot
'''
The final cost value is the sum of squared residues, obtained in
the variable sqres
'''
J = sqres
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
def computeCost(X, y, a):
    J = 1/(2*len(y))*np.sum((X.dot(a) - y)**2)
    return J
#%% ---------Part 3: Gradient Descent---------------------
def gradientDescent(X,y,a,alpha,MAX_ITER,verbose=False):
    '''
    gradientDescent(X,y,theta,alpha,MAX_ITER)
    Compute the optimum a values (parameters) using the GD algorithm.
    Inputs: see the multivariate listing further below for the full description.
    '''
    iter_ = 0
    J_history = []
    while iter_ < MAX_ITER:
        # partial derivatives of the cost (reconstructed to mirror the
        # multivariate listing below)
        dJda0 = np.sum(X.dot(a) - y)
        dJda1 = np.sum((X.dot(a) - y)*X[:,1])
        # update the parameters
        a[0] = a[0] - alpha*dJda0
        a[1] = a[1] - alpha*dJda1
        iter_ += 1
        J_history.append(computeCost(X,y,a))  # store the cost at each iteration
    return a,J_history
# GD parameters
MAX_ITER = 1500
ALPHA = 0.0002
# run GD
a, J_history = gradientDescent(X,y,a,ALPHA,MAX_ITER,True)
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
plt.show()
def computeCost(X,y,a):
    J = 1/(2*len(y))*np.sum((X.dot(a) - y)**2)
    return J
#%% ---------Part 3: Gradient Descent---------------------
def gradientDescent(X,y,a,alpha,MAX_ITER,verbose=False):
    '''
    gradientDescent(X,y,theta,alpha,MAX_ITER)
    Compute the optimum a values (parameters) using the GD algorithm.
    Inputs:
        X – input vector (1D numpy array)
        y – output vector (1D numpy array)
        a – initial guess of parameters (2x1 numpy array)
        alpha – learning rate (float)
        MAX_ITER – maximum number of iterations (int)
        verbose – True to print the iteration process, False otherwise
    Returns:
        a – optimum parameters
    '''
    iter_ = 0  # initial iteration step
    J_history = []  # a list to store the cost at each iteration
    J_history.append(computeCost(X,y,a))  # store the initial cost
    J_old = computeCost(X,y,a)*1000
    while iter_ < MAX_ITER and abs(J_old - J_history[-1]) > 1e-5:
        dJda = np.zeros((X.shape[1],))
        for i in range(X.shape[1]):
            dJda[i] = np.sum((X.dot(a) - y)*X[:,i])
        # update the parameters
        a -= alpha*dJda
        iter_ += 1
        J_old = J_history[-1]
        J_history.append(computeCost(X,y,a))  # store the cost at each iteration
    return a,J_history
# GD parameters
MAX_ITER = 1500
ALPHA = 0.002
# feature normalization
X[:,1:] = (X[:,1:] - np.min(X[:,1:], axis=0))/(np.max(X[:,1:], axis=0) - np.min(X[:,1:], axis=0))
# run GD
a, J_history = gradientDescent(X,y,a,ALPHA,MAX_ITER,True)
print('Final parameters')
print('a = ', a)
a = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print('Parameters obtained using Normal Equation')
print('a = ', a)
class LogisticRegression:
    def __init__(self, learning_rate=0.02, max_iterations=1000, fit_intercept=True):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.fit_intercept = fit_intercept

    def __add_intercept(self, X):
        # prepend a column of ones for the intercept term (reconstructed helper)
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        # logistic function (reconstructed helper)
        return 1/(1 + np.exp(-z))

    def fit(self, X, y, verbose=False):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        # weights initialization
        self.theta = np.zeros(X.shape[1])
        i = 0
        while i < self.max_iterations:
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            gradient = np.dot(X.T, (h - y))/y.size
            self.theta -= self.learning_rate*gradient
            if verbose and i % 100 == 0:
                print('iteration', i)
            i += 1
if __name__ == '__main__':
    # hours of study
    X = np.array([0.45, 0.75, 1.00, 1.25, 1.75, 1.50, 1.30, 2.00, 2.30, 2.50,
                  2.80, 3.00, 3.30, 3.50, 4.00, 4.30, 4.50, 4.80, 5.00,
                  6.00]).reshape(-1, 1)
start = time()
model.fit(X, y, verbose = True)
end = time()
print('time taken = ' + str(round(end - start, 2)) + ' s')
plt.ylabel('Result')
plt.title('Plot of dataset')
plt.legend()
pdf.savefig()
plt.show()
A
Absolute minimum 85
Accuracy metrics 146
Activation function 174, 175, 176, 178, 268
Activation Function 267, 268
Activation potential 268
Activity Segmentation 246
Algebraic manipulation 157
Algorithm 58, 60
Algorithm implementation 105
Algorithm progress 142
Algorithm recognizes 288
Algorithms apply grammatical 281
Application Programming Interface (APIs) 254
Appropriate value 94, 106
Approximate minimum 84, 85
Approximation 82, 83, 91
Artificial Intelligence 4
Assumption 156
Automatic hyperparameter 5
Automatic memory management system 3
Auxiliary variable 41
Avoid confusion 156

C
Calibration 134, 185
Client request submission 54
Coefficient 157, 159, 170
Column 132, 133, 134
Column vector 115, 128
Community integrates 5
Complex cancer analysis 70
Complexity 174
Complex matrix operation 114
Comprehension 89, 91