Statistics and Machine Learning in Python
Release 0.5
Contents

1 Introduction
  1.1 Python ecosystem for data-science
  1.2 Introduction to Machine Learning
  1.3 Data analysis methodology
2 Python language
  2.1 Import libraries
  2.2 Basic operations
  2.3 Data types
  2.4 Execution control statements
  2.5 List comprehensions, iterators, etc.
  2.6 Functions
  2.7 Regular expression
  2.8 System programming
  2.9 Scripts and argument parsing
  2.10 Networking
  2.11 Modules and packages
  2.12 Object Oriented Programming (OOP)
  2.13 Style guide for Python programming
  2.14 Documenting
  2.15 Exercises
3 Scientific Python
  3.1 Numpy: arrays and matrices
  3.2 Pandas: data manipulation
  3.3 Data visualization: matplotlib & seaborn
4 Statistics
  4.1 Univariate statistics
  4.2 Lab: Brain volumes study
  4.3 Linear Mixed Models
  4.4 Multivariate statistics
  4.5 Time series in python
  5.5 Linear models for classification problems
  5.6 Non-linear models
  5.7 Resampling methods
  5.8 Ensemble learning: bagging, boosting and stacking
  5.9 Gradient descent
  5.10 Lab: Faces recognition using various learning models
CHAPTER
ONE
INTRODUCTION
Important links:
• Web page
• Github
• Latest pdf
• Official deposit for citation.
This document describes statistics and machine learning in Python using:
• Scikit-learn for machine learning.
• Pytorch for deep learning.
• Statsmodels for statistics.
1.1.2 Anaconda
Anaconda is a Python distribution that ships with most of the common Python tools and libraries.
Installation
1. Download anaconda (Python 3.x) http://continuum.io/downloads
2. Install it, on Linux
bash Anaconda3-2.4.1-Linux-x86_64.sh
export PATH="${HOME}/anaconda3/bin:$PATH"
Additional packages can be installed with conda; for instance, the Qt back-end had to be installed at some point to fix a temporary issue when running Spyder. List the installed packages with:

conda list
Environments
• A conda environment is a directory that contains a specific collection of conda packages
that you have installed.
• Control the packages of an environment for a specific purpose: collaborating with someone else, delivering an application to a client, etc.
• Switch between environments
List all environments:

conda info --envs
1. Create new environment
2. Activate
3. Install new package
Miniconda
Anaconda without the collection of (>700) packages. With Miniconda you download only the
packages you want with the conda command: conda install PACKAGENAME
1. Download Miniconda (Python 3.x): https://conda.io/miniconda.html
2. Install it, on Linux
bash Miniconda3-latest-Linux-x86_64.sh
export PATH=${HOME}/miniconda3/bin:$PATH
1.1.3 Commands
python: the Python interpreter. On the DOS/Unix command line, execute a whole file with:
python file.py
Interactive mode:
python
ipython
1.1.4 Libraries
scipy.org: https://www.scipy.org/docs.html
Numpy: basic numerical operations, matrix operations, plus some basic solvers:
import numpy as np
X = np.array([[1, 2], [3, 4]])
#v = np.array([1, 2]).reshape((2, 1))
v = np.array([1, 2])
np.dot(X, v) # no broadcasting
X * v # broadcasting
np.dot(v, X)
X - X.mean(axis=0)
import scipy
import scipy.linalg
scipy.linalg.svd(X, full_matrices=False)
Matplotlib: visualization:
import numpy as np
import matplotlib.pyplot as plt
#%matplotlib qt
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()
• Linear model.
• Non parametric statistics.
• Linear algebra: matrix operations, inversion, eigenvalues.
CHAPTER
TWO
PYTHON LANGUAGE
# import a function
from math import sqrt
sqrt(25) # no longer have to reference the module
# define an alias
import numpy as np
# Numbers
10 + 4 # add (returns 14)
10 - 4 # subtract (returns 6)
10 * 4 # multiply (returns 40)
10 ** 4 # exponent (returns 10000)
10 / 4 # divide (returns 2.5: true division in Python 3; use 10 // 4 for integer division)
10 / float(4) # divide (returns 2.5)
5 % 4 # modulo (returns 1) - also known as the remainder
# Boolean operations
# comparisons (these return True)
5 > 3
5 >= 3
5 != 3
5 == 5
Out:
True
Out:
True
2.3.1 Lists
A list is an ordered sequence of objects. Lists are ordered, iterable, mutable (adding or removing objects changes the list size), and can contain multiple data types.
# create a list
simpsons = ['homer', 'marge', 'bart']
# examine a list
simpsons[0] # print element 0 ('homer')
len(simpsons) # returns the length (3)
# sort a list in place (modifies but does not return the list)
simpsons.sort()
simpsons.sort(reverse=True) # sort in reverse
simpsons.sort(key=len) # sort by a key
# return a sorted list (but does not modify the original list)
sorted(simpsons)
sorted(simpsons, reverse=True)
sorted(simpsons, key=len)
# examine objects
id(num) == id(same_num) # returns True
id(num) == id(new_num) # returns False
num is same_num # returns True
num is new_num # returns False
num == same_num # returns True
num == new_num # returns True (their contents are equivalent)
# concatenate with +, replicate with *
[1, 2, 3] + [4, 5, 6]
["a"] * 2 + ["b"] * 3
Out:
2.3.2 Tuples
Like lists, but their size cannot change: ordered, iterable, immutable, can contain multiple data
types
# create a tuple
digits = (0, 1, 'two') # create a tuple directly
digits = tuple([0, 1, 'two']) # create a tuple from a list
zero = (0,) # trailing comma is required to indicate it's a tuple
# examine a tuple
digits[2] # returns 'two'
len(digits) # returns 3
digits.count(0) # counts the number of instances of that value (1)
digits.index(1) # returns the index of the first instance of that value (1)
# concatenate tuples
digits = digits + (3, 4)
# create a single tuple with elements repeated (also works with lists)
(3, 4) * 2 # returns (3, 4, 3, 4)
# tuple unpacking
bart = ('male', 10, 'simpson') # create a tuple
2.3.3 Strings
# create a string
s = str(42) # convert another data type into a string
s = 'I like you'
# examine a string
s[0] # returns 'I'
len(s) # returns 10
# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4 # returns 'The meaning of life is 42'
s3 + ' ' + str(42) # same thing
# string formatting
# more examples: http://mkaz.com/2012/10/10/python-string-format/
'pi is {:.2f}'.format(3.14159) # returns 'pi is 3.14'
Out:
'pi is 3.14'
Out:
first line
second line
Sequences of bytes are not strings; they should be decoded before some operations:
print(s.decode('utf-8').split())
Out:
2.3.5 Dictionaries
Dictionaries are structures which can contain multiple data types, organized as key-value pairs: for each (unique) key, the dictionary outputs one value. Keys can be strings, numbers, or tuples, while the corresponding values can be any Python object. Dictionaries are iterable and mutable (and, since Python 3.7, preserve insertion order).
# examine a dictionary
Out:
Error 'grandma'
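Most of the dictionary examples are not reproduced in this extract; a minimal sketch, using a hypothetical family dictionary consistent with the outputs shown in this chapter:

# create a dictionary (hypothetical example)
family = {'dad': 'homer', 'mom': 'marge', 'size': 6}

# examine it
family['dad']       # returns 'homer'
len(family)         # returns 3
'mom' in family     # returns True

# accessing a missing key raises a KeyError
try:
    family['grandma']
except KeyError as e:
    print("Error", e)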
2.3.6 Sets
Like dictionaries, but with unique keys only (no corresponding values). They are unordered, iterable and mutable, and can contain multiple data types, made up of unique elements (strings, numbers, or tuples).
# create a set
languages = {'python', 'r', 'java'} # create a set directly
snakes = set(['cobra', 'viper', 'python']) # create a set from a list
# examine a set
len(languages) # returns 3
'python' in languages # returns True
# set operations
languages & snakes # returns intersection: {'python'}
languages | snakes # returns union: {'cobra', 'r', 'java', 'viper', 'python'}
try:
    languages.remove('c') # try to remove a non-existing element (throws an error)
except KeyError as e:
    print("Error", e)
Out:
Error 'c'
[0, 1, 2, 9]
2.3.7 Iterators
Cartesian product
import itertools
Out:
[['a', 1], ['a', 2], ['b', 1], ['b', 2], ['c', 1], ['c', 2]]
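The call producing the output above is not shown in this extract; a minimal sketch that yields the same result (the exact original expression is an assumption):

# Cartesian product of two iterables, converted to lists
print([list(pair) for pair in itertools.product(['a', 'b', 'c'], [1, 2])])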
x = 3
# if statement
if x > 0:
print('positive')
# if/else statement
if x > 0:
print('positive')
else:
print('zero or negative')
# if/elif/else statement
if x > 0:
print('positive')
elif x == 0:
print('zero')
else:
print('negative')
Out:
positive
positive
positive
positive
2.4.2 Loops
Loops are a set of instructions which repeat until termination conditions are met. This can include iterating through all values in an object, going through a range of values, etc.
# for loop
fruits = ['apple', 'banana', 'cherry']
for i in range(len(fruits)):
print(fruits[i].upper())
# use range when iterating over a large sequence to avoid actually creating the integer list in memory
v = 0
for i in range(10 ** 6):
v += 1
Out:
APPLE
BANANA
CHERRY
APPLE
BANANA
CHERRY
List comprehensions and related constructs process whole collections without explicitly writing loops. For more: http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html
# set comprehension
fruits = ['apple', 'banana', 'cherry']
unique_lengths = {len(fruit) for fruit in fruits} # {5, 6}
# dictionary comprehension
fruit_lengths = {fruit: len(fruit) for fruit in fruits} # {'apple': 5, 'banana': 6, 'cherry': 6}
Out:
quote = """Tick-tow
our incomes are like our shoes; if too small they gall and pinch us
but if too large they cause us to stumble and to trip
"""
# use enumerate if you need to access the index value within the loop
for index, fruit in enumerate(fruits):
print(index, fruit)
# for/else loop
for fruit in fruits:
if fruit == 'banana':
print("Found the banana!")
break # exit the loop and skip the 'else' block
else:
# this block executes ONLY if the for loop completes without hitting
# 'break'
print("Can't find the banana")
# while loop
count = 0
while count < 5:
print("This will print 5 times")
count += 1 # equivalent to 'count = count + 1'
Out:
dad homer
mom marge
size 6
dct = dict(a=1, b=2)  # example dictionary (definition assumed, not shown in this extract)
key = 'c'
try:
    dct[key]
except KeyError:
    print("Key %s is missing. Add it with empty value" % key)
    dct['c'] = []
print(dct)
Out:
2.6 Functions
Functions are sets of instructions launched when called upon; they can have multiple input values and a return value.
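The definition of add is not shown in this extract; a minimal sketch consistent with the calls below (it works for numbers as well as strings):

def add(a, b):
    return a + b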
add(2, 3)
add("deux", "trois")
# default arguments
def power_this(x, power=2):
return x ** power
power_this(2) # 4
power_this(2, 3) # 8
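# The helper used below is assumed (its definition is not in this extract):
# it returns the minimum and the maximum of a sequence as a tuple.
def min_max(nums):
    return min(nums), max(nums)

nums = [1, 2, 3]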
# return values can be assigned into multiple variables using tuple unpacking
min_num, max_num = min_max(nums) # min_num = 1, max_num = 3
Out:
this is text
3
3
import re
Out:
Method/Attribute   Purpose
match(string)      Determine if the RE matches at the beginning of the string.
search(string)     Scan through a string, looking for any location where this RE matches.
findall(string)    Find all substrings where the RE matches, and return them as a list.
finditer(string)   Find all substrings where the RE matches, and return them as an iterator.
regex.sub("SUB-", "toto")
Out:
'toto'
Out:
'helloworld'
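The compiled pattern used above is not shown in this extract; a minimal sketch of typical re usage (the pattern and strings below are assumptions, not the original example):

# compile a pattern once, then reuse it
regex = re.compile("[0-9]+")          # one or more digits
regex.findall("abc 123 def 456")      # ['123', '456']
regex.sub("NUM", "abc 123 def 456")   # 'abc NUM def NUM'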
import os
Out:
/home/ed203246/git/pystatsml/python_lang
Temporary directory
import tempfile
tmpdir = tempfile.gettempdir()
Join paths
Create a directory
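The corresponding calls are missing from this extract; a minimal sketch, assuming the mytmpdir name used just below:

# join path components in a portable way
mytmpdir = os.path.join(tmpdir, "foobar")
# create the directory (do not fail if it already exists)
os.makedirs(mytmpdir, exist_ok=True)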
# list containing the names of the entries in the directory given by path.
os.listdir(mytmpdir)
Out:
['plop']
# Write
lines = ["Dans python tout est bon", "Enfin, presque"]
# Read
## read one line at a time (entire file does not have to fit into memory)
f = open(filename, "r")
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()
## read one line at a time (entire file does not have to fit into memory)
f = open(filename, 'r')
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()
## use list comprehension to duplicate readlines without reading entire file at once
f = open(filename, 'r')
[line for line in f]
f.close()
Out:
/tmp/foobar/myfile.txt
Walk
import os
WD = os.path.join(tmpdir, "foobar")
Out:
import tempfile
import glob
tmpdir = tempfile.gettempdir()
Out:
['/tmp/foobar/myfile.txt']
['myfile']
import shutil

# src and dst are source/destination paths (their definitions are assumed, not shown in this extract)
try:
    shutil.copy(src, dst)
    shutil.rmtree(dst)
    shutil.move(src, dst)
except (FileExistsError, FileNotFoundError) as e:
    pass
Out:
• For more advanced use cases, the underlying Popen interface can be used directly.
• Run the command described by args.
• Wait for command to complete
• return a CompletedProcess instance.
• Does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or
stderr arguments.
import subprocess
# Capture output
out = subprocess.run(["ls", "-a", "/"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
Out:
0
['.', '..', 'bin', 'boot', 'cdrom']
Process
A process is a name given to a program instance that has been loaded into memory
and managed by the operating system.
Process = address space + execution context (thread of control)
Process address space (segments):
• Code.
• Data (static/global).
• Heap (dynamic memory allocation).
• Stack.
Execution context:
• Data registers.
• Stack pointer (SP).
• Program counter (PC).
• Working Registers.
OS Scheduling of processes: context switching (ie. save/load Execution context)
Pros/cons:
• Context switching is expensive.
• (Potentially) complex data sharing (not necessarily true).
• Cooperating processes need no extra memory protection from each other (they already have separate address spaces).
• Relevant for parallel computation with separate memory allocation.
Threads
• Threads share the same address space (Data registers): access to code, heap
and (global) data.
• Separate execution stack, PC and Working Registers.
Pros/cons
• Faster context switching: only the SP, PC and working registers need to be switched.
• Can exploit fine-grain concurrency.
• Simple data sharing through the shared address space.
• Precautions have to be taken, or two threads will write to the same memory at the same time. This is what the global interpreter lock (GIL) is for.
• Relevant for GUIs and concurrent I/O (network, disk) operations.
In Python
• The threading module uses threads.
import time
import threading
startime = time.time()
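# The worker function and the two threads are assumed (their definitions are
# not in this extract); each worker appends numbers to a shared list.
out_list = list()

def count_to(n):
    for i in range(n):
        out_list.append(i)

thread1 = threading.Thread(target=count_to, args=(1_000_000,))
thread2 = threading.Thread(target=count_to, args=(1_000_000,))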
# Will execute both in parallel
thread1.start()
thread2.start()
# Joins threads back to the parent process
thread1.join()
thread2.join()
print("Threading ellapsed time ", time.time() - startime)
print(out_list[:10])
Out:
Multiprocessing
import multiprocessing
startime = time.time()
p1.start()
p2.start()
p1.join()
p2.join()
Out:
import multiprocessing
import time
startime = time.time()
p1.start()
p2.start()
p1.join()
p2.join()
print(out_list[:10])
Out:
[0, 1, 2, 3, 4, 5, 0, 6, -1, 7]
Multiprocessing with shared object ellapsed time 0.3832252025604248
import os
import os.path
import argparse
import re
import pandas as pd
if __name__ == "__main__":
# parse command line options
output = "word_count.csv"
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input',
help='list of input files.',
nargs='+', type=str)
parser.add_argument('-o', '--output',
help='output csv file (default %s)' % output,
type=str, default=output)
options = parser.parse_args()
if options.input is None :
parser.print_help()
raise SystemExit("Error: input files are missing")
else:
filenames = [f for f in options.input if os.path.isfile(f)]
# Match words
regex = re.compile("[a-zA-Z]+")
count = dict()
for filename in filenames:
fd = open(filename, "r")
for line in fd:
for word in regex.findall(line.lower()):
if not word in count:
count[word] = 1
else:
count[word] += 1
fd = open(options.output, "w")
# Pandas
df = pd.DataFrame([[k, count[k]] for k in count], columns=["word", "count"])
df.to_csv(options.output, index=False)
2.10 Networking
# TODO
2.10.1 FTP
Out:
2.10.2 HTTP
# TODO
2.10.3 Sockets
# TODO
2.10.4 xmlrpc
# TODO
A module is a Python file. A package is a directory which MUST contain a special file called
__init__.py
To import, extend variable PYTHONPATH:
export PYTHONPATH=path_to_parent_python_module:${PYTHONPATH}
Or
import sys
sys.path.append("path_to_parent_python_module")
The __init__.py file can be empty. But you can set which modules the package exports as the
API, while keeping other modules internal, by overriding the __all__ variable, like so:
parentmodule/__init__.py file:
import parentmodule.submodule1
import parentmodule.function1
Sources
• http://python-textbok.readthedocs.org/en/latest/Object_Oriented_Programming.html
Principles
• Encapsulate data (attributes) and code (methods) into objects.
• Class = template or blueprint that can be used to create objects.
• An object is a specific instance of a class.
• Inheritance: OOP allows classes to inherit commonly used state and behaviour from other
classes. Reduce code duplication
• Polymorphism: calling code is agnostic as to whether an object belongs to a parent class or to one of its descendants (abstraction, modularity). The same method called on 2 objects of 2 different classes will behave differently.
import math
class Shape2D:
def area(self):
raise NotImplementedError()
# Inheritance + Encapsulation
class Square(Shape2D):
def __init__(self, width):
self.width = width
def area(self):
return self.width ** 2
class Disk(Shape2D):
def __init__(self, radius):
self.radius = radius
def area(self):
return math.pi * self.radius ** 2
# Polymorphism
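# The list of shapes is assumed (not shown in this extract); these instances
# reproduce the output below: Square(2).area() == 4, Disk(3).area() == pi * 3 ** 2.
shapes = [Square(2), Disk(3)]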
print([s.area() for s in shapes])
s = Shape2D()
Out:
[4, 28.274333882308138]
NotImplementedError
See PEP 8
• Spaces (four) are the preferred indentation method.
• Two blank lines for top level function or classes definition.
• One blank line to indicate logical sections.
• Never use: from lib import *
• Bad: Capitalized_Words_With_Underscores
• Function and Variable Names: lower_case_with_underscores
• Class Names: CapitalizedWords (aka: CamelCase)
2.14 Documenting
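The first lines of the example are missing from this extract; a minimal reconstruction of the signature and docstring opening, taken from the help() output shown below:

def my_function(a, b=2):
    """This function ...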
Parameters
----------
a : float
First operand.
b : float, optional
Second operand. The default is 2.
Example
-------
>>> my_function(3)
5
"""
# Add a with b (this is a comment)
return a + b
print(help(my_function))
Out:
my_function(a, b=2)
This function ...
Parameters
----------
a : float
First operand.
b : float, optional
Second operand. The default is 2.
Returns
-------
Sum of operands.
Example
-------
>>> my_function(3)
5
None
"""
Created on Thu Nov 14 12:08:41 CET 2019
@author: firstname.lastname@email.com
Some description
"""
2.15 Exercises
Create a function that acts as a simple calculator. If the operation is not specified, default to addition; if the operation is misspecified, return a prompt message. Ex: calc(4, 5, "multiply") returns 20; calc(3, 5) returns 8; calc(1, 2, "something") returns an error message.
Given a list of numbers, return a list where all adjacent duplicate elements have been reduced
to a single element. Ex: [1, 2, 2, 3, 2] returns [1, 2, 3, 2]. You may create a new list or
modify the passed in list.
Remove all duplicate values (adjacent or not) Ex: [1, 2, 2, 3, 2] returns [1, 2, 3]
THREE
SCIENTIFIC PYTHON
NumPy is an extension to the Python programming language, adding support for large, multi-dimensional (numerical) arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Sources:
• Kevin Markham: https://github.com/justmarkham
Computation time:
import numpy as np
l = [v for v in range(10 ** 8)]
s = 0
%time for v in l: s += v

arr = np.arange(10 ** 8)
%time arr.sum()
Create ndarrays from lists. note: every element must be the same type (will be converted if
possible)
import numpy as np
Out:
np.zeros(10)
np.zeros((3, 6))
np.ones(10)
Out:
int_array = np.arange(5)
float_array = int_array.astype(float)
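# arr1 and arr2 are assumed (their creation is not shown in this extract);
# these definitions are consistent with the comments below.
arr1 = np.array([1., 2., 3.])
arr2 = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.]])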
arr1.dtype # float64
arr2.ndim # 2
arr2.shape # (2, 4) - axis 0 is rows, axis 1 is columns
arr2.size # 8 - total number of elements
len(arr2) # 2 - size of first dimension (aka axis)
Out:
3.1.3 Reshaping
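The code of this example is missing from the extract; a minimal sketch, assuming the arr array used in the rest of this section:

arr = np.arange(10, dtype=float).reshape(2, 5)
print(arr.shape)
print(arr.reshape(5, 2))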
Out:
(2, 5)
[[0. 1.]
[2. 3.]
[4. 5.]
[6. 7.]
[8. 9.]]
Add an axis
a = np.array([0, 1])
a_col = a[:, np.newaxis]
print(a_col)
#or
a_col = a[:, None]
Out:
[[0]
[1]]
Transpose
print(a_col.T)
Out:
[[0 1]]
arr_flt = arr.flatten()
arr_flt[0] = 33
print(arr_flt)
print(arr)
Out:
[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[0. 1. 2. 3. 4.]
[5. 6. 7. 8. 9.]]
arr_flt = arr.ravel()
arr_flt[0] = 33
print(arr_flt)
print(arr)
Out:
[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[33. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]
Numpy internals: by default Numpy uses the C convention, i.e. row-major order: the matrix is stored by rows. In C, the last index changes most rapidly as one moves through the array as stored in memory.
For 2D arrays, sequential moves in memory will:
• iterate over rows (axis 0)
  – iterate over columns (axis 1)
For 3D arrays, sequential moves in memory will:
• iterate over planes (axis 0)
  – iterate over rows (axis 1)
    * iterate over columns (axis 2)
x = np.arange(2 * 3 * 4)
print(x)
Out:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
x = x.reshape(2, 3, 4)
print(x)
Out:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
print(x[0, :, :])
Out:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
print(x[:, 0, :])
Out:
[[ 0 1 2 3]
[12 13 14 15]]
print(x[:, :, 0])
Out:
[[ 0 4 8]
[12 16 20]]
arr[1, :]
arr[:, 2]
Out:
[[0. 1. 2. 3. 4.]
[5. 6. 7. 8. 9.]]
array([2., 7.])
Ravel
print(x.ravel())
Out:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
a = np.array([0, 1])
b = np.array([2, 3])
Horizontal stacking
np.hstack([a, b])
Out:
array([0, 1, 2, 3])
Vertical stacking
np.vstack([a, b])
Out:
array([[0, 1],
[2, 3]])
Default Vertical
np.stack([a, b])
Out:
array([[0, 1],
[2, 3]])
3.1.6 Selection
Single item
Out:
3.0
Slicing
Syntax: start:stop:step with start (default 0) stop (default last) step (default 1)
Out:
[[1. 2. 3.]
[6. 7. 8.]]
arr2[0, 0] = 33
print(arr2)
print(arr)
Out:
[[33. 2. 3.]
[ 6. 7. 8.]]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]
print(arr[0, ::-1])

# The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are
# placed in the left hand side value of an assignment), no view or copy of the array is
# created (because there is no need to). However, with regular values, the above rules
# for creating views do apply.
Out:
[ 4. 3. 2. 33. 0.]
Out:
[[33. 2. 3.]
[ 6. 7. 8.]]
[[44. 2. 3.]
[ 6. 7. 8.]]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]
print(arr2)
arr2[0] = 44
print(arr2)
print(arr)
Out:
[33. 6. 7. 8. 9.]
[44. 6. 7. 8. 9.]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]
However, in the context of lvalue indexing (left hand side value of an assignment), fancy indexing authorizes the modification of the original array:
arr[arr > 5] = 0
print(arr)
Out:
[[0. 0. 2. 3. 4.]
[5. 0. 0. 0. 0.]]
Out:
nums = np.arange(5)
nums * 10 # multiply each element by 10
nums = np.sqrt(nums) # square root of each element
np.ceil(nums) # also floor, rint (round to nearest int)
np.isnan(nums) # checks for NaN
nums + np.arange(5) # add element-wise
np.maximum(nums, np.array([1, -2, 3, -4, 5])) # compare element-wise
# random numbers
np.random.seed(12234) # Set the seed
np.random.rand(2, 3) # 2 x 3 matrix in [0, 1]
np.random.randn(10) # random normals (mean 0, sd 1)
np.random.randint(0, 2, 10) # 10 randomly picked 0 or 1
Out:
array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1])
3.1.8 Broadcasting
Rules

Starting with the trailing axis and working backward, Numpy compares array dimensions.
• If the two dimensions are equal, it continues.
• If one of the operands has dimension 1, it is stretched to match the larger one.
• When one of the shapes runs out of dimensions (because it has fewer dimensions than the other shape), Numpy uses 1 in the comparison process until the other shape's dimensions run out as well.
a = np.array([[ 0,  0,  0],
              [10, 10, 10],
              [20, 20, 20],
              [30, 30, 30]])
b = np.array([0, 1, 2])   # definition assumed from the output below
print(a + b)
Out:
[[ 0 1 2]
[10 11 12]
[20 21 22]
[30 31 32]]
a - a.mean(axis=0)
Out:
(a - a.mean(axis=0)) / a.std(axis=0)
Out:
Examples
Shapes of operands A, B and result:

A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5
3.1.9 Exercises
• For each column find the row index of the minimum value.
• Write a function standardize(X) that return an array whose columns are centered and
scaled (by std-dev).
Total running time of the script: ( 0 minutes 0.012 seconds)
It is often said that 80% of data analysis is spent on cleaning and preparing data. This section focuses on a small, but important, aspect of data manipulation and cleaning with Pandas.
Sources:
• Kevin Markham: https://github.com/justmarkham
• Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
• Series is a one-dimensional labeled array capable of holding any data type (inte-
gers, strings, floating point numbers, Python objects, etc.). The axis labels are col-
lectively referred to as the index. The basic method to create a Series is to call
pd.Series([1,3,5,np.nan,6,8])
• DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It
stems from the R data.frame() object.
import pandas as pd
import numpy as np
print(user3)
Out:
Concatenate DataFrame
Out:
Out:
user1.append(user2)
Join DataFrame
Out:
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
print(merge_inter)
Out:
Out:
Reshaping by pivoting
Out:
Out:
3.2.3 Summarizing
Descriptive statistics
users.describe(include="all")
Meta-information
Out:
(6, 5)
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
df = users[users.gender == "F"]
print(df)
Out:
Get the two first rows using iloc (strictly integer position)
Use loc
try:
df.loc[[0, 1], :] # Failed
except KeyError as err:
print(err)
Out:

pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
Reset index
Out:
3.2.6 Sorting
df = users[:2].copy()
Out:
alice 19
john 26
Out:
alice 19
john 26
for i in range(df.shape[0]):
df.loc[i, "age"] *= 10 # df is modified
Out:
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]
3.2.9 Sorting
df = users.copy()
print(df)
Out:
print(df.describe())
Out:
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
Out:
print(df.groupby("job").mean())
print(df.groupby("job")["age"].mean())
print(df.groupby("job").describe(include='all'))
Out:
age height
job
engineer 33.000000 NaN
manager 58.000000 NaN
scientist 44.000000 171.000000
student 22.333333 173.333333
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
name ... height
count unique top freq mean std ... std min 25% 50% ␣
˓→ 75% max
job ...
engineer 1 1 peter 1 NaN NaN ... NaN NaN NaN NaN ␣
˓→ NaN NaN
manager 1 1 paul 1 NaN NaN ... NaN NaN NaN NaN ␣
˓→ NaN NaN
scientist 1 1 julie 1 NaN NaN ... NaN 171.0 171.0 171.0 ␣
˓→171.0 171.0
student 3 3 eric 1 NaN NaN ... 7.637626 165.0 170.0 175.0 ␣
˓→177.5 180.0
[4 rows x 44 columns]
Groupby in a loop
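The loop itself is not reproduced in this extract; a minimal sketch, assuming the users DataFrame defined above:

for group_name, group_df in users.groupby("job"):
    print(group_name, group_df.shape)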
Out:
df = users.append(users.iloc[0], ignore_index=True)
Out:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data
df.describe(include='all')
Out:
name 0
age 0
gender 0
job 0
height 2
dtype: int64
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
Out:
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
Out:
3.2.13 Renaming
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})
Rename values
Assume the random variable follows a normal distribution. Exclude data outside 3 standard deviations:
• Probability that a sample lies within 1 sd: 68.27%
• Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
Out:
248.48963819938044
The median absolute deviation (MAD), based on the median, is a robust non-parametric statistic: https://en.wikipedia.org/wiki/Median_absolute_deviation
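The MAD-based correction itself is not shown in this extract; a sketch following the same pattern as the mean-based correction above (the exact original computation is an assumption):

mad = 1.4826 * np.median(np.abs(size - size.median()))
size_outlr_mad = size.copy()
size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())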
Out:
173.80000467192673 178.7023568870694
csv
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Excel
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
users.to_excel(writer, sheet_name='users', index=False)
df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)
import pandas as pd
import sqlite3
Connect
conn = sqlite3.connect(db_filename)
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
Out:
3.2.16 Exercises
Data Frame
Missing data
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fill all missing value of numerical column
with the mean of the current columns.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets:
original, imputed.
Total running time of the script: ( 0 minutes 1.125 seconds)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(9, 3))
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()
plt.figure(figsize=(9, 3))
# Rapid multiplot
plt.figure(figsize=(9, 3))
cosinus = np.cos(x)
plt.plot(x, sinus, "-b", x, sinus, "ob", x, cosinus, "-r", x, cosinus, "or")
plt.xlabel('this is x!')
plt.ylabel('this is y!')
plt.title('My First Plot')
plt.show()
# Step by step
plt.figure(figsize=(9, 3))
plt.plot(x, sinus, label='sinus', color='blue', linestyle='--', linewidth=2)
plt.plot(x, cosinus, label='cosinus', color='red', linestyle='-', linewidth=2)
plt.legend()
plt.show()
Load dataset
import pandas as pd
try:
    salary = pd.read_csv("../datasets/salary_table.csv")
except:
    url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)
df = salary
print(df.head())
Legend outside
Linear model
# Prefer vectorial format (SVG: Scalable Vector Graphics) can be edited with
# Inkscape, Adobe Illustrator, Blender, etc.
plt.plot(x, sinus)
plt.savefig("sinus.svg")
plt.close()
# Or pdf
plt.plot(x, sinus)
plt.savefig("sinus.pdf")
plt.close()
Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution.
i = 0
for edu, d in salary.groupby(['education']):
    sns.kdeplot(x="salary", hue="management", data=d, fill=True, ax=axes[i], palette="muted")
    axes[i].set_title(edu)
    i += 1
ax = sns.pairplot(salary, hue="management")
FOUR
STATISTICS
4.1.1 Libraries
Data
import numpy as np
import pandas as pd
Plots
Statistics
• Basic: scipy.stats
• Advanced: statsmodels. statsmodels API:
– statsmodels.api: Cross-sectional models and methods. Canonically imported using
import statsmodels.api as sm.
– statsmodels.formula.api: A convenience interface for specifying models using for-
mula strings and DataFrames. Canonically imported using import statsmodels.
formula.api as smf
– statsmodels.tsa.api: Time-series models and methods. Canonically imported us-
ing import statsmodels.tsa.api as tsa.
import scipy.stats
import statsmodels.api as sm
#import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera
%matplotlib inline
Datasets
Salary
try:
    salary = pd.read_csv("../datasets/salary_table.csv")
except:
    url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)
Iris
Mean

The estimator $\bar{x}$ on a sample of size $n$, $x = x_1, ..., x_n$, is given by

$$\bar{x} = \frac{1}{n} \sum_i x_i$$
Variance

$$\text{Var}(X) = E[(X - E[X])^2], \qquad \sigma_x^2 = \frac{1}{n - 1} \sum_i (x_i - \bar{x})^2$$

Note here the subtracted 1 degree of freedom (df) in the divisor. In standard statistical practice, $df = 1$ provides an unbiased estimator of the variance of a hypothetical infinite population. With $df = 0$ it instead provides a maximum likelihood estimate of the variance for normally distributed variables.
Standard deviation

$$\text{Std}(X) = \sqrt{\text{Var}(X)}$$

The estimator is simply $\sigma_x = \sqrt{\sigma_x^2}$.
Covariance

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])], \qquad \sigma_{xy} = \frac{1}{n - 1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

Correlation

$$\text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{Std}(X)\,\text{Std}(Y)}$$

The estimator is

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\,\sigma_y}.$$
The standard error (SE) is the standard deviation (of the sampling distribution) of a statistic:

$$\text{SE}(X) = \frac{\text{Std}(X)}{\sqrt{n}}.$$
• Generate 2 random samples: 𝑥 ∼ 𝑁 (1.78, 0.1) and 𝑦 ∼ 𝑁 (1.66, 0.1), both of size 10.
• Compute $\bar{x}$, $\sigma_x$, $\sigma_{xy}$ (xbar, xvar, xycov) using only the np.sum() operation. Explore the np module to find out which numpy functions perform the same computations and compare them (using assert) with your previous results.

Caution! By default np.var() uses the biased estimator (with ddof=0). Set ddof=1 to use the unbiased estimator.
n = 10
x = np.random.normal(loc=1.78, scale=.1, size=n)
y = np.random.normal(loc=1.66, scale=.1, size=n)

xbar = np.mean(x)
assert xbar == np.sum(x) / x.shape[0]

xvar = np.var(x, ddof=1)   # unbiased variance estimator (definition assumed; used in the asserts below)
assert np.allclose(xvar, np.sum((x - xbar) ** 2) / (n - 1))

xycov = np.cov(x, y)
print(xycov)

ybar = np.sum(y) / n
assert np.allclose(xycov[0, 1], np.sum((x - xbar) * (y - ybar)) / (n - 1))
assert np.allclose(xycov[0, 0], xvar)
assert np.allclose(xycov[1, 1], np.var(y, ddof=1))
[[ 0.01025944 -0.00661557]
[-0.00661557 0.0167 ]]
With Pandas
Columns’ means
iris.mean()
SepalLength 5.843333
SepalWidth 3.057333
PetalLength 3.758000
PetalWidth 1.199333
dtype: float64
iris.std()
SepalLength 0.828066
SepalWidth 0.435866
PetalLength 1.765298
PetalWidth 0.762238
dtype: float64
With Numpy
Columns’ std-dev. Numpy normalizes by N by default. Set ddof=1 to normalize by N-1 to get
the unbiased estimator.
X.std(axis=0, ddof=1)
Normal distribution
The normal distribution, noted $\mathcal{N}(\mu, \sigma)$, has parameters $\mu$, the mean (location), and $\sigma > 0$, the std-dev. Estimators: $\bar{x}$ and $\sigma_x$.

The normal distribution, noted $\mathcal{N}$, is useful because of the central limit theorem (CLT), which states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
mu = 0 # mean
variance = 2 #variance
sigma = np.sqrt(variance)  # standard deviation
x = np.linspace(mu - 3 * variance, mu + 3 * variance, 100)
_ = plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma))
The chi-square, or $\chi^2_n$, distribution with $n$ degrees of freedom (df) is the distribution of a sum of the squares of $n$ independent standard normal random variables $\mathcal{N}(0, 1)$. Let $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim \mathcal{N}(0, 1)$, and:

• The squared standard normal $Z^2 \sim \chi^2_1$ (one df).
• The sum of squares of $n$ normal random variables: $\sum_i^n Z_i^2 \sim \chi^2_n$.

The sum of two $\chi^2$ RVs with $p$ and $q$ df is a $\chi^2$ RV with $p + q$ df. This is useful when summing/subtracting sums of squares.

The $\chi^2$-distribution is used to model errors measured as sums of squares or the distribution of the sample variance.
The $F$-distribution, $F_{n,p}$, with $n$ and $p$ degrees of freedom, is the ratio of two independent $\chi^2$ variables. Let $X \sim \chi^2_n$ and $Y \sim \chi^2_p$, then:

$$F_{n,p} = \frac{X/n}{Y/p}$$

The $F$-distribution plays a central role in hypothesis testing, answering the questions: are two variances equal? Is the ratio of two errors significantly large?
84 Chapter 4. Statistics
Statistics and Machine Learning in Python, Release 0.5
Let $M \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_n$. The $t$-distribution, $T_n$, with $n$ degrees of freedom, is the ratio:

$$T_n = \frac{M}{\sqrt{V/n}}$$

The distribution of the difference between an estimated parameter and its true (or assumed) value, divided by the standard deviation of the estimated parameter (standard error), follows a $t$-distribution. Is this parameter different from a given value?
Examples

• Test a proportion: biased coin? 200 heads have been found over 300 flips; is the coin biased?
• Test the association between two variables.
  – Example, height and sex: in a sample of 25 individuals (15 females, 10 males), is female height different from male height?
  – Example, age and arterial hypertension: in a sample of 25 individuals, is age correlated with arterial hypertension?
Steps
1. Model the data.
2. Fit: estimate the model parameters (frequency, mean, correlation, regression coefficient).
3. Compute a test statistic from the model parameters.
4. Formulate the null hypothesis: what would be the (distribution of the) test statistic if the observations were the result of pure chance?
5. Compute the probability ($p$-value) of obtaining a larger value for the test statistic by chance (under the null hypothesis).
Biased coin? 2 heads have been found over 3 flips; is the coin biased?
1. Model the data: the number of heads follows a Binomial distribution.
2. Compute the model parameters: N = 3, P = the frequency of heads over the number of flips: 2/3.
3. Compute a test statistic, same as the frequency.
4. Under the null hypothesis, the distribution of the number of heads is:
flip 1   flip 2   flip 3   #heads
-        -        -        0
H        -        -        1
-        H        -        1
-        -        H        1
H        H        -        2
H        -        H        2
-        H        H        2
H        H        H        3

There are 8 possible configurations; the probabilities of the different values of x, the number of heads, are:
• P(x = 0) = 1/8
• P(x = 1) = 3/8
• P(x = 2) = 3/8
• P(x = 3) = 1/8
plt.figure(figsize=(5, 3))
plt.bar([0, 1, 2, 3], [1/8, 3/8, 3/8, 1/8], width=0.9)
_ = plt.xticks([0, 1, 2, 3], [0, 1, 2, 3])
plt.xlabel("Distribution of the number of head over 3 flip under the null hypothesis")

Text(0.5, 0, 'Distribution of the number of head over 3 flip under the null hypothesis')
5. Compute the probability ($p$-value) of observing a value larger than or equal to 2 under the null hypothesis.
Biased coin? 60 heads have been found over 100 flips; is the coin biased?
1. Model the data: the number of heads follows a Binomial distribution.
2. Compute the model parameters: N = 100, P = 60/100.
3. Compute a test statistic, same as the frequency: 60/100.
4. Under the null hypothesis, the distribution of the number of heads ($k$) follows the binomial distribution with parameters N = 100, P = 0.5:

$$Pr(X = k | H_0) = Pr(X = k | n = 100, p = 0.5) = \binom{100}{k} 0.5^k (1 - 0.5)^{(100 - k)}.$$

$$P(X = k \geq 60 | H_0) = \sum_{k=60}^{100} \binom{100}{k} 0.5^k (1 - 0.5)^{(100 - k)} = 1 - \sum_{k=1}^{60} \binom{100}{k} 0.5^k (1 - 0.5)^{(100 - k)},$$

i.e. one minus the cumulative distribution function.
0.01760010010885238
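A minimal sketch of this tail probability with scipy.stats; the exact call used to produce the value printed above is an assumption:

# P(X > 60) = 1 - CDF(60) under the null Binomial(n=100, p=0.5) model
pval = 1 - scipy.stats.binom.cdf(60, 100, 0.5)
print(pval)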
The one-sample 𝑡-test is used to determine whether a sample comes from a population with a
specific mean. For example you want to test if the average height of a population is 1.75 𝑚.
Assumptions
In testing the null hypothesis that the population mean is equal to a specified value $\mu_0 = 1.75$, one uses the statistic:

$$t = \frac{\text{difference of means}}{\text{std-dev of noise}} \sqrt{n} = \text{effect size} \cdot \sqrt{n} = \frac{\bar{x} - \mu_0}{s_x} \sqrt{n}$$
4. Compute the probability of the test statistic under the null hypothesis. This requires the distribution of the $t$ statistic under $H_0$.
Example
Given the following samples, we will test whether its true mean is 1.75.
Warning, when computing the std or the variance, set ddof=1. The default value, ddof=0, leads
to the biased estimator of the variance.
x = [1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87]
xbar = np.mean(x)   # sample mean (call assumed; it produces the value printed below)
print(xbar)
1.816
2.3968766311585883
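The computation of the $t$ statistic printed above is not reproduced in this extract; a minimal sketch, assuming the one-sample formula given earlier with $\mu_0 = 1.75$:

mu0 = 1.75                        # value tested under the null hypothesis
s = np.std(x, ddof=1)             # unbiased standard deviation
tobs = (xbar - mu0) / (s / np.sqrt(len(x)))
print(tobs)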
The $p$-value is the probability of observing a value $t$ more extreme than the observed one $t_{obs}$ under the null hypothesis $H_0$: $P(t > t_{obs} | H_0)$.
See: http://www.ats.ucla.edu/stat/mult_pkg/whatstat/
4.1.6 Pearson correlation test: test association between two quantitative variables
Test the correlation coefficient of two quantitative variables. The test calculates a Pearson correlation coefficient and the $p$-value for testing non-correlation.

Let $x$ and $y$ be two quantitative variables, where $n$ samples were observed. The linear correlation coefficient is defined as:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.$$

Under $H_0$, the test statistic $t = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}}$ follows a Student distribution with $n - 2$ degrees of freedom.
n = 50
x = np.random.normal(size=n)
y = 2 * x + np.random.normal(size=n)
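The test call is not shown in this extract; a minimal sketch, assuming scipy.stats.pearsonr on the two samples generated above:

cor, pval = scipy.stats.pearsonr(x, y)
print(cor, pval)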
0.8297883544365898 9.497428029783463e-14
The two-sample 𝑡-test (Snedecor and Cochran, 1989) is used to determine if two population
means are equal. There are several variations on this test. If data are paired (e.g. 2 measures,
before and after treatment for each individual) use the one-sample 𝑡-test of the difference. The
variances of the two samples may be assumed to be equal (a.k.a. homoscedasticity) or unequal
(a.k.a. heteroscedasticity).
Assumptions
Assume that the two random variables are normally distributed: 𝑦1 ∼ 𝒩 (𝜇1 , 𝜎1 ), 𝑦2 ∼
𝒩 (𝜇2 , 𝜎2 ).
3. $t$-test

$$t = \frac{\text{difference of means}}{\text{standard dev of error}} = \frac{\text{difference of means}}{\text{its standard error}} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\sum \varepsilon^2}} \sqrt{n - 2} = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}}$$

since

$$s^2_{\bar{y}_1 - \bar{y}_2} = s^2_{\bar{y}_1} + s^2_{\bar{y}_2} = \frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}$$

thus

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}}$$

To compute the $p$-value one needs the degrees of freedom associated with this variance estimate. It is approximated using the Welch-Satterthwaite equation:

$$\nu \approx \frac{\left( \frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2} \right)^2}{\frac{s^4_{y_1}}{n_1^2 (n_1 - 1)} + \frac{s^4_{y_2}}{n_2^2 (n_2 - 1)}}.$$
If we assume equal variance (i.e., $s^2_{y_1} = s^2_{y_2} = s^2$), where $s^2$ is an estimator of the common variance of the two samples, then

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Therefore, the $t$ statistic that is used to test whether the means are different is:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},$$

which, for equal sample sizes ($n_1 = n_2 = n$), becomes

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \sqrt{2}} \cdot \sqrt{n} \approx \text{effect size} \cdot \sqrt{n} \approx \frac{\text{difference of means}}{\text{standard deviation of the noise}} \cdot \sqrt{n}$$
Example
Given the following two samples, test whether their means are equal using the standard t-test,
assuming equal variance.
height = np.array([1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87,
Ttest_indResult(statistic=3.5511519888466885, pvalue=0.00228208937112721)
Analysis of variance (ANOVA) provides a statistical test of whether or not the means of several
(k) groups are equal, and therefore generalizes the 𝑡-test to more than two groups. ANOVAs
are useful for comparing (testing) three or more means (groups or variables) for statistical
significance. It is conceptually similar to multiple two-sample 𝑡-tests, but is less conservative.
Here we will consider the one-way ANOVA, i.e. ANOVA with one independent variable.
Wikipedia:
• Test if any group is on average superior, or inferior, to the others versus the null hypothesis
that all four strategies yield the same mean response
• Detect any of several possible differences.
• The advantage of the ANOVA 𝐹 -test is that we do not need to pre-specify which strategies
are to be compared, and we do not need to adjust for making multiple comparisons.
• The disadvantage of the ANOVA 𝐹 -test is that if we reject the null hypothesis, we do not
know which strategies can be said to be significantly different from the others.
Assumptions
1. The samples are randomly selected in an independent manner from the k populations.
2. All k populations have distributions that are approximately normal. Check by plotting
groups distribution.
3. The k population variances are equal. Check by plotting groups distribution.
Is there a difference in Petal Width between species in the iris dataset? Let $y_1$, $y_2$ and $y_3$ be the Petal Width in the three species.
Here we assume (see assumptions) that the three populations were sampled from three random
variables that are normally distributed. I.e., 𝑌1 ∼ 𝑁 (𝜇1 , 𝜎1 ), 𝑌2 ∼ 𝑁 (𝜇2 , 𝜎2 ) and 𝑌3 ∼ 𝑁 (𝜇3 , 𝜎3 ).
3. $F$-test

$$F = \frac{\text{Explained variance}}{\text{Unexplained variance}} = \frac{\text{Between-group variability}}{\text{Within-group variability}} = \frac{s_B^2}{s_W^2}.$$

The "explained variance", or "between-group variability", is

$$s_B^2 = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y})^2 / (K - 1),$$

where $\bar{y}_{i\cdot}$ denotes the sample mean in the $i$th group, $n_i$ is the number of observations in the $i$th group, $\bar{y}$ denotes the overall mean of the data, and $K$ denotes the number of groups.

The "unexplained variance", or "within-group variability", is

$$s_W^2 = \sum_{ij} (y_{ij} - \bar{y}_{i\cdot})^2 / (N - K),$$
where 𝑦𝑖𝑗 is the 𝑗th observation in the 𝑖th out of 𝐾 groups and 𝑁 is the overall sample size.
This 𝐹 -statistic follows the 𝐹 -distribution with 𝐾 − 1 and 𝑁 − 𝐾 degrees of freedom under the
null hypothesis. The statistic will be large if the between-group variability is large relative to
the within-group variability, which is unlikely to happen if the population means of the groups
all have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, 𝐹 = 𝑡2 where 𝑡 is the
Student’s 𝑡 statistic.
Iris dataset:
# Group means
means = iris.groupby("Species").mean().reset_index()
print(means)
# Plot groups
ax = sns.violinplot(x="Species", y="SepalLength", data=iris)
ax = sns.swarmplot(x="Species", y="SepalLength", data=iris,
color="white")
ax = sns.swarmplot(x="Species", y="SepalLength", color="black", data=means, size=10)
# ANOVA
lm = smf.ols('SepalLength ~ Species', data=iris).fit()
sm.stats.anova_lm(lm, typ=2) # Type 2 ANOVA DataFrame
Computes the chi-square, 𝜒2 , statistic and 𝑝-value for the hypothesis test of independence of
frequencies in the observed contingency table (cross-table). The observed frequencies are tested
against an expected contingency table obtained by computing expected frequencies based on
the marginal sums under the assumption of independence.
Example: 20 participants: 10 exposed to some chemical product and 10 non exposed (exposed
= 1 or 0). Among the 20 participants 10 had cancer 10 not (cancer = 1 or 0). 𝜒2 tests the
association between those two variables.
# Dataset:
# 20 samples:
# 10 first exposed
exposed = np.array([1] * 10 + [0] * 10)
# 8 first with cancer, 10 without, the last two with.
cancer = np.array([1] * 8 + [0] * 10 + [1] * 2)
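A minimal sketch of the computation whose results are printed below, assuming pandas.crosstab and scipy.stats.chi2_contingency (the exact original calls are not in this extract):

crosstab = pd.crosstab(exposed, cancer, rownames=['exposed'], colnames=['cancer'])
print(crosstab)
# chi-square test of independence (Yates' correction is applied by default)
chi2, pval, dof, expected = scipy.stats.chi2_contingency(crosstab)
print("Chi2 = %f, pval = %f" % (chi2, pval))
print(expected)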
Observed table:
---------------
cancer 0 1
exposed
0 8 2
1 2 8
Statistics:
-----------
Chi2 = 5.000000, pval = 0.025347
Expected table:
---------------
[[5. 5.]
[5. 5.]]
cancer_marg = crosstab.sum(axis=1)
cancer_freq = cancer_marg / cancer_marg.sum()
print('Expected frequencies:')
print(np.outer(exposed_freq, cancer_freq))
np.random.seed(3)
sns.regplot(x=age, y=sbp)
# Non-Parametric Spearman
cor, pval = scipy.stats.spearmanr(age, sbp)
print("Non-Parametric Spearman cor test, cor: %.4f, pval: %.4f" % (cor, pval))
Source: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when com-
paring two related samples, matched samples, or repeated measurements on a single sample
to assess whether their population mean ranks differ (i.e. it is a paired difference test). It is
equivalent to one-sample test of the difference of paired samples.
It can be used as an alternative to the paired Student’s 𝑡-test, 𝑡-test for matched pairs, or the 𝑡-
test for dependent samples when the population cannot be assumed to be normally distributed.
When to use it? Observe the data distribution:

• presence of outliers
• the distribution of the residuals is not Gaussian

It has a lower sensitivity compared to the $t$-test and may be problematic to use when the sample size is small.
Null hypothesis 𝐻0 : difference between the pairs follows a symmetric distribution around zero.
n = 20
# Business volume at time 0
bv0 = np.random.normal(loc=3, scale=.1, size=n)
# Business volume at time 1
bv1 = bv0 + 0.1 + np.random.normal(loc=0, scale=.1, size=n)
# create an outlier
bv1[0] -= 10

# Paired t-test
print(scipy.stats.ttest_rel(bv0, bv1))
# Wilcoxon signed-rank test (call assumed; it produces the WilcoxonResult printed below)
print(scipy.stats.wilcoxon(bv0, bv1))
Ttest_relResult(statistic=0.7766377807752968, pvalue=0.44693401731548044)
WilcoxonResult(statistic=23.0, pvalue=0.001209259033203125)
n = 20
# Business volume, group 0
bv0 = np.random.normal(loc=1, scale=.1, size=n)
# Business volume, group 1 (definition assumed; not shown in this extract)
bv1 = np.random.normal(loc=1.2, scale=.1, size=n)
# create an outlier
bv1[0] -= 10
# Two-samples t-test
print(scipy.stats.ttest_ind(bv0, bv1))
# Wilcoxon
print(scipy.stats.mannwhitneyu(bv0, bv1))
Ttest_indResult(statistic=0.6104564820307219, pvalue=0.5451934484051324)
MannwhitneyuResult(statistic=41.0, pvalue=9.037238869417781e-06)
Given $n$ random samples $(y_i, x_{1i}, \ldots, x_{pi}),\ i = 1, \ldots, n$, linear regression models the relation between the observations $y_i$ and the independent variables $x_{pi}$ as:

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \qquad i = 1, \ldots, n$$

• The $\beta$'s are the model parameters, i.e., the regression coefficients.
• $\beta_0$ is the intercept or the bias.
• $\varepsilon_i$ are the residuals.
• An independent variable (IV). It is a variable that stands alone and isn’t changed by
the other variables you are trying to measure. For example, someone’s age might be an
independent variable. Other factors (such as what they eat, how much they go to school,
how much television they watch) aren’t going to change a person’s age. In fact, when
you are looking for some kind of relationship between variables you are trying to see if
the independent variable causes some kind of change in the other variables, or dependent
variables. In Machine Learning, these variables are also called the predictors.
• A dependent variable. It is something that depends on other factors. For example, a test
score could be a dependent variable because it could change depending on several factors
such as how much you studied, how much sleep you got the night before you took the
test, or even how hungry you were when you took it. Usually when you are looking for
a relationship between two things you are trying to find out what makes the dependent
variable change the way it does. In Machine Learning this variable is called a target
variable.
Assumptions
Using the dataset "salary", explore the association between the dependent variable (e.g. Salary) and the independent variable (e.g. Experience, which is quantitative), considering only non-managers.
df = salary[salary.management == 'N']
Model the data on some hypothesis e.g.: salary is a linear function of the experience.
$$\text{salary}_i = \beta_0 + \beta\, \text{experience}_i + \epsilon_i,$$

more generally

$$y_i = \beta_0 + \beta\, x_i + \epsilon_i$$

This can be rewritten in matrix form using the design matrix, made of the values of the independent variable and the intercept:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \end{bmatrix}$$
• 𝛽: the slope or coefficient or parameter of the model,
• 𝛽0 : the intercept or bias is the second parameter of the model,
• 𝜖𝑖 : is the 𝑖th error, or residual with 𝜖 ∼ 𝒩 (0, 𝜎 2 ).
The simple regression is equivalent to the Pearson correlation.

Recall from calculus that an extreme point can be found by computing where the derivative is zero, i.e. to find the intercept, we perform the steps:

$$\frac{\partial SSE}{\partial \beta_0} = \sum_i (y_i - \beta x_i - \beta_0) = 0$$
$$\sum_i y_i = \beta \sum_i x_i + n \beta_0$$
$$n \bar{y} = n \beta \bar{x} + n \beta_0$$
$$\beta_0 = \bar{y} - \beta \bar{x}$$

To find the regression coefficient, we perform the steps:

$$\frac{\partial SSE}{\partial \beta} = \sum_i x_i (y_i - \beta x_i - \beta_0) = 0$$

Plug in $\beta_0$:

$$\sum_i x_i (y_i - \beta x_i - \bar{y} + \beta \bar{x}) = 0$$
$$\sum_i x_i y_i - \bar{y} \sum_i x_i = \beta \sum_i x_i (x_i - \bar{x})$$
y, x = df.salary, df.experience
beta, beta0, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print("y = %f x + %f, r: %f, r-squared: %f,\np-value: %f, std_err: %f"
% (beta, beta0, r_value, r_value**2, p_value, std_err))
print("Using seaborn")
ax = sns.regplot(x="experience", y="salary", data=df)
Using seaborn
Multiple regression
Theory
or, simplified

$$y_i = \beta_0 + \sum_{j=1}^{P-1} \beta_j x_{ji} + \varepsilon_i.$$
Extending each sample with an intercept, $x_i := [1, x_i] \in \mathbb{R}^{P+1}$, allows us to use a more general notation based on linear algebra and write it as a simple dot product:

$$y_i = \mathbf{x}_i^T \beta + \varepsilon_i,$$

where $\beta \in \mathbb{R}^{P+1}$ is a vector of weights that define the $P + 1$ parameters of the model. From now on we have $P$ regressors + the intercept.
Using the matrix notation:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \ldots & x_{1P} \\ 1 & x_{21} & \ldots & x_{2P} \\ 1 & x_{31} & \ldots & x_{3P} \\ 1 & x_{41} & \ldots & x_{4P} \\ 1 & x_{51} & \ldots & x_{5P} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_P \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \end{bmatrix}$$

Let $X = [x_0^T, ..., x_N^T]$ be the $(N \times P + 1)$ design matrix of $N$ samples of $P$ input features with one column of ones, and let $y = [y_1, ..., y_N]$ be a vector of the $N$ targets.

$$y = X \beta + \varepsilon$$
Using the matrix notation, the mean squared error (MSE) loss can be rewritten:

$$MSE(\beta) = \frac{1}{N} ||y - X\beta||_2^2.$$

The $\beta$ that minimises the MSE can be found by:

$$\nabla_\beta \left( \frac{1}{N} ||y - X\beta||_2^2 \right) = 0$$
$$\frac{1}{N} \nabla_\beta (y - X\beta)^T (y - X\beta) = 0$$
$$\frac{1}{N} \nabla_\beta (y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta) = 0$$
$$-2 X^T y + 2 X^T X \beta = 0$$
$$X^T X \beta = X^T y$$
$$\beta = (X^T X)^{-1} X^T y$$

The simulated dataset below has $P = 3$ predictors plus an intercept, with true coefficients $[10, 1, 0.5, 0.1]$:

$$\begin{bmatrix} y_1 \\ \vdots \\ y_{50} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{50,1} & x_{50,2} & x_{50,3} \end{bmatrix} \begin{bmatrix} 10 \\ 1 \\ 0.5 \\ 0.1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{50} \end{bmatrix}$$
# Dataset
N, P = 50, 4
X = np.random.normal(size=N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])

# Simulate the target with the true coefficients shown above (these lines are assumed,
# not shown in this extract)
betastar = np.array([10, 1, .5, .1])
e = np.random.normal(size=N)
y = np.dot(X, betastar) + e

from scipy import linalg
Xpinv = linalg.pinv2(X)   # Moore-Penrose pseudo-inverse (scipy.linalg.pinv in recent SciPy)
betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)
Estimated beta:
[10.14742501 0.57938106 0.51654653 0.17862194]
Sources: http://statsmodels.sourceforge.net/devel/examples/
Multiple regression
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Use R language syntax for data.frame. For an additive model: $y_i = \beta^0 + x_i^1 \beta^1 + x_i^2 \beta^2 + \epsilon_i$ ≡ y ~ x1 + x2.
print(df.columns, df.shape)
# Build the model; the intercept is implicit (added automatically by the formula API)
model = smf.ols("y~x1 + x2 + x3", df).fit()
print(model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis of covariance (ANCOVA) is a linear model that blends ANOVA and linear regression.
ANCOVA evaluates whether population means of a dependent variable (DV) are equal across
levels of a categorical independent variable (IV) often called a treatment, while statistically
controlling for the effects of other quantitative or continuous variables that are not of primary
interest, known as covariates (CV).
df = salary.copy()
The normality assumption of the residuals can be rejected (p-value < 0.05). There is an effect of the “management” factor; take it into account.
One-way AN(C)OVA
sum_sq df F PR(>F)
management 5.755739e+08 1.0 183.593466 4.054116e-17
experience 3.334992e+08 1.0 106.377768 3.349662e-13
Residual 1.348070e+08 43.0 NaN NaN
Jarque-Bera normality test p-value 0.004
Distribution of residuals is still not normal but closer to normality. Both management and
experience are significantly associated with salary.
Two-way AN(C)OVA
df["residuals"] = twoway.resid
sns.displot(df, x='residuals', kind="kde", fill=True)
print(sm.stats.anova_lm(twoway, typ=2))
sum_sq df F PR(>F)
education 9.152624e+07 2.0 43.351589 7.672450e-11
management 5.075724e+08 1.0 480.825394 2.901444e-24
experience 3.380979e+08 1.0 320.281524 5.546313e-21
Residual 4.328072e+07 41.0 NaN NaN
Jarque-Bera normality test p-value 0.506
The normality assumption cannot be rejected, so we assume it holds. Education, management and experience are significantly associated with salary.
oneway is nested within twoway. Comparing two nested models tells us if the additional predic-
tors (i.e. education) of the full model significantly decrease the residuals. Such comparison can
be done using an 𝐹 -test on residuals:
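A minimal sketch of such a comparison, assuming the two fitted models are named oneway and twoway as in the text (the formulas are inferred from the ANOVA tables above; the original code is not shown in this excerpt):

import statsmodels.formula.api as smf

# Hypothetical fits mirroring the nested models discussed above
oneway = smf.ols('salary ~ management + experience', df).fit()
twoway = smf.ols('salary ~ education + management + experience', df).fit()

# F-test comparing the nested models: does adding education significantly reduce the residuals?
f_value, p_value, df_diff = twoway.compare_f_test(oneway)
print("F=%.3f, p=%.5f, df_diff=%d" % (f_value, p_value, df_diff))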
Factor coding
See http://statsmodels.sourceforge.net/devel/contrasts.html
By default, statsmodels (through its patsy formula interface) uses “dummy coding” (treatment contrasts). Explore:
print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, :])
[[1. 0. 0. 1. 1.]
[1. 0. 1. 0. 1.]
[1. 0. 1. 1. 1.]
[1. 1. 0. 0. 1.]
[1. 0. 1. 0. 1.]
[1. 1. 0. 1. 2.]
[1. 1. 0. 0. 2.]
[1. 0. 0. 0. 2.]
[1. 0. 1. 0. 2.]
[1. 1. 0. 0. 3.]]
# Dataset
n_samples, n_features = 100, 1000
n_info = int(n_features/10)  # number of features with information
n1, n2 = int(n_samples/2), n_samples - int(n_samples/2)
snr = .5
Y = np.random.randn(n_samples, n_features)
grp = np.array(["g1"] * n1 + ["g2"] * n2)
# Assumed (line not shown in this excerpt): add an effect of size snr to the first
# n_info features of group g1, so that "Positives" exist
Y[grp == "g1", :n_info] += snr
#
import scipy.stats as stats
import matplotlib.pyplot as plt
tvals, pvals = np.full(n_features, np.nan), np.full(n_features, np.nan)
for j in range(n_features):
    tvals[j], pvals[j] = stats.ttest_ind(Y[grp == "g1", j], Y[grp == "g2", j],
                                         equal_var=True)

# Assumed figure creation (not shown in this excerpt); the first two panels, not
# reproduced here, display the simulated data and the t-value histogram
fig, axis = plt.subplots(3, 1, figsize=(9, 9))
axis[2].hist([pvals[n_info:], pvals[:n_info]],
             stacked=True, bins=100, label=["Negatives", "Positives"])
axis[2].set_xlabel("p-value histogram")
axis[2].set_ylabel("density")
axis[2].legend()
plt.tight_layout()
Note that under the null hypothesis the distribution of the p-values is uniform.
Statistical measures:
• True Positive (TP) equivalent to a hit. The test correctly concludes the presence of an
effect.
• True Negative (TN). The test correctly concludes the absence of an effect.
• False Positive (FP) equivalent to a false alarm, Type I error. The test improperly con-
cludes the presence of an effect. Thresholding at 𝑝-value < 0.05 leads to 47 FP.
• False Negative (FN) equivalent to a miss, Type II error. The test improperly concludes the
absence of an effect.
The Bonferroni correction is based on the idea that if an experimenter is testing 𝑃 hypothe-
ses, then one way of maintaining the familywise error rate (FWER) is to test each individual
hypothesis at a statistical significance level of 1/𝑃 times the desired maximum overall level.
So, if the desired significance level for the whole family of tests is 𝛼 (usually 0.05), then the
Bonferroni correction would test each individual hypothesis at a significance level of 𝛼/𝑃 . For
example, if a trial is testing 𝑃 = 8 hypotheses with a desired 𝛼 = 0.05, then the Bonferroni
correction would test each individual hypothesis at 𝛼 = 0.05/8 = 0.00625.
FDR-controlling procedures are designed to control the expected proportion of rejected null
hypotheses that were incorrect rejections (“false discoveries”). FDR-controlling procedures pro-
vide less stringent control of Type I errors compared to the familywise error rate (FWER) con-
trolling procedures (such as the Bonferroni correction), which control the probability of at least
one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased
rates of Type I errors.
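As an illustration (a sketch, not part of the original lab), both corrections can be applied to the simulated p-values above with statsmodels:

from statsmodels.stats.multitest import multipletests

# Bonferroni: FWER control at alpha = 0.05
reject_bonf, pvals_bonf, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
# Benjamini-Hochberg: FDR control at alpha = 0.05
reject_fdr, pvals_fdr, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

print("Rejections: raw %i, Bonferroni %i, FDR-BH %i" %
      ((pvals < 0.05).sum(), reject_bonf.sum(), reject_fdr.sum()))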
The study provides the brain volumes of grey matter (gm), white matter (wm) and cerebrospinal fluid (csf) of 808 anatomical MRI scans.
import os
import os.path
import pandas as pd
import tempfile
import urllib.request
WD = os.path.join(tempfile.gettempdir(), "brainvol")
os.makedirs(WD, exist_ok=True)
#os.chdir(WD)
Fetch data
• Demographic data demo.csv (columns: participant_id, site, group, age, sex) and tissue
volume data: group is Control or Patient. site is the recruiting site.
• Gray matter volume gm.csv (columns: participant_id, session, gm_vol)
• White matter volume wm.csv (columns: participant_id, session, wm_vol)
• Cerebrospinal Fluid csf.csv (columns: participant_id, session, csf_vol)
base_url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/brain_volumes/%s'
data = dict()
os.makedirs(os.path.join(WD, "data"), exist_ok=True)  # make sure the target directory exists
for file in ["demo.csv", "gm.csv", "wm.csv", "csf.csv"]:
    urllib.request.urlretrieve(base_url % file, os.path.join(WD, "data", file))
Out:
brain_vol = brain_vol.dropna()
assert brain_vol.shape == (766, 9)
import os
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smfrmla
import statsmodels.api as sm
Descriptive statistics: most participants have several MRI sessions (column session). Select only the rows from the first session, “ses-01”.
desc_glob_num = brain_vol1.describe()
print(desc_glob_num)
Out:
                 gm_vol
          count  mean   std   min   25%   50%   75%   max
group
Control   86.00  0.72  0.09  0.48  0.66  0.71  0.78  1.03
Patient  157.00  0.70  0.08  0.53  0.65  0.70  0.76  0.90
4.2.3 Statistics
Objectives:
1. Site effect of gray matter atrophy
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
3. Test for differences of atrophy between the patients and the controls
4. Test for interaction between age and clinical status, i.e., is the brain atrophy process faster in the patient population than in the control population?
5. The effect of the medication in the patient population.
import statsmodels.api as sm
import statsmodels.formula.api as smfrmla
import scipy.stats
import seaborn as sns
Out:
<AxesSubplot:xlabel='site', ylabel='gm_f'>
Out:
print(sm.stats.anova_lm(anova, typ=2))
Out:
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
Plot
Out:
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Before testing for differences of atrophy between the patients and the controls, run preliminary tests for an age x group effect (patients could be older or younger than controls).
Plot
Out:
<AxesSubplot:xlabel='group', ylabel='age'>
print(scipy.stats.ttest_ind(brain_vol1_ctl.age, brain_vol1_pat.age))
Out:
Ttest_indResult(statistic=-1.2155557697674162, pvalue=0.225343592508479)
Out:
==============================================================================
Omnibus: 35.711 Durbin-Watson: 2.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 20.726
Skew: 0.569 Prob(JB): 3.16e-05
Kurtosis: 2.133 Cond. No. 3.12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Preliminary tests for sex x group (more/less males in patients than in Controls)
Out:
3. Test for differences of atrophy between the patients and the controls
Out:
sum_sq df F PR(>F)
group 0.00 1.00 0.01 0.92
Residual 0.46 241.00 nan nan
No significant difference in atrophy between patients and controls
print(sm.stats.anova_lm(smfrmla.ols(
"gm_f ~ group + age + site", data=brain_vol1).fit(), typ=2))
print("No significant difference in GM between patients and controls")
Out:
sum_sq df F PR(>F)
group 0.00 1.00 1.82 0.18
site 0.11 5.00 19.79 0.00
age 0.09 1.00 86.86 0.00
Residual 0.25 235.00 nan nan
No significant difference in GM between patients and controls
print("%.3f%% of grey matter loss per year (almost %.1f%% per decade)" %
(ancova.params.age * 100, ancova.params.age * 100 * 10))
Out:
sum_sq df F PR(>F)
site 0.11 5.00 20.28 0.00
age 0.10 1.00 89.37 0.00
group:age 0.00 1.00 3.28 0.07
Residual 0.25 235.00 nan nan
= Parameters =
Intercept 0.52
site[T.S3] 0.01
Acknowledgements: Firstly, it is right to give thanks to the blogs and sources I have used in writing this tutorial. Many parts of the text are quoted from the brilliant book by Brady T. West, Kathleen B. Welch and Andrzej T. Galecki; see [Brady et al. 2014] in the references section below.
4.3.1 Introduction
Quoted from [Brady et al. 2014]: A linear mixed model (LMM) is a parametric linear model for clustered, longitudinal, or repeated-measures data that quantifies the relationships between a continuous dependent variable and various predictor variables. An LMM may include both fixed-effect parameters associated with one or more continuous or categorical covariates and random effects associated with one or more random factors. The mix of fixed and random effects gives the linear mixed model its name. Whereas fixed-effect parameters describe the relationships of the covariates to the dependent variable for an entire population, random effects are specific to clusters or subjects within a population. The LMM is closely related to the hierarchical linear model (HLM).
Clustered/structured datasets
Quoted from [Bruin 2006]: Random effects are used when there is non-independence in the
data, such as arises from a hierarchical structure with clustered data. For example, students
could be sampled from within classrooms, or patients from within doctors. When there are
multiple levels, such as patients seen by the same doctor, the variability in the outcome can be
thought of as being either within group or between group. Patient level observations are not
independent, as within a given doctor patients are more similar. Units sampled at the highest
level (in our example, doctors) are independent.
The continuous outcome variable is structured or clustered into units within which observations are not independent. Types of clustered data:
1. studies with clustered data, such as students in classrooms, or experimental designs with
random blocks, such as batches of raw material for an industrial process
2. longitudinal or repeated-measures studies, in which subjects are measured repeatedly
over time or under different conditions.
Fixed effects may be associated with continuous covariates, such as weight, baseline test score,
or socioeconomic status, which take on values from a continuous (or sometimes a multivalued
ordinal) range, or with factors, such as gender or treatment group, which are categorical. Fixed
effects are unknown constant parameters associated with either continuous covariates or the
levels of categorical factors in an LMM. Estimation of these parameters in LMMs is generally of
intrinsic interest, because they indicate the relationships of the covariates with the continuous
outcome variable.
Example: Suppose we want to study the relationship between the height of individuals and their
gender. We will: sample individuals in a population (first source of randomness), measure their
height (second source of randomness), and consider their gender (fixed for a given individual).
Finally, these measures are modeled in the following linear model:
height𝑖 = 𝛽0 + 𝛽1 gender𝑖 + 𝜀𝑖
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_csv('datasets/score_parentedu_byclass.csv')
print(df.head())
_ = sns.scatterplot(x="edu", y="score", hue="classroom", data=df)
Global effect regresses the dependent variable 𝑦 = score on the independent variable 𝑥 = edu without considering any classroom effect. For each individual 𝑖 (in classroom 𝑗) the model is:
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$$
where 𝛽0 is the global intercept, 𝛽1 is the slope associated with edu and 𝜀𝑖𝑗 is the random error at the individual level. Note that the classroom index 𝑗 is not taken into account by the model and could be removed from the equation.
The general R formula is: y ~ x which in this case is score ~ edu. This model is:
• Not sensitive since it does not model the classroom effect (high standard error).
• Wrong because the residuals are not normal and it considers samples from the same classroom to be independent.
# Global model (the fitting line is assumed; it is not shown in this excerpt)
lm_glob = smf.ols('score ~ edu', df).fit()
#print(lm_glob.summary())
print(lm_glob.t_test('edu'))
print("MSE=%.3f" % lm_glob.mse_resid)
results.loc[len(results)] = ["LM-Global (biased)"] +\
    list(rmse_coef_tstat_pval(mod=lm_glob, var='edu'))
Plot
Model diagnosis: plot the normality of the residuals and residuals vs prediction.
plot_lm_diagnosis(residual=lm_glob.resid,
prediction=lm_glob.predict(df), group=df.classroom)
Remember ANCOVA = ANOVA with covariates. Model the classroom 𝑧 = classroom (as a fixed effect), i.e. a vertical shift for each classroom. The slope is the same for all classrooms. For each individual 𝑖 and each classroom 𝑗 the model is:
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j z_{ij} + \varepsilon_{ij}$$
where 𝑢𝑗 is the coefficient (an intercept, or a shift) associated with classroom 𝑗 and 𝑧𝑖𝑗 = 1 if subject 𝑖 belongs to classroom 𝑗, else 𝑧𝑖𝑗 = 0.
The general R formula is: y ~ x + z which in this case is score ~ edu + classroom.
This model is:
• Sensitive since it models the classroom effect (lower standard error). But,
• questionable because it considers the classroom to have a fixed constant effect without
any uncertainty. However, those classrooms have been sampled from a larger samples of
classrooms within the country.
print("MSE=%.3f" % ancova_inter.mse_resid)
results.loc[len(results)] = ["ANCOVA-Inter (biased)"] +\
list(rmse_coef_tstat_pval(mod=ancova_inter, var='edu'))
Plot
plot_ancova_oneslope_grpintercept(x="edu", y="score",
group="classroom", model=ancova_inter, df=df)
mod = ancova_inter
print("## Statistics:")
print(mod.tvalues, mod.pvalues)
plot_lm_diagnosis(residual=ancova_inter.resid,
prediction=ancova_inter.predict(df), group=df.classroom)
Fixed effect is the coefficient or parameter (𝛽1 in the model) that is associated with a continuous covariate (age, education level, etc.) or a (categorical) factor (sex, etc.) that is known without uncertainty once a subject is sampled.
Random effect, in contrast, is the coefficient or parameter (𝑢𝑗 in the model below) that is associated with a continuous covariate or factor (classroom, individual, etc.) that is not known without uncertainty once a subject is sampled. It generally corresponds to some random sampling. Here the classroom effect depends on the teacher, who has been sampled from a larger sample of classrooms within the country. Measures are structured by units or a clustering structure that is possibly hierarchical. Measures within units are not independent. Measures between top-level units are independent.
There are multiple ways to deal with structured data with random effect. One simple approach
is to aggregate.
Aggregation of measures at the classroom level: average all values within classrooms to perform the statistical analysis between classrooms. 1. Level 1 (within unit): average by classroom:
𝑦 𝑗 = 𝛽 0 + 𝛽 1 𝑥 𝑗 + 𝜀𝑗
agregate = df.groupby('classroom').mean()
lm_agregate = smf.ols('score ~ edu', agregate).fit()
#print(lm_agregate.summary())
print(lm_agregate.t_test('edu'))
print("MSE=%.3f" % lm_agregate.mse_resid)
results.loc[len(results)] = ["Aggregation"] +\
list(rmse_coef_tstat_pval(mod=lm_agregate, var='edu'))
Plot
agregate = agregate.reset_index()
fig, axes = plt.subplots(1, 2, figsize=(9, 3), sharex=True, sharey=True)
sns.scatterplot(x='edu', y='score', hue='classroom',
data=df, ax=axes[0], s=20, legend=False)
sns.scatterplot(x='edu', y='score', hue='classroom',
data=agregate, ax=axes[0], s=150)
axes[0].set_title("Level 1: Average within classroom")
Hierarchical/multilevel modeling
Another approach to hierarchical data is analyzing data from one unit at a time. Thus, we run three separate linear regressions, one for each classroom in the sample, leading to three estimated parameters of the score vs edu association. Then the parameters are tested across the classrooms:
1. Run three separate linear regressions, one for each classroom. The general R formula is: y ~ x which in this case is score ~ edu within classrooms (see the sketch after this list).
2. Test across the classrooms whether mean𝑗 (𝛽1𝑗 ) = 𝛽0 ̸= 0:
$$\beta_{1j} = \beta_0 + \varepsilon_j$$
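A minimal sketch of this two-step procedure, assuming df as above; the names lm_hm and df_betas are illustrative, since the original code is not shown in this excerpt:

import pandas as pd
import statsmodels.formula.api as smf

# Step 1: one regression per classroom, keep the edu slope of each
betas = []
for cls, dfg in df.groupby('classroom'):
    fit = smf.ols('score ~ edu', dfg).fit()
    betas.append([cls, fit.params['edu']])
df_betas = pd.DataFrame(betas, columns=['classroom', 'beta'])
print(df_betas)

# Step 2: test whether the mean slope differs from zero (intercept-only model)
lm_hm = smf.ols('beta ~ 1', df_betas).fit()
print(lm_hm.t_test('Intercept'))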
results.loc[len(results)] = ["Hierarchical"] + \
list(rmse_coef_tstat_pval(mod=lm_hm, var='Intercept'))
classroom beta
0 c0 0.129084
1 c1 0.177567
2 c2 0.055772
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Plot
Linear mixed models (also called multilevel models) can be thought of as a trade-off between these two alternatives. The individual regressions have many estimates and lots of data, but are noisy. The aggregate is less noisy, but may lose important differences by averaging all samples within each classroom. LMMs are somewhere in between.
Model the classroom 𝑧 = classroom as a random effect. For each individual 𝑖 and each classroom 𝑗 the model is:
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}, \quad u_j \sim \mathcal{N}(0, \sigma_u^2)$$
results.loc[len(results)] = ["LMM-Inter"] + \
list(rmse_coef_tstat_pval(mod=lmm_inter, var='edu'))
Explore model
print("Fixed effect:")
print(lmm_inter.params)
print("Random effect:")
print(lmm_inter.random_effects)
Fixed effect:
Intercept 9.865327
edu 0.131193
Group Var 10.844222
dtype: float64
Random effect:
{'c0': Group -2.889009
dtype: float64, 'c1': Group -0.323129
dtype: float64, 'c2': Group 3.212138
dtype: float64}
Plot
plot_lmm_oneslope_randintercept(x='edu', y='score',
group='classroom', df=df, model=lmm_inter)
Now suppose that the classroom random effect is not just a vertical shift (random intercept), but that some teachers “compensate” or “amplify” educational disparity. The slope of the linear relation between score and edu will be larger for teachers who amplify and smaller for teachers who compensate.
Model the classroom intercept and slope as fixed effects: ANCOVA with interactions
1. Model the global association between edu and score: 𝑦𝑖𝑗 = 𝛽0 + 𝛽1 𝑥𝑖𝑗 , in R: score ~ edu.
2. Model the classroom 𝑧𝑗 = classroom (as a fixed effect) as a vertical shift (intercept, 𝑢1𝑗 ) for each classroom 𝑗 indicated by 𝑧𝑖𝑗 : 𝑦𝑖𝑗 = 𝑢1𝑗 𝑧𝑖𝑗 , in R: score ~ classroom.
3. Model the classroom-specific slope (𝑢𝛼𝑗 ) as a fixed effect: 𝑦𝑖 = 𝑢𝛼𝑗 𝑥𝑖 𝑧𝑗 , in R: score ~ edu:classroom. The product 𝑥𝑖 𝑧𝑗 forms 3 new columns holding the values of 𝑥𝑖 for each classroom 𝑧𝑗 (classrooms 1, 2 and 3).
4. Put everything together:
4. Put everything together:
# Full fixed-effect model (the fitting line is assumed; not shown in this excerpt)
ancova_full = smf.ols('score ~ edu + classroom + edu:classroom', df).fit()
# print(sm.stats.anova_lm(ancova_full, typ=3))
# print(ancova_full.summary())
print(ancova_full.t_test('edu'))
print("MSE=%.3f" % ancova_full.mse_resid)
results.loc[len(results)] = ["ANCOVA-Full (biased)"] + \
    list(rmse_coef_tstat_pval(mod=ancova_full, var='edu'))
The graphical representation of the model would be the same as the one provided for “Model a classroom intercept as a fixed effect: ANCOVA”: the same slope (associated with edu) with different intercepts, depicted as dashed black lines. Moreover we added, as solid lines, the model’s predictions that account for the different slopes.
print("Model parameters:")
print(ancova_full.params)
plot_ancova_fullmodel(x='edu', y='score',
group='classroom', df=df, model=ancova_full)
Model parameters:
Intercept 6.973753
classroom[T.c1] 2.316540
classroom[T.c2] 6.578594
edu 0.129084
edu:classroom[T.c1] 0.048482
edu:classroom[T.c2] -0.073313
dtype: float64
but:
• 𝑢1𝑗 is a random intercept associated with classroom 𝑗, following the same normal distribution for all classrooms, 𝑢1𝑗 ∼ 𝒩 (0, 𝜎1 ).
• 𝑢𝛼𝑗 is a random slope associated with classroom 𝑗, following the same normal distribution for all classrooms, 𝑢𝛼𝑗 ∼ 𝒩 (0, 𝜎𝛼 ).
Note the difference with the linear model: the variance parameters (𝜎1 , 𝜎𝛼 ) must be estimated together with the fixed effects (𝛽0 , 𝛽1 ) and the random effects (𝑢1𝑗 , 𝑢𝛼𝑗 : one pair of random intercept/slope per classroom). The R notation is: score ~ edu + (edu | classroom), or score ~ 1 + edu + (1 + edu | classroom); remember that intercepts are implicit. In statsmodels, the notation is ~1+edu or ~edu, since the grouping is provided via the groups argument.
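A minimal statsmodels sketch of this random-intercept/random-slope model (the name lmm_full is illustrative; the original fitting code is not shown here, and fitting it reproduces warnings like those shown below):

import statsmodels.formula.api as smf

# Random intercept and random slope for edu, grouped by classroom
lmm_full = smf.mixedlm("score ~ edu", df,
                       groups=df["classroom"], re_formula="~edu").fit()
print(lmm_full.summary())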
/home/ed203246/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn(
/home/ed203246/anaconda3/lib/python3.8/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
The warning reflects a singular fit (correlation estimated at 1), caused by too little variance among the random slopes. It indicates that we should consider removing the random slopes.
print(results)
Random intercepts
1. LM-Global is wrong (it considers residuals to be independent) and has a large error (RMSE, Root Mean Square Error) since it does not adjust for the classroom effect.
2. ANCOVA-Inter is “wrong” (it considers residuals to be independent) but it has a small error since it adjusts for the classroom effect.
3. Aggregation is ok (unit averages are independent) but it loses a lot of degrees of freedom (df = 2 = 3 classrooms − 1 intercept) and a lot of information.
4. The hierarchical model is ok (unit averages are independent) and it has a reasonable error (look at the statistic, not the RMSE).
5. LMM-Inter (with random intercept) is ok (it models the non-independence of the residuals) and it has a small error.
6. ANCOVA-Inter, the hierarchical model and LMM provide similar coefficients for the fixed effect. So if statistical significance is not the key issue, the “biased” ANCOVA is a reasonable choice.
7. Hierarchical and LMM with random intercept are the best options (unbiased and sensitive), with an advantage to LMM.
Random slopes
Modeling individual slopes in both ANCOVA-Full and LMM-Full decreased the statistics, sug-
gesting that the supplementary regressors (one per classroom) do not significantly improve the
fit of the model (see errors).
If we consider only 6 samples (𝑖 ∈ {1, . . . , 6}, two samples for each classroom 𝑗 ∈ {c0, c1, c2}) and the random intercept model, stacking the 6 observations of the equation 𝑦𝑖𝑗 = 𝛽0 + 𝛽1 𝑥𝑖𝑗 + 𝑢𝑗 𝑧𝑗 + 𝜀𝑖𝑗 gives:
where u = [𝑢1 , 𝑢2 , 𝑢3 ] are the 3 parameters associated with the 3 levels of the single random factor classroom.
This can be re-written in a more general form as:
y = X𝛽 + Zu + 𝜀,
smaller than the estimated effects would be if they were computed by treating a random factor
as if it were fixed.
Overall covariance structure as variance components V
The overall covariance structure can be obtained by:
$$\mathbf{V} = \sum_k \sigma_k \mathbf{Z}\mathbf{Z}' + \mathbf{R}.$$
The term $\sum_k \sigma_k \mathbf{Z}\mathbf{Z}'$ defines the 𝑁 × 𝑁 variance structure, using 𝑘 variance components, modeling the non-independence between the observations. In our case with only one component we get:
$$
\mathbf{V} =
\begin{bmatrix}
\sigma_k & \sigma_k & 0 & 0 & 0 & 0\\
\sigma_k & \sigma_k & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_k & \sigma_k & 0 & 0\\
0 & 0 & \sigma_k & \sigma_k & 0 & 0\\
0 & 0 & 0 & 0 & \sigma_k & \sigma_k\\
0 & 0 & 0 & 0 & \sigma_k & \sigma_k
\end{bmatrix}
+
\begin{bmatrix}
\sigma & 0 & 0 & 0 & 0 & 0\\
0 & \sigma & 0 & 0 & 0 & 0\\
0 & 0 & \sigma & 0 & 0 & 0\\
0 & 0 & 0 & \sigma & 0 & 0\\
0 & 0 & 0 & 0 & \sigma & 0\\
0 & 0 & 0 & 0 & 0 & \sigma
\end{bmatrix}
=
\begin{bmatrix}
\sigma_k+\sigma & \sigma_k & 0 & 0 & 0 & 0\\
\sigma_k & \sigma_k+\sigma & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_k+\sigma & \sigma_k & 0 & 0\\
0 & 0 & \sigma_k & \sigma_k+\sigma & 0 & 0\\
0 & 0 & 0 & 0 & \sigma_k+\sigma & \sigma_k\\
0 & 0 & 0 & 0 & \sigma_k & \sigma_k+\sigma
\end{bmatrix}
$$
$$\hat{\beta} = (\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\,\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{y}$$
In the general case, V is unknown, therefore iterative solvers must be used to estimate the fixed effects 𝛽 and the parameters (𝜎, 𝜎𝑘 , . . .) of the variance-covariance matrix V. The ML (Maximum Likelihood) estimates provide a biased solution for V because they do not take into account the loss of degrees of freedom that results from estimating the fixed-effect parameters in 𝛽. For this reason, REML (restricted (or residual, or reduced) maximum likelihood) is often preferred to ML estimation.
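In statsmodels this choice is controlled by the reml argument of MixedLM.fit(); a short sketch, reusing the score/classroom example above:

import statsmodels.formula.api as smf

lmm = smf.mixedlm("score ~ edu", df, groups=df["classroom"])
fit_reml = lmm.fit(reml=True)   # REML (default): corrects the bias in the variance components
fit_ml = lmm.fit(reml=False)    # ML: biased variance components, but likelihoods are comparable
print(fit_reml.cov_re, "\n", fit_ml.cov_re)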
Tests for Fixed-Effect Parameters
Quoted from [Brady et al. 2014]: “The approximate methods that apply to both t-tests and F-tests take into account the presence of random effects and correlated residuals in an LMM. Several of these approximate methods (e.g., the Satterthwaite method, or the “between-within” method) involve different choices for the degrees of freedom used in the approximate t-tests and F-tests.”
4.3.7 References
• Brady et al. 2014: Brady T. West, Kathleen B. Welch, Andrzej T. Galecki, Linear Mixed
Models: A Practical Guide Using Statistical Software (2nd Edition), 2014
• Bruin 2006: Introduction to Linear Mixed Models, UCLA, Statistical Consulting Group.
• Statsmodel: Linear Mixed Effects Models
• Comparing R lmer to statsmodels MixedLM
• Statsmodels: Variance Component Analysis with nested groups
Multivariate statistics includes all statistical techniques for analyzing samples made of two or
more variables. The data set (a 𝑁 × 𝑃 matrix X) is a collection of 𝑁 independent samples
column vectors [x1 , . . . , x𝑖 , . . . , x𝑁 ] of length 𝑃
$$
\mathbf{X} =
\begin{bmatrix}
-\mathbf{x}_1^T- \\ \vdots \\ -\mathbf{x}_i^T- \\ \vdots \\ -\mathbf{x}_N^T-
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1P} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{iP} \\
\vdots &        & \vdots &        & \vdots \\
x_{N1} & \cdots & x_{Nj} & \cdots & x_{NP}
\end{bmatrix}_{N \times P}
$$
Source: Wikipedia
Algebraic definition
The dot product, denoted “·”, of two 𝑃-dimensional vectors a = [𝑎1 , 𝑎2 , ..., 𝑎𝑃 ] and b = [𝑏1 , 𝑏2 , ..., 𝑏𝑃 ] is defined as
$$
\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^T \mathbf{b} = \sum_i a_i b_i =
\begin{bmatrix} a_1 & \ldots & a_P \end{bmatrix}
\begin{bmatrix} b_1 \\ \vdots \\ b_P \end{bmatrix}.
$$
The Euclidean norm of a vector can be computed using the dot product, as
$$\|\mathbf{a}\|_2 = \sqrt{\mathbf{a} \cdot \mathbf{a}}.$$
If two vectors are orthogonal (the angle between them is 90°), then
$$\mathbf{a} \cdot \mathbf{b} = 0.$$
At the other extreme, if they are codirectional, then the angle between them is 0° and
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|_2 \|\mathbf{b}\|_2, \qquad \mathbf{a} \cdot \mathbf{a} = \|\mathbf{a}\|_2^2.$$
The scalar projection (or scalar component) of a Euclidean vector a in the direction of a Eu-
clidean vector b is given by
𝑎𝑏 = ‖a‖2 cos 𝜃,
Fig. 4: Projection.
import numpy as np
np.random.seed(42)
a = np.random.randn(10)
b = np.random.randn(10)
np.dot(a, b)
-4.085788532659924
• The covariance matrix ΣXX is a symmetric positive semi-definite matrix whose element
in the 𝑗, 𝑘 position is the covariance between the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ elements of a random vector
i.e. the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ columns of X.
• The covariance matrix generalizes the notion of covariance to multiple dimensions.
• The covariance matrix describes the shape of the sample distribution around the mean, assuming an elliptical distribution:
where
$$s_{jk} = s_{kj} = \frac{1}{N-1} \mathbf{x}_j^T \mathbf{x}_k = \frac{1}{N-1} \sum_{i=1}^{N} x_{ij} x_{ik}$$
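As a quick illustration (a sketch with made-up data, not part of the original example), the sample covariance matrix of a data matrix X with samples in rows can be computed with NumPy:

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)          # 100 samples, 3 variables
Sigma = np.cov(X, rowvar=False)      # unbiased (1 / (N - 1)) covariance estimate
# Equivalent to the formula above applied to centered columns:
Xc = X - X.mean(axis=0)
assert np.allclose(Sigma, Xc.T @ Xc / (X.shape[0] - 1))
print(Sigma.round(2))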
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils

np.random.seed(42)
colors = sns.color_palette()

# Hypothetical class means and covariances (the original values are not shown in this excerpt)
n_samples = 100
mean = [np.array([0, 0]), np.array([3, 3])]
Cov = [np.array([[1., .8], [.8, 1.]]), np.array([[1., -.8], [-.8, 1.]])]
X = [None] * len(mean)

# Generate dataset
for i in range(len(mean)):
    X[i] = np.random.multivariate_normal(mean[i], Cov[i], n_samples)
# Plot
for i in range(len(mean)):
# Points
plt.scatter(X[i][:, 0], X[i][:, 1], color=colors[i], label="class %i" % i)
# Means
plt.scatter(mean[i][0], mean[i][1], marker="o", s=200, facecolors='w',
edgecolors=colors[i], linewidth=2)
# Ellipses representing the covariance matrices
pystatsml.plot_utils.plot_cov_ellipse(Cov[i], pos=mean[i], facecolor='none',
linewidth=2, edgecolor=colors[i])
plt.axis('equal')
_ = plt.legend(loc='upper left')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
# Correlation matrix of the numeric columns (assumed; the original line is not shown)
corr = df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(5.5, 4.5))
cmap = sns.color_palette("RdBu_r", 11)
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(corr, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
lab=0
print(clusters)
reordered = np.concatenate(clusters)
R = corr.loc[reordered, reordered]
f, ax = plt.subplots(figsize=(5.5, 4.5))
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(R, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
[['mpg', 'cyl', 'disp', 'hp', 'wt', 'qsec', 'vs', 'carb'], ['am', 'gear'], ['drat
˓→']]
In statistics, precision is the reciprocal of the variance, and the precision matrix is the matrix
inverse of the covariance matrix.
It is related to partial correlations, which measure the degree of association between two variables while controlling for the effect of the other variables.
import numpy as np
print(Pcor.round(2))
# Precision matrix:
[[ 6.79 -3.21 -3.21 0. 0. 0. ]
[-3.21 6.79 -3.21 0. 0. 0. ]
[-3.21 -3.21 6.79 0. 0. 0. ]
[ 0. -0. -0. 5.26 -4.74 -0. ]
[ 0. 0. 0. -4.74 5.26 0. ]
[ 0. 0. 0. 0. 0. 1. ]]
# Partial correlations:
[[ nan 0.47 0.47 -0. -0. -0. ]
[ nan nan 0.47 -0. -0. -0. ]
[ nan nan nan -0. -0. -0. ]
[ nan nan nan nan 0.9 0. ]
[ nan nan nan nan nan -0. ]
[ nan nan nan nan nan nan]]
• The Mahalanobis distance is a measure of the distance between two points x and 𝜇 where
the dispersion (i.e. the covariance structure) of the samples is taken into account.
• The dispersion is considered through covariance matrix.
This is formally expressed as
$$D_M(\mathbf{x}, \mu) = \sqrt{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}.$$
Intuitions
• Distances along the principal directions of dispersion are contracted since they correspond
to likely dispersion of points.
• Distances orthogonal to the principal directions of dispersion are dilated since they correspond to unlikely dispersion of points.
For example:
$$D_M(\mathbf{1}) = \sqrt{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}.$$
ones = np.ones(Cov.shape[0])
d_euc = np.sqrt(np.dot(ones, ones))
d_mah = np.sqrt(np.dot(np.dot(ones, Prec), ones))
The first dot product shows that distances along the principal directions of dispersion are contracted:
print(np.dot(ones, Prec))
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
%matplotlib inline
np.random.seed(40)
colors = sns.color_palette()
Covi = scipy.linalg.inv(Cov)
dm_m_x1 = scipy.spatial.distance.mahalanobis(mean, x1, Covi)
dm_m_x2 = scipy.spatial.distance.mahalanobis(mean, x2, Covi)
# Plot distances
vm_x1 = (x1 - mean) / d2_m_x1
plt.legend(loc='lower right')
plt.text(-6.1, 3,
'Euclidian: d(m, x1) = %.1f<d(m, x2) = %.1f' % (d2_m_x1, d2_m_x2),␣
˓→color='k')
plt.text(-6.1, 3.5,
'Mahalanobis: d(m, x1) = %.1f>d(m, x2) = %.1f' % (dm_m_x1, dm_m_x2),␣
˓→color='r')
plt.axis('equal')
print('Euclidian d(m, x1) = %.2f < d(m, x2) = %.2f' % (d2_m_x1, d2_m_x2))
print('Mahalanobis d(m, x1) = %.2f > d(m, x2) = %.2f' % (dm_m_x1, dm_m_x2))
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Eu-
clidean distance. If the covariance matrix is diagonal, then the resulting distance measure is
called a normalized Euclidean distance.
More generally, the Mahalanobis distance is a measure of the distance between a point x and a
distribution 𝒩 (x|𝜇, Σ). It is a multi-dimensional generalization of the idea of measuring how
many standard deviations away x is from the mean. This distance is zero if x is at the mean,
and grows as x moves away from the mean: along each principal component axis, it measures
the number of standard deviations from x to the mean of the distribution.
The distribution, or probability density function (PDF) (sometimes just density), of a continuous
random variable is a function that describes the relative likelihood for this random variable to
take on a given value.
The multivariate normal distribution, or multivariate Gaussian distribution, of a 𝑃 -dimensional
random vector x = [𝑥1 , 𝑥2 , . . . , 𝑥𝑃 ]𝑇 is
$$\mathcal{N}(\mathbf{x}|\mu, \Sigma) = \frac{1}{(2\pi)^{P/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}.$$
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D
def multivariate_normal_pdf(X, mean, sigma):
    """Multivariate normal probability density function over X (n_samples x n_features).
    (The function signature is assumed; only the body appears in this excerpt.)"""
    P = X.shape[1]
    det = np.linalg.det(sigma)
    norm_const = 1.0 / (((2*np.pi) ** (P/2)) * np.sqrt(det))
    X_mu = X - mean
    inv = np.linalg.inv(sigma)
    d2 = np.sum(np.dot(X_mu, inv) * X_mu, axis=1)
    return norm_const * np.exp(-0.5 * d2)
# Hypothetical mean and covariance (the original values are not shown in this excerpt)
mu = mean = np.array([0, 0])
sigma = np.array([[1, -.5], [-.5, 1]])

# x, y grid
x, y = np.mgrid[-3:3:.1, -3:3:.1]
X = np.stack((x.ravel(), y.ravel())).T
norm = multivariate_normal_pdf(X, mean, sigma).reshape(x.shape)
# Do it with scipy
norm_scpy = multivariate_normal(mu, sigma).pdf(np.stack((x, y), axis=2))
assert np.allclose(norm, norm_scpy)
# Plot
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
4.4.8 Exercises
Two libraries:
• Pandas: https://pandas.pydata.org/pandas-docs/stable/timeseries.html
• statsmodels: http://www.statsmodels.org/devel/tsa.html
4.5.1 Stationarity
A TS is said to be stationary if its statistical properties such as mean, variance remain constant
over time.
• constant mean
• constant variance
• an autocovariance that does not depend on time.
What makes a TS non-stationary? There are two major reasons behind the non-stationarity of a TS:
1. Trend – varying mean over time. For example, in this case we saw that, on average, the number of passengers was growing over time.
2. Seasonality – variations at specific time-frames, e.g. people might have a tendency to buy cars in a particular month because of a pay increment or festivals.
Stationarity can also be checked with a formal test, as in the sketch below.
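A short sketch of such a check, using the augmented Dickey-Fuller unit-root test from statsmodels on synthetic series (not part of the original text):

import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
trend = np.cumsum(np.random.randn(200)) + 0.5 * np.arange(200)  # non-stationary (trend)
noise = np.random.randn(200)                                    # stationary

for name, series in [("trend", trend), ("noise", noise)]:
    stat, pvalue = adfuller(series)[:2]
    print("%s: ADF statistic=%.2f, p-value=%.3f" % (name, stat, pvalue))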
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# String as index
prices = {'apple': 4.99,
'banana': 1.99,
'orange': 3.99}
ser = pd.Series(prices)
print(ser)
apple 4.99
source: https://www.datacamp.com/community/tutorials/time-series-analysis-tutorial
Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over time
while learning about trends and seasonality in time series data.
In the Facebook Live code along session on the 4th of January, we checked out Google trends
data of keywords ‘diet’, ‘gym’ and ‘finance’ to see how they vary over time. We asked ourselves
if there could be more searches for these terms in January when we’re all trying to turn over a
new leaf?
In this tutorial, you’ll go through the code that we put together during the session step by step.
You’re not going to do much mathematics but you are going to do the following:
• Read data
• Recode data
• Exploratory Data Analysis
try:
url = "https://raw.githubusercontent.com/datacamp/datacamp_facebook_live_ny_
˓→resolution/master/datasets/multiTimeline.csv"
df = pd.read_csv(url, skiprows=2)
except:
df = pd.read_csv("../datasets/multiTimeline.csv", skiprows=2)
print(df.head())
# Rename columns
df.columns = ['month', 'diet', 'gym', 'finance']
# Describe
print(df.describe())
Next, you’ll turn the ‘month’ column into a DateTime data type and make it the index of the
DataFrame.
Note that you do this because you saw in the result of the .info() method that the ‘Month’ column was actually of data type object. Now, that generic data type encapsulates everything from strings to integers, etc. That’s not exactly what you want when you are looking at time series data. That’s why you’ll use .to_datetime() to convert the ‘month’ column in your DataFrame to a DateTime.
Be careful! Make sure to include the inplace argument when you’re setting the index of the
DataFrame df so that you actually alter the original index and set it to the ‘month’ column.
df.month = pd.to_datetime(df.month)
df.set_index('month', inplace=True)
print(df.head())
You can use a built-in pandas visualization method .plot() to plot your data as 3 line plots on a
single figure (one for each column, namely, ‘diet’, ‘gym’, and ‘finance’).
df.plot()
plt.xlabel('Year');
Note that this data is relative. As you can read on Google trends:
Numbers represent search interest relative to the highest point on the chart for the given region
and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term
is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the
peak.
Rolling average: for each time point, take the average of the points within a window around it. Note that the number of points is specified by a window size.
Remove seasonality by resampling with pandas Series.
See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html (‘A’ is the ‘year end frequency’ alias)
Year frequency
diet = df['diet']
diet_resamp_yr = diet.resample('A').mean()
diet_roll_yr = diet.rolling(12).mean()
# Plotting (assumed; the original plotting lines are not shown in this excerpt)
ax = diet.plot(alpha=0.5, label='diet')
diet_resamp_yr.plot(ax=ax, style=':', label='yearly resample')
diet_roll_yr.plot(ax=ax, style='--', label='12-month rolling')
ax.legend()
<matplotlib.legend.Legend at 0x7f670db34a10>
x = np.asarray(df[['diet']])
win = 12
win_half = int(win / 2)
# print([((idx-win_half), (idx+win_half)) for idx in np.arange(win_half, len(x))])

# Manual rolling mean over a window of size win (assumed; the original line is not shown)
diet_smooth = np.array([x[(idx - win_half):(idx + win_half)].mean()
                        for idx in np.arange(win_half, len(x))])
plt.plot(diet_smooth)
[<matplotlib.lines.Line2D at 0x7f670cbbb1d0>]
gym = df['gym']
Text(0.5, 0, 'Year')
Detrending
Text(0.5, 0, 'Year')
df.diff().plot()
plt.xlabel('Year')
Text(0.5, 0, 'Year')
df.plot()
plt.xlabel('Year');
print(df.corr())
‘diet’ and ‘gym’ are negatively correlated! Remember that you have a seasonal and a trend component. From the correlation coefficient, ‘diet’ and ‘gym’ are negatively correlated:
• the trend components are negatively correlated,
• the seasonal components would be positively correlated.
The overall correlation coefficient captures both of these effects.
Seasonal correlation: correlation of the first-order differences of these time series
df.diff().plot()
plt.xlabel('Year');
print(df.diff().corr())
from statsmodels.tsa.seasonal import seasonal_decompose

x = gym
# Decompose into trend, seasonal and residual components (assumed; the original call is not shown)
decomposition = seasonal_decompose(x, model='additive', period=12)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(x, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
4.5.10 Autocorrelation
A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Autocorrelation Function (ACF): a measure of the correlation between the TS and a lagged version of itself. For instance at lag 5, the ACF compares the series at time instants t1 … t2 with the series at instants t1−5 … t2−5.
Plot
x = df["diet"].astype(float)
autocorrelation_plot(x)
<AxesSubplot:xlabel='Lag', ylabel='Autocorrelation'>
ACF peaks every 12 months: Time series is correlated with itself shifted by 12 months.
4.5.11 Time series forecasting with Python using Autoregressive Moving Average
(ARMA) models
Source:
• https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/
9781783553358/7/ch07lvl1sec77/arma-models
• http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model
• ARIMA: https://www.analyticsvidhya.com/blog/2016/02/
time-series-forecasting-codes-python/
ARMA models are often used to forecast a time series. These models combine autoregressive
and moving average models. In moving average models, we assume that a variable is the sum
of the mean of the time series and a linear combination of noise components.
The autoregressive and moving average models can have different orders. In general, we can
define an ARMA model with p autoregressive terms and q moving average terms as follows:
$$x_t = \sum_{i=1}^{p} a_i x_{t-i} + \sum_{i=1}^{q} b_i \varepsilon_{t-i} + \varepsilon_t$$
Choosing p and q
Plot the partial autocorrelation functions for an estimate of p, and likewise using the autocorre-
lation functions for an estimate of q.
Partial Autocorrelation Function (PACF): This measures the correlation between the TS with a
lagged version of itself but after eliminating the variations already explained by the intervening
comparisons. Eg at lag 5, it will check the correlation but remove the effects already explained
by lags 1 to 4.
x = df["gym"].astype(float)
#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.title('Autocorrelation Function (q=1)')
#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function (p=1)')
plt.tight_layout()
In this plot, the two dotted lines on either side of 0 are the confidence intervals. These can be used to determine the p and q values as:
• p: The lag value where the PACF chart crosses the upper confidence interval for the first
time, in this case p=1.
• q: The lag value where the ACF chart crosses the upper confidence interval for the first
time, in this case q=1.
1. Define the model by calling ARMA() and passing in the p and q parameters.
2. The model is prepared on the training data by calling the fit() function.
3. Predictions can be made by calling the predict() function and specifying the index of the time or times to be predicted (see the sketch below).
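A minimal sketch of these steps. The old statsmodels.tsa.arima_model.ARMA API used in the original session is deprecated (see the warnings below), so an equivalent ARIMA(p, 0, q) specification is shown here as an assumption:

from statsmodels.tsa.arima.model import ARIMA

# ARMA(p=1, q=1) on the 'gym' series, i.e. ARIMA with differencing order d=0
model = ARIMA(x, order=(1, 0, 1)).fit()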
print(model.summary())
plt.plot(x)
plt.plot(model.predict(), color='red')
plt.title('RSS: %.4f'% sum((model.fittedvalues-x)**2))
/home/ed203246/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/arima_model.py:472: FutureWarning:
To silence this warning and continue using ARMA and ARIMA until they are
removed, use:

import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA', FutureWarning)
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARIMA', FutureWarning)

  warnings.warn(ARIMA_DEPRECATION_WARN, FutureWarning)
/home/ed203246/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:527: ValueWarning: No frequency information was provided, so inferred frequency will be used.
  % freq, ValueWarning)
FIVE
MACHINE LEARNING
5.1.1 Introduction
In machine learning and statistics, dimensionality reduction or dimension reduction is the pro-
cess of reducing the number of features under consideration, and can be divided into feature
selection (not addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and builds derived values (fea-
tures) intended to be informative and non-redundant, facilitating the subsequent learning and
generalization steps, and in some cases leading to better human interpretations. Feature extrac-
tion is related to dimensionality reduction.
The input matrix X, of dimension 𝑁 × 𝑃 , is
$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & \ldots & x_{1P} \\
\vdots &        & \vdots \\
x_{N1} & \ldots & x_{NP}
\end{bmatrix}
$$
where the rows represent the samples and columns represent the variables. The goal is to learn
a transformation that extracts a few relevant features.
Models:
1. Linear matrix decomposition/factorisation SVD/PCA. Those models exploit the covariance
ΣXX between the input features.
2. Non-linear models based on manifold learning: Isomap, t-SNE (introduced later in this chapter).
Decompose the data matrix X𝑁 ×𝑃 into a product of a mixing matrix U𝑁 ×𝐾 and a dictionary
matrix V𝑃 ×𝐾 .
X = UV𝑇 ,
X ≈ X̂ = UV𝑇 ,
X = UDV𝑇 ,
where
$$
\underbrace{\begin{bmatrix} x_{11} & & x_{1P} \\ & \mathbf{X} & \\ x_{N1} & & x_{NP} \end{bmatrix}}_{N \times P}
=
\underbrace{\begin{bmatrix} u_{11} & & u_{1K} \\ & \mathbf{U} & \\ u_{N1} & & u_{NK} \end{bmatrix}}_{N \times K}
\underbrace{\begin{bmatrix} d_1 & & 0 \\ & \mathbf{D} & \\ 0 & & d_K \end{bmatrix}}_{K \times K}
\underbrace{\begin{bmatrix} v_{11} & & v_{1P} \\ & \mathbf{V}^T & \\ v_{K1} & & v_{KP} \end{bmatrix}}_{K \times P}.
$$
V: right-singular vectors
• V = [v1 , · · · , v𝐾 ] is a 𝑃 × 𝐾 orthogonal matrix.
• It is a dictionary of patterns to be combined (according to the mixing coefficients) to
reconstruct the original samples.
• V perfoms the initial rotations (projection) along the 𝐾 = min(𝑁, 𝑃 ) principal compo-
nent directions, also called loadings.
• Each v𝑗 performs the linear combination of the variables that has maximum sample vari-
ance, subject to being uncorrelated with the previous v𝑗−1 .
D: singular values
V transforms correlated variables (X) into a set of uncorrelated ones (UD) that better expose
the various relationships among the original data items.
X = UDV𝑇 , (5.1)
XV = UDV𝑇 V, (5.2)
XV = UDI, (5.3)
XV = UD (5.4)
At the same time, SVD is a method for identifying and ordering the dimensions along which
data points exhibit the most variation.
import numpy as np
import scipy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
# SVD of the data (assumed; the decomposition line is not shown in this excerpt)
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
print(X.shape)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(U[:, 0], U[:, 1], s=50)
plt.axis('equal')
plt.title("U: Rotated and scaled data")
plt.subplot(132)
# Project data
PC = np.dot(X, Vh.T)
plt.scatter(PC[:, 0], PC[:, 1], s=50)
plt.axis('equal')
plt.title("XV: Rotated data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], s=50)
for i in range(Vh.shape[0]):
plt.arrow(x=0, y=0, dx=Vh[i, 0], dy=Vh[i, 1], head_width=0.2,
head_length=0.2, linewidth=2, fc='r', ec='r')
plt.text(Vh[i, 0], Vh[i, 1],'v%i' % (i+1), color="r", fontsize=15,
horizontalalignment='right', verticalalignment='top')
plt.axis('equal')
plt.ylim(-4, 4)
plt.tight_layout()
(100, 2)
Sources:
• C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
• Everything you did and didn’t know about PCA
• Principal Component Analysis in 3 Simple Steps
Principles
• Principal components analysis is the main method used for linear dimension reduction.
• The idea of principal component analysis is to find the 𝐾 principal components di-
rections (called the loadings) V𝐾×𝑃 that capture the variation in the data as much as
possible.
• It converts a set of 𝑁 𝑃-dimensional observations X𝑁 ×𝑃 of possibly correlated variables into a set of 𝑁 𝐾-dimensional samples C𝑁 ×𝐾 , where 𝐾 < 𝑃 . The new variables are linearly uncorrelated. The columns of C𝑁 ×𝐾 are called the principal components.
• The dimension reduction is obtained by using only 𝐾 < 𝑃 components that exploit corre-
lation (covariance) among the original variables.
• PCA is mathematically defined as an orthogonal linear transformation V𝐾×𝑃 that trans-
forms the data to a new coordinate system such that the greatest variance by some projec-
tion of the data comes to lie on the first coordinate (called the first principal component),
the second greatest variance on the second coordinate, and so on.
C𝑁 ×𝐾 = X𝑁 ×𝑃 V𝑃 ×𝐾
• PCA can be thought of as fitting a 𝑃 -dimensional ellipsoid to the data, where each axis of
the ellipsoid represents a principal component. If some axis of the ellipse is small, then the
variance along that axis is also small, and by omitting that axis and its corresponding prin-
cipal component from our representation of the dataset, we lose only a commensurately
small amount of information.
• Finding the 𝐾 largest axes of the ellipse will permit to project the data onto a space having
dimensionality 𝐾 < 𝑃 while maximizing the variance of the projected data.
Dataset preprocessing
Centering
Consider a data matrix, X , with column-wise zero empirical mean (the sample mean of each
column has been shifted to zero), ie. X is replaced by X − 1x̄𝑇 .
Standardizing
Optionally, standardize the columns, i.e., scale them by their standard-deviation. Without stan-
dardization, a variable with a high variance will capture most of the effect of the PCA. The
principal direction will be aligned with this variable. Standardization will, however, raise noise variables to the same level as informative variables.
The covariance matrix of centered standardized data is the correlation matrix.
To begin with, consider the projection onto a one-dimensional space (𝐾 = 1). We can define
the direction of this space using a 𝑃 -dimensional vector v, which for convenience (and without
loss of generality) we shall choose to be a unit vector so that ‖v‖2 = 1 (note that we are only
interested in the direction defined by v, not in the magnitude of v itself). PCA consists of two
mains steps:
Projection in the directions that capture the greatest variance
Each 𝑃 -dimensional data point x𝑖 is then projected onto v, where the coordinate (in the co-
ordinate system of v) is a scalar value, namely x𝑇𝑖 v. I.e., we want to find the vector v that
maximizes these coordinates along v, which we will see corresponds to maximizing the vari-
ance of the projected data. This is equivalently expressed as
$$\mathbf{v} = \arg\max_{\|\mathbf{v}\|=1} \frac{1}{N} \sum_i \left( \mathbf{x}_i^T \mathbf{v} \right)^2.$$
where S𝐗𝐗 is a biased estimate of the covariance matrix of the data, i.e.
$$\mathbf{S_{XX}} = \frac{1}{N} \mathbf{X}^T \mathbf{X}.$$
We now maximize the projected variance v𝑇 S𝐗𝐗 v with respect to v. Clearly, this has to be a constrained maximization to prevent ‖v‖₂ → ∞. The appropriate constraint comes from the
normalization condition ‖v‖2 ≡ ‖v‖22 = v𝑇 v = 1. To enforce this constraint, we introduce a
Lagrange multiplier that we shall denote by 𝜆, and then make an unconstrained maximization
of
By setting the gradient with respect to v equal to zero, we see that this quantity has a stationary
point when
SXX v = 𝜆v.
v𝑇 SXX v = 𝜆,
and so the variance will be at a maximum when v is equal to the eigenvector corresponding to
the largest eigenvalue, 𝜆. This eigenvector is known as the first principal component.
We can define additional principal components in an incremental fashion by choosing each new
direction to be that which maximizes the projected variance amongst all possible directions that
are orthogonal to those already considered. If we consider the general case of a 𝐾-dimensional
projection space, the optimal linear projection for which the variance of the projected data is
maximized is now defined by the 𝐾 eigenvectors, v1 , . . . , vK , of the data covariance matrix
SXX that corresponds to the 𝐾 largest eigenvalues, 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 𝜆𝐾 .
Back to SVD
$$\mathbf{X}^T\mathbf{X} = (\mathbf{U}\mathbf{D}\mathbf{V}^T)^T(\mathbf{U}\mathbf{D}\mathbf{V}^T) = \mathbf{V}\mathbf{D}^T\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$$
$$\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V} = \mathbf{D}^2$$
$$\frac{1}{N-1}\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V} = \frac{1}{N-1}\mathbf{D}^2$$
$$\mathbf{V}^T\mathbf{S_{XX}}\mathbf{V} = \frac{1}{N-1}\mathbf{D}^2.$$
Considering only the 𝑘-th right-singular vector $\mathbf{v}_k$ associated with the singular value $d_k$:
$$\mathbf{v}_k^T\mathbf{S_{XX}}\mathbf{v}_k = \frac{1}{N-1}d_k^2,$$
It turns out that if you have done the singular value decomposition then you already have the eigenvalue decomposition of X𝑇 X, where:
• the eigenvectors of S𝐗𝐗 are equivalent to the right-singular vectors V of X;
• the eigenvalues 𝜆𝑘 of S𝐗𝐗 , i.e. the variances of the components, are equal to $\frac{1}{N-1}$ times the squared singular values 𝑑𝑘 .
Moreover, computing PCA with the SVD does not require forming the matrix X𝑇 X, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.
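A small numerical check of this equivalence (a sketch on random centered data, not from the original text):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 5)
X -= X.mean(axis=0)                      # centering
N = X.shape[0]

# Eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(X.T @ X / (N - 1))
evals = evals[::-1]                      # sort in decreasing order

# SVD of the data matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Variances of the components equal d_k^2 / (N - 1)
assert np.allclose(evals, d ** 2 / (N - 1))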
PCA outputs
The SVD or the eigendecomposition of the data covariance matrix provides three main quanti-
ties:
1. Principal component directions or loadings are the eigenvectors of X𝑇 X. The V𝐾×𝑃
or the right-singular vectors of an SVD of X are called principal component directions of
X. They are generally computed using the SVD of X.
2. Principal components is the 𝑁 × 𝐾 matrix C which is obtained by projecting X onto the
principal components directions, i.e.
C𝑁 ×𝐾 = X𝑁 ×𝑃 V𝑃 ×𝐾 .
Since X = UDV𝑇 and V is orthogonal (V𝑇 V = I):
$$\mathbf{C}_{N\times K} = \mathbf{U}\mathbf{D}\mathbf{V}^T \mathbf{V}_{P\times K} \qquad (5.5)$$
$$\mathbf{C}_{N\times K} = \mathbf{U}\mathbf{D}\,\mathbf{I}_{K\times K} \qquad (5.6)$$
$$\mathbf{C}_{N\times K} = \mathbf{U}\mathbf{D} \qquad (5.7)$$
Thus c𝑗 = Xv𝑗 = u𝑗 𝑑𝑗 , for 𝑗 = 1, . . . 𝐾. Hence u𝑗 is simply the projection of the row vectors of
X, i.e., the input predictor vectors, on the direction v𝑗 , scaled by 𝑑𝑗 .
$$
\mathbf{c}_1 =
\begin{bmatrix}
x_{1,1} v_{1,1} + \ldots + x_{1,P} v_{1,P} \\
x_{2,1} v_{1,1} + \ldots + x_{2,P} v_{1,P} \\
\vdots \\
x_{N,1} v_{1,1} + \ldots + x_{N,P} v_{1,P}
\end{bmatrix}
$$
$$var(\mathbf{c}_k) = \frac{1}{N-1} (\mathbf{X}\mathbf{v}_k)^2 \qquad (5.9)$$
$$= \frac{1}{N-1} (\mathbf{u}_k d_k)^2 \qquad (5.10)$$
$$= \frac{1}{N-1} d_k^2 \qquad (5.11)$$
We must choose 𝐾 * ∈ [1, . . . , 𝐾], the number of required components. This can be done by
calculating the explained variance ratio of the 𝐾 * first components and by choosing 𝐾 * such
that the cumulative explained variance ratio is greater than some given threshold (e.g., ≈
90%). This is expressed as
$$\text{cumulative explained variance}(\mathbf{c}_k) = \frac{\sum_{j}^{K^*} var(\mathbf{c}_j)}{\sum_{j}^{K} var(\mathbf{c}_j)}.$$
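With scikit-learn this ratio is available directly; a short sketch on random data (the 90% threshold is the illustrative choice mentioned above):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.randn(100, 5)

pca = PCA(n_components=5).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
K_star = int(np.argmax(cumvar >= 0.90) + 1)  # smallest K* reaching 90% explained variance
print(cumvar.round(3), "K* =", K_star)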
PCs
Plot the samples projected onto the first principal components, e.g. PC1 against PC2.
PC directions
Exploring the loadings associated with a component provides the contribution of each original
variable in the component.
Remark: The loadings (PC directions) are the coefficients of multiple regression of PC on origi-
nal variables:
c = Xv (5.12)
𝑇 𝑇
X c = X Xv (5.13)
𝑇 −1 𝑇
(X X) X c=v (5.14)
Another way to evaluate the contribution of the original variables in each PC can be obtained
by computing the correlation between the PCs and the original variables, i.e. columns of X,
denoted x𝑗 , for 𝑗 = 1, . . . , 𝑃 . For the 𝑘 𝑡ℎ PC, compute and plot the correlations with all original
variables
$$cor(\mathbf{c}_k, \mathbf{x}_j), \quad j = 1, \ldots, P.$$
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])

# Fit a 2-component PCA and project the data (assumed; the fitting lines are not shown here)
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
PC = pca.transform(X)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1"); plt.ylabel("x2")
plt.subplot(122)
[0.93646607 0.06353393]
from sklearn import datasets, decomposition

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)
# Utils function
def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
plt.figure(figsize=(2. * n_col, 2.26 * n_row))
plt.suptitle(title, size=16)
for i, comp in enumerate(images):
plt.subplot(n_row, n_col, i + 1)
vmax = max(comp.max(), -comp.min())
plt.imshow(comp.reshape(image_shape), cmap=cmap,
interpolation='nearest',
vmin=-vmax, vmax=vmax)
plt.xticks(())
plt.yticks(())
plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
Preprocessing
# Load the Olivetti faces (assumed; the original loading code is not shown in this excerpt)
from sklearn.datasets import fetch_olivetti_faces
faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=1)
n_samples, n_features = faces.shape
# global centering
faces_centered = faces - faces.mean(axis=0)
# local centering
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)
pca = decomposition.PCA(n_components=n_components)
pca.fit(faces_centered)
plot_gallery("PCA first %i loadings" % n_components, pca.components_[:n_
˓→components])
5.1.5 Exercises
• Print the 𝐾 principal components directions and correlations of the 𝐾 principal compo-
nents with the original variables. Interpret the contribution of the original variables into
the PC.
• Plot the samples projected into the 𝐾 first PCs.
• Color samples by their species.
Sources:
• Scikit-learn documentation
• Wikipedia
Nonlinear dimensionality reduction, or manifold learning, covers unsupervised methods that attempt to identify low-dimensional manifolds within the original 𝑃-dimensional space that
represent high data density. Then those methods provide a mapping from the high-dimensional
space to the low-dimensional embedding.
Resources:
• http://www.stat.pitt.edu/sungkyu/course/2221Fall13/lec8_mds_combined.pdf
• https://en.wikipedia.org/wiki/Multidimensional_scaling
• Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer, Second Edition.
The purpose of MDS is to find a low-dimensional projection of the data in which the pairwise distances between data points are preserved, as closely as possible (in a least-squares sense).
• Let D be the (𝑁 × 𝑁 ) pairwise distance matrix where 𝑑𝑖𝑗 is a distance between points 𝑖
and 𝑗.
• The MDS concept can be extended to a wide variety of data types specified in terms of a
similarity matrix.
Given the dissimilarity (distance) matrix D𝑁 ×𝑁 = [𝑑𝑖𝑗 ], MDS attempts to find 𝐾-dimensional
projections of the 𝑁 points x1 , . . . , x𝑁 ∈ R𝐾 , concatenated in an X𝑁 ×𝐾 matrix, so that 𝑑𝑖𝑗 ≈
‖x𝑖 − x𝑗 ‖ are as close as possible. This can be obtained by the minimization of a loss function
called the stress function
$$\text{stress}(\mathbf{X}) = \sum_{i \neq j} (d_{ij} - \|\mathbf{x}_i - \mathbf{x}_j\|)^2.$$
The Sammon mapping performs better at preserving small distances compared to the least-
squares scaling.
Example
The eurodist dataset provides the road distances (in kilometers) between 21 cities in Europe.
Given this matrix of pairwise (non-Euclidean) distances D = [𝑑𝑖𝑗 ], MDS can be used to recover
the coordinates of the cities in some Euclidean referential whose orientation is arbitrary.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# url points to the eurodist dataset (its definition is not shown in this excerpt)
df = pd.read_csv(url)
print(df.iloc[:5, :5])
city = df["city"]
D = np.array(df.iloc[:, 1:])  # Distance matrix

# Metric MDS on the precomputed dissimilarity matrix (assumed parameters)
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=40)
X = mds.fit_transform(D)

for i in range(len(city)):
    plt.text(X[i, 0], X[i, 1], city[i])
plt.axis('equal')
(-1894.0919178069155,
2914.3554370871234,
-1712.9733697197494,
2145.437068788015)
We must choose 𝐾* ∈ {1, . . . , 𝐾}, the number of required components. Plot the values of the stress function obtained using 𝑘 ≤ 𝑁 − 1 components. In general, start with 1, . . . , 𝐾 ≤ 4. Choose 𝐾* where you can clearly distinguish an elbow in the stress curve.
Thus, in the plot below, we choose to retain information accounted for by the first two compo-
nents, since this is where the elbow is in the stress curve.
# Stress values for an increasing number of components (assumed; the original loop is not shown)
k_range = range(1, min(5, D.shape[0] - 1))
stress = [MDS(dissimilarity='precomputed', n_components=k, random_state=42).fit(D).stress_
          for k in k_range]

print(stress)
plt.plot(k_range, stress)
plt.xlabel("k")
plt.ylabel("stress")
Exercises
5.2.2 Isomap
5.2.3 t-SNE
Sources:
• Wikipedia
• scikit-learn
Principles
1. Construct a (Gaussian) probability distribution between pairs of objects in the input (high-
dimensional) space.
2. Construct a (Student t) probability distribution between pairs of objects in the embedded
(low-dimensional) space.
3. Minimize the Kullback–Leibler divergence (KL divergence) between the two distributions.
Features
• Isomap, LLE and variants are best suited to unfold a single continuous low dimensional
manifold
• t-SNE will focus on the local structure of the data and will tend to extract clustered local
groups of samples
from sklearn import manifold, datasets

# The dataset generation and the t-SNE fit are not shown in this extract; a natural
# setup is the classical S-curve example:
X, color = datasets.make_s_curve(n_samples=1000, random_state=42)
X_tsne = manifold.TSNE(n_components=2, init='pca', random_state=0).fit_transform(X)

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(131, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)
plt.title('2D "S shape" manifold in 3D')

ax = fig.add_subplot(133)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("t-SNE")
plt.xlabel("First component")
plt.ylabel("Second component")
plt.axis('tight')
5.2.4 Exercises
Run Manifold learning on handwritten digits: Locally Linear Embedding, Isomap with scikit-
learn
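A minimal sketch for this exercise is given below; the choice of a 6-class subset of the digits and n_neighbors=30 are illustrative assumptions, not prescriptions.

from sklearn import datasets, manifold
import matplotlib.pyplot as plt

# Load a subset of the handwritten digits (8x8 images, 64 features)
X, y = datasets.load_digits(n_class=6, return_X_y=True)

# Locally Linear Embedding and Isomap, both mapping to 2 components
X_lle = manifold.LocallyLinearEmbedding(n_neighbors=30, n_components=2).fit_transform(X)
X_iso = manifold.Isomap(n_neighbors=30, n_components=2).fit_transform(X)

# Plot the two embeddings, colored by digit label
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_lle[:, 0], X_lle[:, 1], c=y, cmap=plt.cm.Spectral, s=10)
axes[0].set_title("LLE")
axes[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y, cmap=plt.cm.Spectral, s=10)
axes[1].set_title("Isomap")
plt.show()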
5.3 Clustering
Wikipedia: Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense or another)
to each other than to those in other groups (clusters). Clustering is one of the main tasks of
exploratory data mining, and a common technique for statistical data analysis, used in many
fields, including machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.
Sources: http://scikit-learn.org/stable/modules/clustering.html
The K-means objective (distortion) function is

J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \|x_i - \mu_k\|^2,

where r_{ik} ∈ {0, 1} indicates whether sample 𝑖 is assigned to cluster 𝑘. It represents the sum
of the squares of the Euclidean distances of each data point to its assigned vector 𝜇𝑘.
Our goal is to find values for the {𝑟𝑖𝑘 } and the {𝜇𝑘 } so as to minimize the
function 𝐽. We can do this through an iterative procedure in which each iteration involves two
successive steps corresponding to successive optimizations with respect to the 𝑟𝑖𝑘 and the 𝜇𝑘
. First we choose some initial values for the 𝜇𝑘 . Then in the first phase we minimize 𝐽 with
respect to the 𝑟𝑖𝑘 , keeping the 𝜇𝑘 fixed. In the second phase we minimize 𝐽 with respect to
the 𝜇𝑘 , keeping 𝑟𝑖𝑘 fixed. This two-stage optimization process is then repeated until conver-
gence. We shall see that these two stages of updating 𝑟𝑖𝑘 and 𝜇𝑘 correspond respectively to the
expectation (E) and maximization (M) steps of the expectation-maximisation (EM) algorithm,
and to emphasize this we shall use the terms E step and M step in the context of the 𝐾-means
algorithm.
Consider first the determination of the 𝑟𝑖𝑘 . Because 𝐽 in is a linear function of 𝑟𝑖𝑘 , this opti-
mization can be performed easily to give a closed form solution. The terms involving different
𝑖 are independent and so we can optimize for each 𝑖 separately by choosing 𝑟𝑖𝑘 to be 1 for
whichever value of 𝑘 gives the minimum value of ||𝑥𝑖 − 𝜇𝑘 ||2 . In other words, we simply assign
the 𝑖th data point to the closest cluster centre. More formally, this can be expressed as
r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_i - \mu_j\|^2 \\ 0 & \text{otherwise.} \end{cases}    (5.15)

Now consider the optimization of the 𝜇𝑘 with the 𝑟𝑖𝑘 held fixed. The objective function 𝐽 is a
quadratic function of 𝜇𝑘 , and it can be minimized by setting its derivative with respect to 𝜇𝑘 to
zero, giving

2 \sum_i r_{ik} (x_i - \mu_k) = 0, \quad \text{which we can solve for } \mu_k: \quad \mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}.
The denominator in this expression is equal to the number of points assigned to cluster 𝑘, and so
this result has a simple interpretation, namely set 𝜇𝑘 equal to the mean of all of the data points
𝑥𝑖 assigned to cluster 𝑘. For this reason, the procedure is known as the 𝐾-means algorithm.
The two phases of re-assigning data points to clusters and re-computing the cluster means are
repeated in turn until there is no further change in the assignments (or until some maximum
number of iterations is exceeded). Because each phase reduces the value of the objective func-
tion 𝐽, convergence of the algorithm is assured. However, it may converge to a local rather than
global minimum of 𝐽.
from sklearn import cluster, datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data[:, :2]  # use only 'sepal length' and 'sepal width'
y_iris = iris.target

km2 = cluster.KMeans(n_clusters=2).fit(X)
km3 = cluster.KMeans(n_clusters=3).fit(X)
km4 = cluster.KMeans(n_clusters=4).fit(X)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=km2.labels_)
plt.title("K=2, J=%.2f" % km2.inertia_)
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=km3.labels_)
plt.title("K=3, J=%.2f" % km3.inertia_)
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=km4.labels_)#.astype(np.float))
plt.title("K=4, J=%.2f" % km4.inertia_)
Exercises
1. Analyse clusters
• Analyse the plot above visually. What would a good value of 𝐾 be?
• If you instead consider the inertia, the value of 𝐽, what would a good value of 𝐾 be?
• Explain why there is such difference.
• For 𝐾 = 2, why did 𝐾-means clustering not find the two “natural” clusters? See the
assumptions of 𝐾-means in the sklearn documentation.
Write a function kmeans(X, K) that return an integer vector of the samples’ labels.
The Gaussian mixture model (GMM) is a simple linear superposition of Gaussian components
over the data, aimed at providing a rich class of density models. We turn to a formulation of
Gaussian mixtures in terms of discrete latent variables: the 𝐾 hidden classes to be discovered.
Differences compared to 𝐾-means:
• Whereas the 𝐾-means algorithm performs a hard assignment of data points to clusters, in
which each data point is associated uniquely with one cluster, the GMM algorithm makes
a soft assignment based on posterior probabilities.
• Whereas the classic 𝐾-means is only based on Euclidean distances, the classic GMM uses
a Mahalanobis distance that can deal with non-spherical distributions. (Note that the
Mahalanobis distance could also be plugged into an improved version of 𝐾-means clustering.)
The Mahalanobis distance is unitless and scale-invariant, and takes into account the
correlations of the data set.
The GMM density is given by the mixture

p(x) = \sum_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k),

where:
• The 𝑝(𝑘) are the mixing coefficients, also known as the class probability of class 𝑘, and they
sum to one: \sum_{k=1}^{K} p(k) = 1.

To compute the class parameters 𝑝(𝑘), 𝜇𝑘 , Σ𝑘 we sum over all samples, weighting each
sample 𝑖 by its responsibility or contribution to class 𝑘, 𝑝(𝑘 | 𝑥𝑖 ), such that for each point the
contributions to all classes sum to one: \sum_k p(k \mid x_i) = 1. This contribution is the conditional
probability of class 𝑘 given 𝑥, 𝑝(𝑘 | 𝑥) (sometimes called the posterior). It can be computed
using Bayes’ rule:

p(k \mid x) = \frac{p(x \mid k)\, p(k)}{p(x)}    (5.16)
= \frac{\mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k)}{\sum_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)\, p(k)}    (5.17)
Since the class parameters, 𝑝(𝑘), 𝜇𝑘 and Σ𝑘 , depend on the responsibilities 𝑝(𝑘 | 𝑥) and the
responsibilities depend on class parameters, we need a two-step iterative algorithm: the
expectation-maximization (EM) algorithm. We discuss this algorithm next.
The expectation-maximization (EM) algorithm for Gaussian mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect
to the parameters (comprised of the means and covariances of the components and the mixing
coefficients).
Initialize the means 𝜇𝑘 , covariances Σ𝑘 and mixing coefficients 𝑝(𝑘)
1. E step. For each sample 𝑖, evaluate the responsibilities for each class 𝑘 using the current
parameter values:

p(k \mid x_i) = \frac{\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k)}{\sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k)}

2. M step. For each class, re-estimate the parameters using the current responsibilities:

\mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} p(k \mid x_i)\, x_i    (5.18)

\Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^{N} p(k \mid x_i)\,(x_i - \mu_k^{new})(x_i - \mu_k^{new})^T    (5.19)

p^{new}(k) = \frac{N_k}{N},    (5.20)

where N_k = \sum_{i=1}^{N} p(k \mid x_i) is the effective number of points assigned to class 𝑘.

3. Evaluate the log-likelihood

\sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, p(k) \right\},
and check for convergence of either the parameters or the log-likelihood. If the convergence
criterion is not satisfied return to step 1.
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns  # nice color
import sklearn
from sklearn.mixture import GaussianMixture

import pystatsml.plot_utils

colors = sns.color_palette()

iris = datasets.load_iris()
X = iris.data[:, :2]  # 'sepal length (cm)', 'sepal width (cm)'
y_iris = iris.target

# Fit GMMs with 2, 3 and 4 components
gmm2 = GaussianMixture(n_components=2, covariance_type='full').fit(X)
gmm3 = GaussianMixture(n_components=3, covariance_type='full').fit(X)
gmm4 = GaussianMixture(n_components=4, covariance_type='full').fit(X)

plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm2.predict(X)])
for i in range(gmm2.covariances_.shape[0]):
    # styling arguments of the original plot_cov_ellipse call are omitted
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm2.covariances_[i, :], pos=gmm2.means_[i, :])

plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm3.predict(X)])
for i in range(gmm3.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm3.covariances_[i, :], pos=gmm3.means_[i, :])

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm4.predict(X)])
for i in range(gmm4.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm4.covariances_[i, :], pos=gmm4.means_[i, :])
Models of covariance: see the covariance_type parameter in the sklearn documentation.
K-means is almost a GMM with spherical covariance.
In statistics, the Bayesian information criterion (BIC) is a criterion for model selection among
a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the
likelihood function and it is closely related to the Akaike information criterion (AIC).
X = iris.data
y_iris = iris.target

ks = np.arange(1, 10)  # candidate numbers of components (grid is illustrative)

bic = list()
for k in ks:
    gmm = GaussianMixture(n_components=k, covariance_type='full')
    gmm.fit(X)
    bic.append(gmm.bic(X))
k_chosen = ks[np.argmin(bic)]
plt.plot(ks, bic)
plt.xlabel("k")
plt.ylabel("BIC")
Choose k= 2
Hierarchical clustering is an approach to clustering that builds hierarchies of clusters, following
two main approaches:
• Agglomerative: A bottom-up strategy, where each observation starts in their own cluster,
and pairs of clusters are merged upwards in the hierarchy.
• Divisive: A top-down strategy, where all observations start out in the same cluster, and
then the clusters are split recursively downwards in the hierarchy.
In order to decide which clusters to merge or to split, a measure of dissimilarity between clusters
is introduced. More specifically, this comprises a distance measure and a linkage criterion. The
distance measure is just what it sounds like, and the linkage criterion is essentially a function of
the distances between points, for instance the minimum distance between points in two clusters,
the maximum distance between points in two clusters, the average distance between points in
two clusters, etc. One particular linkage criterion, the Ward criterion, will be discussed next.
Ward clustering
Ward clustering belongs to the family of agglomerative hierarchical clustering algorithms. This
means that they are based on a “bottom-up” approach: each sample starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.
In Ward clustering, the criterion for choosing the pair of clusters to merge at each step is the
minimum variance criterion. Ward’s minimum variance criterion minimizes the total within-
cluster variance by each merge. To implement this method, at each step: find the pair of
clusters that leads to minimum increase in total within-cluster variance after merging. This
increase is a weighted squared distance between cluster centers.
The main advantage of agglomerative hierarchical clustering over 𝐾-means clustering is that
you can benefit from known neighborhood information, for example, neighboring pixels in an
image.
from sklearn import cluster, datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data[:, :2]  # 'sepal length (cm)', 'sepal width (cm)'
y_iris = iris.target

# Agglomerative (Ward) clustering with 2, 3 and 4 clusters
ward2 = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
ward3 = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)
ward4 = cluster.AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)

plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=ward2.labels_)
plt.title("K=2")
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=ward3.labels_)
plt.title("K=3")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=ward4.labels_) # .astype(np.float))
plt.title("K=4")
5.3.5 Exercises
Perform clustering of the iris dataset based on all variables using Gaussian mixture models. Use
PCA to visualize clusters.
Linear regression models the output, or target variable 𝑦 ∈ R, as a linear combination of the 𝑃 -
dimensional input x ∈ R𝑃 . Let X be the 𝑁 × 𝑃 matrix with each row an input vector (with a 1
in the first position), and similarly let y be the 𝑁 -dimensional vector of outputs in the training
set. The linear model predicts y given X using the parameter vector, or weight vector,
w ∈ R𝑃 according to
y = Xw + 𝜀,
where 𝜀 ∈ R𝑁 are the residuals, or the errors of the prediction. The w is found by minimizing
an objective function, which is the loss function, 𝐿(w), i.e. the error measured on the data.
This error is the sum of squared errors (SSE) loss.
Minimizing the SSE yields Ordinary Least Squares (OLS) regression, whose analytic solution is:

w_{OLS} = (X^T X)^{-1} X^T y.

The gradient of the loss is:

\frac{\partial L(w, X, y)}{\partial w} = 2 \sum_i x_i (x_i \cdot w - y_i)
Scikit-learn offers many models for supervised learning, and they all follow the same application
programming interface (API), namely:
model = Estimator()
model.fit(X, y)
predictions = model.predict(X)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(precision=2)
pd.set_option('precision', 2)
Linear regression of the Advertising.csv dataset with TV and Radio advertising as input features
and Sales as target. The linear model that minimizes the MSE is a plane (2 input features)
defined as Sales = 0.05 TV + 0.19 Radio + 3. A minimal sketch of this fit is given below.
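The sketch below assumes the Advertising.csv file from the course material is available locally and that its columns are named TV, Radio and Sales; both the path and the column names are assumptions.

import pandas as pd
import sklearn.linear_model as lm

# Assumed local copy of the Advertising dataset (path and column names are assumptions)
advert = pd.read_csv("Advertising.csv")
X = advert[["TV", "Radio"]]
y = advert["Sales"]

lr = lm.LinearRegression().fit(X, y)
# Coefficients of the fitted plane, approximately 0.05 TV + 0.19 Radio + 3 as stated above
print(lr.coef_, lr.intercept_)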
5.4.3 Overfitting
In statistics and machine learning, overfitting occurs when a statistical model describes random
errors or noise instead of the underlying relationships. Overfitting generally occurs when a
model is excessively complex, such as having too many parameters relative to the number
of observations. A model that has been overfit will generally have poor predictive performance,
as it can exaggerate minor fluctuations in the data.
A learning algorithm is trained using some set of training samples. If the learning algorithm has
the capacity to overfit the training samples the performance on the training sample set will
improve while the performance on unseen test sample set will decline.
The overfitting phenomenon has three main explanations:
• excessively complex models,
• multicollinearity, and
• high dimensionality.
Model complexity
Complex learners with too many parameters relative to the number of observations may overfit
the training dataset.
Multicollinearity
Predictors are highly correlated, meaning that one can be linearly predicted from the others.
In this situation the coefficient estimates of the multiple regression may change erratically in
response to small changes in the model or the data. Multicollinearity does not reduce the
predictive power or reliability of the model as a whole, at least not within the sample data
set; it only affects computations regarding individual predictors. That is, a multiple regression
model with correlated predictors can indicate how well the entire bundle of predictors predicts
the outcome variable, but it may not give valid results about any individual predictor, or about
which predictors are redundant with respect to others. In case of perfect multicollinearity the
predictor matrix is singular and therefore cannot be inverted. Under these circumstances, for a
general linear model y = Xw + 𝜀, the ordinary least-squares estimator, w𝑂𝐿𝑆 = (X𝑇 X)−1 X𝑇 y,
does not exist.
An example where correlated predictor may produce an unstable model follows: We want to
predict the business potential (pb) of some companies given their business volume (bv) and the
taxes (tx) they are paying. Here pb ~ 10% of bv. However, taxes = 20% of bv (tax and bv
are highly collinear), therefore there is an infinite number of linear combinations of tax and bv
that lead to the same prediction. Solutions with very large coefficients will produce excessively
large predictions.
# Hypothetical example: bv (business volume) and tax are perfectly collinear
bv = np.random.randn(100) * 100 + 1000   # business volume
tax = .2 * bv                            # taxes: 20% of bv
X = np.column_stack([bv, tax])
beta_star = np.array([.1, 0])  # true solution
'''
Since tax and bv are correlated, there is an infinite number of linear combinations
of tax and bv that lead to the same prediction.
'''
Possible remedies include:
• Feature selection: select a small number of features. See: Isabelle Guyon and André
Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning
Research, 2003.
• Feature selection: select a small number of features using ℓ1 shrinkage.
• Extract few independent (uncorrelated) features using e.g. principal components analysis
(PCA), partial least squares regression (PLS-R) or regression methods that cut the number
of predictors to a smaller set of uncorrelated components.
High dimensionality
High dimensions means a large number of input features. Linear predictor associate one pa-
rameter to each input feature, so a high-dimensional situation (𝑃 , number of features, is large)
with a relatively small number of samples 𝑁 (so-called large 𝑃 small 𝑁 situation) generally
lead to an overfit of the training data. Thus it is generally a bad idea to add many input features
into the learner. This phenomenon is called the curse of dimensionality.
One of the most important criteria to use when choosing a learning algorithm is based on the
relative size of 𝑃 and 𝑁 .
• Remember that the “covariance” matrix X𝑇 X used in the linear model is a 𝑃 × 𝑃 matrix of
rank min(𝑁, 𝑃 ). Thus if 𝑃 > 𝑁 the equation system is overparameterized and admits an
infinity of solutions that might be specific to the learning dataset. See also ill-conditioned
or singular matrices.
• The sampling density of 𝑁 samples in a 𝑃 -dimensional space is proportional to 𝑁^{1/𝑃}.
Thus a high-dimensional space becomes very sparse, leading to poor estimations of sample
densities. To preserve a constant density, an exponential growth in the number of
observations is required: 50 points in 1D would require 2 500 points in 2D and 125 000
in 3D!
• Another consequence of the sparse sampling in high dimensions is that all sample points
are close to an edge of the sample. Consider 𝑁 data points uniformly distributed in a
𝑃 -dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor
estimate at the origin. The median distance from the origin to the closest data point is
given by the expression:

d(P, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/P}.

A more complicated expression exists for the mean distance to the closest point. For N = 500,
P = 10, 𝑑(𝑃, 𝑁 ) ≈ 0.52 (see the quick numeric check after this list), more than halfway to the
boundary. Hence most data points are
closer to the boundary of the sample space than to any other data point. The reason that
this presents a problem is that prediction is much more difficult near the edges of the training
sample. One must extrapolate from neighboring sample points rather than interpolate between
them. (Source: T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Second Edition, 2009.)
• Structural risk minimization provides a theoretical background of this phenomenon. (See
VC dimension.)
• See also bias–variance trade-off.
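A quick numeric check of the median-distance expression above (a small hypothetical helper):

import numpy as np

def median_dist_to_closest(P, N):
    """Median distance from the origin to the closest of N points
    uniformly distributed in the P-dimensional unit ball."""
    return (1 - 0.5 ** (1 / N)) ** (1 / P)

print(median_dist_to_closest(P=10, N=500))  # ~0.52: more than halfway to the boundary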
Regarding linear models, overfitting generally leads to excessively complex solutions (coeffi-
cient vectors), accounting for noise or spurious correlations within predictors. Regularization
aims to alleviate this phenomenon by constraining (biasing or reducing) the capacity of the
learning algorithm in order to promote simple solutions. Regularization penalizes “large” solu-
tions forcing the coefficients to be small, i.e. to shrink them toward zeros.
The objective function 𝐽(w) to minimize with respect to w is composed of a loss function 𝐿(w)
for goodness-of-fit and a penalty term Ω(w) (regularization to avoid overfitting). This is a
trade-off where the respective contribution of the loss and the penalty terms is controlled by
the regularization parameter 𝜆.
Therefore the loss function 𝐿(w) is combined with a penalty function Ω(w), leading to the
general form:

J(w) = L(w) + \lambda\, \Omega(w),

where the respective contribution of the loss and the penalty is controlled by the regularization
parameter 𝜆.

For regression problems the loss is the SSE given by:

L(w) = SSE(w) = \sum_i^N (y_i - x_i^T w)^2 = \|y - Xw\|_2^2
# Simulated regression dataset with correlated features (low effective rank).
# Only effective_rank=3 and coef=True were visible in the original call; the
# remaining parameters are assumptions:
X, y, coef = datasets.make_regression(n_samples=100, n_features=10, n_informative=5,
                                      noise=10, random_state=42,
                                      effective_rank=3, coef=True)
Ridge regression imposes an ℓ2 penalty on the coefficients, i.e. it penalizes with the Euclidean
norm of the coefficients while minimizing SSE. The objective function becomes:

Ridge(w) = \sum_i^N (y_i - x_i^T w)^2 + \lambda \|w\|_2^2    (5.25)
= \|y - Xw\|_2^2 + \lambda \|w\|_2^2.    (5.26)

The w that minimises Ridge(w) can be found by the following derivation:

\nabla_w Ridge(w) = 0    (5.27)
\nabla_w \left( (y - Xw)^T (y - Xw) + \lambda w^T w \right) = 0    (5.28)
\nabla_w \left( y^T y - 2 w^T X^T y + w^T X^T X w + \lambda w^T w \right) = 0    (5.29)
-2 X^T y + 2 X^T X w + 2 \lambda w = 0    (5.30)
-X^T y + (X^T X + \lambda I) w = 0    (5.31)
(X^T X + \lambda I) w = X^T y    (5.32)
w = (X^T X + \lambda I)^{-1} X^T y    (5.33)
• The solution adds a positive constant to the diagonal of X𝑇 X before inversion. This makes
the problem nonsingular, even if X𝑇 X is not of full rank, and was the main motivation
behind ridge regression.
• Increasing 𝜆 shrinks the w coefficients toward 0.
• This approach penalizes the objective function by the Euclidean (ℓ2) norm of the
coefficients such that solutions with large coefficients become unattractive.

The gradient of the loss:

\frac{\partial L(w, X, y)}{\partial w} = 2 \left( \sum_i x_i (x_i \cdot w - y_i) + \lambda w \right)
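As a sanity check of the closed-form solution (5.33), a minimal sketch comparing it with scikit-learn's Ridge on simulated data (the simulation settings are assumptions):

import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm

X, y = datasets.make_regression(n_samples=100, n_features=10, random_state=42)
lambda_ = 10.0

# Closed form: w = (X'X + lambda I)^-1 X'y
w_closed = np.linalg.solve(X.T @ X + lambda_ * np.eye(X.shape[1]), X.T @ y)

# scikit-learn Ridge without intercept, so both solve the same problem
ridge = lm.Ridge(alpha=lambda_, fit_intercept=False).fit(X, y)
print(np.allclose(w_closed, ridge.coef_))  # True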
Lasso regression penalizes the coefficients by the ℓ1 norm. This constraint will reduce (bias)
the capacity of the learning algorithm. To add such a penalty forces the coefficients to be small,
i.e. it shrinks them toward zero. The objective function to minimize becomes:
Lasso(w) = \sum_i^N (y_i - x_i^T w)^2 + \lambda \|w\|_1.    (5.34)
This penalty forces some coefficients to be exactly zero, providing a feature selection property.
Occam’s razor
Occam’s razor (also written as Ockham’s razor, and lex parsimoniae in Latin, which means
law of parsimony) is a problem solving principle attributed to William of Ockham (1287-1347),
who was an English Franciscan friar and scholastic philosopher and theologian. The principle
can be interpreted as stating that among competing hypotheses, the one with the fewest
assumptions should be selected.
Principle of parsimony
The penalty based on the ℓ1 norm promotes sparsity (i.e. solutions that are not dense): it forces
many coefficients to be exactly zero, so that only a few entries of the coefficient vector are non-zero.
The figure bellow illustrates the OLS loss under a constraint acting on the ℓ1 norm of the coef-
ficient vector. I.e., it illustrates the following optimization problem:
\min_w \|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_1 \leq 1.
Optimization issues
Section to be completed
• No more closed-form solution.
• Convex but not differentiable.
• Requires specific optimization algorithms, such as the fast iterative shrinkage-thresholding
algorithm (FISTA): Amir Beck and Marc Teboulle, A Fast Iterative Shrinkage-Thresholding
Algorithm for Linear Inverse Problems SIAM J. Imaging Sci., 2009.
The ridge penalty shrinks the coefficients toward zero. The figure illustrates the OLS solution
on the left, the ℓ1 and ℓ2 penalties in the middle pane, and the penalized OLS in the right pane.
The right pane shows how the penalties shrink the coefficients toward zero. The black points
are the minima found in each case, and the white points represent the true solution used to
generate the data.
The Elastic-net estimator combines the ℓ1 and ℓ2 penalties, and results in the problem of
minimizing

Enet(w) = \sum_i^N (y_i - x_i^T w)^2 + \alpha \left( \rho \|w\|_1 + (1 - \rho) \|w\|_2^2 \right),    (5.35)

Rationale
• If there are groups of highly correlated variables, Lasso tends to arbitrarily select only
one from each group. These models are difficult to interpret because covariates that are
strongly associated with the outcome are not included in the predictive model. Conversely,
the elastic net encourages a grouping effect, where strongly correlated predictors tend to
be in or out of the model together.
• Studies on real world data and simulation studies show that the elastic net often outper-
forms the lasso, while enjoying a similar sparsity of representation.
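A minimal sketch of this grouping effect on two perfectly correlated predictors (a toy simulation; all settings are assumptions):

import numpy as np
import sklearn.linear_model as lm

np.random.seed(42)
n = 100
x1 = np.random.randn(n)
x2 = x1.copy()                    # x2 is perfectly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * np.random.randn(n)

# Lasso tends to keep only one of the two correlated predictors
print(lm.Lasso(alpha=0.1).fit(X, y).coef_)
# Elastic net tends to spread the weight over both (grouping effect)
print(lm.ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)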
R-squared
The goodness of fit of a statistical model describes how well it fits a set of observations. Mea-
sures of goodness of fit typically summarize the discrepancy between observed values and the
values expected under the model in question. We will consider the explained variance also
known as the coefficient of determination, denoted 𝑅2 pronounced R-squared.
The total sum of squares, 𝑆𝑆tot, is the sum of the sum of squares explained by the regression,
𝑆𝑆reg, plus the sum of squares of residuals unexplained by the regression, 𝑆𝑆res (also called the
SSE), i.e. such that

SS_{tot} = SS_{reg} + SS_{res}.

The mean of 𝑦 is

\bar{y} = \frac{1}{n} \sum_i y_i.

The total sum of squares is the total squared sum of deviations from the mean of 𝑦, i.e.

SS_{tot} = \sum_i (y_i - \bar{y})^2.

The regression sum of squares, also called the explained sum of squares, is

SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2,

where \hat{y}_i is the estimated (predicted) value of 𝑦𝑖 given 𝑥𝑖.

The sum of squares of the residuals (SSE, Sum Squared Error), also called the residual sum of
squares (RSS), is:

SS_{res} = \sum_i (y_i - \hat{y}_i)^2.

𝑅2, the coefficient of determination, is the variance explained by the regression divided by the
total variance, i.e.

R^2 = \frac{\text{explained SS}}{\text{total SS}} = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}.

Test

Let \hat{\sigma}^2 = SS_{res}/(n - 2) be an estimator of the variance of 𝜖. The 2 in the denominator stems
from the 2 estimated parameters: intercept and coefficient.

• Unexplained variance: \frac{SS_{res}}{\hat{\sigma}^2} \sim \chi^2_{n-2}

• Explained variance: \frac{SS_{reg}}{\hat{\sigma}^2} \sim \chi^2_1. The single degree of freedom comes from the difference
between \frac{SS_{tot}}{\hat{\sigma}^2} (\sim \chi^2_{n-1}) and \frac{SS_{res}}{\hat{\sigma}^2} (\sim \chi^2_{n-2}), i.e. (n - 1) - (n - 2) = 1 degree of freedom.

The Fisher statistic is the ratio of the two variances:

F = \frac{\text{Explained variance}}{\text{Unexplained variance}} = \frac{SS_{reg}/1}{SS_{res}/(n - 2)} \sim F(1, n - 2)
Using the 𝐹 -distribution, compute the probability of observing a value greater than 𝐹 under
𝐻0 , i.e.: 𝑃 (𝑥 > 𝐹 |𝐻0 ), i.e. the survival function (1 − Cumulative Distribution Function) at 𝑥 of
the given 𝐹 -distribution.
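A minimal sketch of this F-test on simulated simple-regression data (the simulation settings are assumptions):

import numpy as np
import scipy.stats as stats

np.random.seed(42)
n = 50
x = np.random.randn(n)
y = 2 * x + np.random.randn(n)

# Fit a simple regression and decompose the sums of squares
beta, beta0 = np.polyfit(x, y, deg=1)
yhat = beta * x + beta0
ss_reg = np.sum((yhat - y.mean()) ** 2)
ss_res = np.sum((y - yhat) ** 2)

# F statistic and its p-value: survival function of the F(1, n-2) distribution
fval = (ss_reg / 1) / (ss_res / (n - 2))
pval = stats.f.sf(fval, 1, n - 2)
print("F=%.2f, p-value=%.2e" % (fval, pval))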
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split

X, y = datasets.make_regression(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
lr = lm.LinearRegression()
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)

r2 = metrics.r2_score(y_test, yhat)
mse = metrics.mean_squared_error(y_test, yhat)
mae = metrics.mean_absolute_error(y_test, yhat)
In pure numpy:

res = y_test - yhat
y_mu = np.mean(y_test)
ss_tot = np.sum((y_test - y_mu) ** 2)
ss_res = np.sum(res ** 2)

r2 = (1 - ss_res / ss_tot)
mse = np.mean(res ** 2)
mae = np.mean(np.abs(res))
A thresholding of the activation (shifted by the bias or intercept) provides the predicted class
label.
The vector of parameters that defines the discriminative axis minimizes an objective function
𝐽(𝑤) that is the sum of a loss function 𝐿(𝑤) and some penalty on the weights vector Ω(𝑤):

\min_w J = \sum_i L\left(y_i, f(x_i^T w)\right) + \Omega(w),
This geometric method does not make any probabilistic assumptions, instead it relies on dis-
tances. It looks for the linear projection of the data points onto a vector, 𝑤, that maximizes
the between/within variance ratio, denoted 𝐹 (𝑤). Under a few assumptions, it will provide the
same results as linear discriminant analysis (LDA), explained below.
Suppose two classes of observations, 𝐶0 and 𝐶1 , have means 𝜇0 and 𝜇1 and the same total
within-class scatter (“covariance”) matrix,
S_W = \sum_{i \in C_0} (x_i - \mu_0)(x_i - \mu_0)^T + \sum_{j \in C_1} (x_j - \mu_1)(x_j - \mu_1)^T    (5.36)
= X_c^T X_c,    (5.37)
where 𝑋0 and 𝑋1 are the (𝑁0 × 𝑃 ) and (𝑁1 × 𝑃 ) matrices of samples of classes 𝐶0 and 𝐶1 .
Let 𝑆𝐵 be the scatter “between-class” matrix, given by

S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T.

The Fisher criterion to be maximized is the ratio of the between-class to the within-class
variance of the projected samples:

F_{Fisher}(w) = \frac{\sigma^2_{between}}{\sigma^2_{within}}    (5.38)
= \frac{(w^T \mu_1 - w^T \mu_0)^2}{w^T X_c^T X_c w}    (5.39)
= \frac{(w^T (\mu_1 - \mu_0))^2}{w^T X_c^T X_c w}    (5.40)
= \frac{w^T (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w}{w^T X_c^T X_c w}    (5.41)
= \frac{w^T S_B w}{w^T S_W w}.    (5.42)
In the two-class case, the maximum separation occurs by a projection on the (𝜇1 − 𝜇0 ) using
the Mahalanobis metric 𝑆𝑊 −1 , so that
𝑤 ∝ 𝑆𝑊 −1 (𝜇1 − 𝜇0 ).
Demonstration
\nabla_w F_{Fisher}(w) = 0
\nabla_w \left( \frac{w^T S_B w}{w^T S_W w} \right) = 0
(w^T S_W w)(2 S_B w) - (w^T S_B w)(2 S_W w) = 0
(w^T S_W w)(S_B w) = (w^T S_B w)(S_W w)
S_B w = \frac{w^T S_B w}{w^T S_W w} (S_W w)
S_B w = \lambda (S_W w)
S_W^{-1} S_B w = \lambda w.
Since we do not care about the magnitude of 𝑤, only its direction, we replaced the scalar factor
(𝑤𝑇 𝑆𝐵 𝑤)/(𝑤𝑇 𝑆𝑊 𝑤) by 𝜆.
In the multiple-class case, the solutions 𝑤 are determined by the eigenvectors of 𝑆𝑊 −1 𝑆𝐵 that
correspond to the 𝐾 − 1 largest eigenvalues.
However, in the two-class case (in which 𝑆𝐵 = (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 ) it is easy to show that
𝑤 = 𝑆𝑊 −1 (𝜇1 − 𝜇0 ) is the unique eigenvector of 𝑆𝑊 −1 𝑆𝐵 :
𝑆𝑊 −1 (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 𝑤 = 𝜆𝑤
𝑆𝑊 −1 (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 𝑆𝑊 −1 (𝜇1 − 𝜇0 ) = 𝜆𝑆𝑊 −1 (𝜇1 − 𝜇0 ),
𝑤 ∝ 𝑆𝑊 −1 (𝜇1 − 𝜇0 ).
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
from sklearn import metrics

np.set_printoptions(precision=2)
pd.set_option('precision', 2)
Exercise
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA fit on the dataset (X, y) introduced above (the fit is not shown in this extract)
y_pred_lda = LinearDiscriminantAnalysis().fit(X, y).predict(X)
errors = y_pred_lda != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_lda)))
Logistic regression is a generalized linear model, i.e. a linear model with a link
function that maps the output of the linear multiple regression to the posterior probability of
class 1, 𝑝(1|𝑥), using the logistic sigmoid function:

p(1 \mid w, x_i) = \frac{1}{1 + \exp(-w \cdot x_i)}
def logistic(x):  # logistic sigmoid
    return 1 / (1 + np.exp(-x))

x = np.linspace(-6, 6, 100)
plt.plot(x, logistic(x))
plt.grid(True)
plt.title('Logistic (sigmoid)')
Logistic regression is a discriminative model since it focuses only on the posterior probability
of each class, 𝑝(𝐶𝑘 |𝑥). It only requires estimating the 𝑃 weights of the 𝑤 vector. Thus it should
be favoured over LDA with many input features. In small-dimension and balanced situations it
provides predictions similar to those of LDA.
However, imbalanced group sizes cannot be explicitly controlled. This can be managed using a
reweighting of the input samples.
logreg = lm.LogisticRegression(penalty='none').fit(X, y)
# This class implements regularized logistic regression.
# C is the Inverse of regularization strength.
# Large value => no regularization.
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)
errors = y_pred_logreg != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_
˓→logreg)))
print(logreg.coef_)
Exercise
Explore the Logistic Regression parameters and propose a solution for the case of a highly
imbalanced training dataset 𝑁1 ≫ 𝑁0 when we know that in reality both classes have the same
probability 𝑝(𝐶1 ) = 𝑝(𝐶0 ).
5.5.4 Losses
The Loss function for sample 𝑖 is the negative log of the probability:
L(w, x_i, y_i) = \begin{cases} -\log(p(1 \mid w, x_i)) & \text{if } y_i = 1 \\ -\log(1 - p(1 \mid w, x_i)) & \text{if } y_i = 0 \end{cases}

For the whole dataset 𝑋, 𝑦 = {𝑥𝑖 , 𝑦𝑖 }, the loss function to minimize, 𝐿(𝑤, 𝑋, 𝑦), is the negative
log likelihood (nll), which can be simplified using a 0/1 coding of the label in the case of
binary classification:

L(w, X, y) = -\sum_i \left\{ y_i \log p(1 \mid w, x_i) + (1 - y_i) \log(1 - p(1 \mid w, x_i)) \right\}

This is known as the cross-entropy between the true label 𝑦 and the predicted probability 𝑝.
For the logistic regression case, we have:

L(w, X, y) = -\sum_i \left\{ y_i\, w \cdot x_i - \log(1 + \exp(w \cdot x_i)) \right\}
TODO
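A quick numeric check of the loss above, comparing a direct numpy computation of the negative log-likelihood with sklearn.metrics.log_loss (the toy labels and probabilities are assumptions):

import numpy as np
from sklearn import metrics

y = np.array([1, 0, 1, 1, 0])            # true 0/1 labels
p = np.array([0.9, 0.2, 0.6, 0.7, 0.4])  # predicted probabilities p(1|w, x_i)

# Negative log-likelihood (cross-entropy), summed over samples
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(nll)
print(metrics.log_loss(y, p, normalize=False))  # same value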
5.5.5 Overfitting
The penalties used in regression are also used in classification; the only difference is the loss
function, generally the negative log likelihood (cross-entropy) or the hinge loss. We will explore:
• Ridge (also called ℓ2) penalty: ‖w‖²₂. It shrinks coefficients toward 0.
• Lasso (also called ℓ1) penalty: ‖w‖₁. It performs feature selection by setting some
coefficients to 0.
• ElasticNet (also called ℓ1ℓ2) penalty: 𝛼(𝜌‖w‖₁ + (1 − 𝜌)‖w‖²₂). It performs selection of
groups of correlated features.
lr = lm.LogisticRegression(penalty='none').fit(X, y)
When the matrix 𝑆𝑊 is not full rank or 𝑃 ≫ 𝑁 , the Fisher most discriminant projection
estimate is not unique. This can be solved using a biased version of 𝑆𝑊 :
𝑆𝑊 𝑅𝑖𝑑𝑔𝑒 = 𝑆𝑊 + 𝜆𝐼
where 𝐼 is the 𝑃 × 𝑃 identity matrix. This leads to the regularized (ridge) estimator of the
Fisher’s linear discriminant analysis:
Increasing 𝜆 will:
• Shrink the coefficients toward zero.
• Make the covariance converge toward the diagonal matrix, reducing the contribution of
the pairwise covariances.
The objective function to be minimized is now the combination of the logistic loss (negative
log likelihood) − log ℒ(𝑤) with a penalty of the L2 norm of the weights vector. In the two-class
case, using the 0/1 coding we obtain:
# Ridge (L2-penalized) logistic regression; C = 1/lambda (the value C=.1 is illustrative)
lrl2 = lm.LogisticRegression(penalty='l2', C=.1)
lrl2.fit(X, y)
y_pred_l2 = lrl2.predict(X)
prob_pred_l2 = lrl2.predict_proba(X)
print("Coef vector:")
print(lrl2.coef_)
errors = y_pred_l2 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y)))
The objective function to be minimized is now the combination of the logistic loss − log ℒ(𝑤)
with a penalty of the L1 norm of the weights vector. In the two-class case, using the 0/1 coding
we obtain:
# Lasso (L1-penalized) logistic regression; the L1 penalty requires the 'liblinear'
# or 'saga' solver (the value C=.1 is illustrative)
lrl1 = lm.LogisticRegression(penalty='l1', C=.1, solver='saga')
lrl1.fit(X, y)
y_pred_lrl1 = lrl1.predict(X)

errors = y_pred_lrl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_lrl1)))
print("Coef vector:")
print(lrl1.coef_)
Support vector machines (SVM) seek a separating hyperplane with maximum margin to enforce
robustness against noise. Like logistic regression it is a discriminative method that only focuses
on predictions.
Here we present the non-separable case of maximum margin classifiers with ±1 coding (i.e.:
𝑦𝑖 ∈ {−1, +1}). In the next figure the legend applies to samples of the “dot” class.
Here we introduce the slack variables 𝜉𝑖 , with 𝜉𝑖 = 0 for points that are on or inside the
correct margin boundary and 𝜉𝑖 = |𝑦𝑖 − (𝑤 · 𝑥𝑖 )| for other points. Thus:
1. If 𝑦𝑖 (𝑤 · 𝑥𝑖 ) ≥ 1 then the point lies outside the margin but on the correct side of the
decision boundary. In this case 𝜉𝑖 = 0. The constraint is thus not active for this point. It
does not contribute to the prediction.
2. If 1 > 𝑦𝑖 (𝑤 · 𝑥𝑖 ) ≥ 0 then the point lies inside the margin and on the correct side of the
decision boundary. In this case 0 < 𝜉𝑖 ≤ 1. The constraint is active for this point. It does
contribute to the prediction as a support vector.
3. If 𝑦𝑖 (𝑤 · 𝑥𝑖 ) < 0 then the point is on the wrong side of the decision boundary (misclassi-
fication). In this case 𝜉𝑖 > 1. The constraint is active for this point. It does contribute
to the prediction as a support vector.
This loss is called the hinge loss, defined as:
max(0, 1 − 𝑦𝑖 (𝑤 · 𝑥𝑖 ))
So linear SVM is close to ridge logistic regression, using the hinge loss instead of the logistic
loss. Both will provide very similar predictions.
from sklearn import svm

svmlin = svm.LinearSVC(C=.1)
# Remark: by default LinearSVC uses squared_hinge as loss
svmlin.fit(X, y)
y_pred_svmlin = svmlin.predict(X)

errors = y_pred_svmlin != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_svmlin)))
print("Coef vector:")
print(svmlin.coef_)
Linear SVM for classification (also called SVM-C or SVC) with l1-regularization
\min_w F_{\text{Lasso linear SVM}}(w) = \|w\|_1 + C \sum_i^N \xi_i
\quad \text{with } \forall i, \quad y_i (w \cdot x_i) \geq 1 - \xi_i

# L1-penalized linear SVM; the L1 penalty requires squared hinge loss and dual=False
# (the value C=.1 is illustrative)
svmlinl1 = svm.LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=.1)
svmlinl1.fit(X, y)
y_pred_svmlinl1 = svmlinl1.predict(X)

errors = y_pred_svmlinl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_svmlinl1)))
print("Coef vector:")
print(svmlinl1.coef_)
Exercise
Compare predictions of logistic regression (LR) and their SVM counterparts, i.e.: L2 LR vs L2
SVM and L1 LR vs L1 SVM.
• Compute the correlation between pairs of weights vectors.
• Compare the predictions of two classifiers using their decision function:
– Give the equation of the decision function for a linear classifier, assuming that there
is no intercept.
– Compute the correlation decision function.
– Plot the pairwise decision function of the classifiers.
• Conclude on the differences between Linear SVM and logistic regression.
The objective function to be minimized is now the combination of the logistic loss − log ℒ(𝑤) or
the hinge loss with a combination of L1 and L2 penalties. In the two-class case, using the 0/1
coding we obtain:
# Elastic-net penalized classifiers (SGD); hyper-parameter values are illustrative
enetlog = lm.SGDClassifier(loss="log", penalty="elasticnet", l1_ratio=0.15).fit(X, y)
enethinge = lm.SGDClassifier(loss="hinge", penalty="elasticnet", l1_ratio=0.15).fit(X, y)
# Or saga solver:
# enetloglike = lm.LogisticRegression(penalty='elasticnet',
#                                     C=.1, l1_ratio=0.5, solver='saga')
print("Hinge loss and logistic loss provide almost the same predictions.")
print("Confusion matrix")
metrics.confusion_matrix(enetlog.predict(X), enethinge.predict(X))
Hinge loss and logistic loss provide almost the same predictions.
Confusion matrix
Decision function: log vs. hinge losses.
source: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Imagine a study evaluating a new test that screens people for a disease. Each person taking the
test either has or does not have the disease. The test outcome can be positive (classifying the
person as having the disease) or negative (classifying the person as not having the disease). The
test results for each subject may or may not match the subject’s actual status. In that setting:
• True positive (TP): Sick people correctly identified as sick
• False positive (FP): Healthy people incorrectly identified as sick
• True negative (TN): Healthy people correctly identified as healthy
• False negative (FN): Sick people incorrectly identified as healthy
• Accuracy (ACC):
ACC = (TP + TN) / (TP + FP + FN + TN)
• Sensitivity (SEN) or recall of the positive class or true positive rate (TPR) or hit rate:
SEN = TP / P = TP / (TP+FN)
• Specificity (SPC) or recall of the negative class or true negative rate:
SPC = TN / N = TN / (TN+FP)
• Precision or positive predictive value (PPV):
PPV = TP / (TP + FP)
• Balanced accuracy (bACC): a useful performance measure that avoids inflated performance
estimates on imbalanced datasets (Brodersen, et al. (2010). “The balanced accuracy and its
posterior distribution”). It is defined as the arithmetic mean of sensitivity and specificity,
or the average accuracy obtained on either class:

# y_true, y_pred: true and predicted labels of a (hypothetical) classifier
acc = metrics.accuracy_score(y_true, y_pred)

# Balanced accuracy: mean of the recall of each class
recalls = metrics.recall_score(y_true, y_pred, average=None)
b_acc = recalls.mean()
P-value associated with a classification rate. Compare the number of correct classifications
(= accuracy × 𝑁 ) to the null hypothesis of a Binomial distribution of parameters 𝑝 (typically
50%: the chance level) and 𝑁 (the number of observations).
Is 65% of accuracy a significant prediction rate among 70 observations?
Since this is an exact, two-sided test of the null hypothesis, the p-value can be divided by 2
since we test that the accuracy is superior to the chance level.
import scipy.stats
acc, N = 0.65, 70
pval = scipy.stats.binom_test(x=int(acc * N), n=N, p=0.5) / 2
print(pval)
0.01123144774625465
Some classifier may have found a good discriminative projection 𝑤. However, if the threshold
used to decide the final predicted class is poorly adjusted, the performances will show a high
specificity and a low sensitivity, or the contrary.
In this case it is recommended to use the AUC of a ROC analysis, which basically provides a
measure of the overlap of the two classes when points are projected on the discriminative axis.
For more details on ROC and AUC see: https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
A minimal sketch is given below.
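The sketch below uses a simulated dataset and a logistic regression classifier; both choices are assumptions made only to illustrate roc_curve and roc_auc_score.

from sklearn import datasets, metrics
import sklearn.linear_model as lm
from sklearn.model_selection import train_test_split

X, y = datasets.make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = lm.LogisticRegression().fit(X_train, y_train)
# Scores (probabilities) of the positive class, not hard labels
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
print("AUC: %.2f" % metrics.roc_auc_score(y_test, y_score))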
Learning with discriminative (logistic regression, SVM) methods is generally based on minimiz-
ing the misclassification of training samples, which may be unsuitable for imbalanced datasets
where the recognition might be biased in favor of the most numerous class. This problem
can be addressed with a generative approach, which typically requires more parameters to be
determined leading to reduced performances in high dimension.
Dealing with imbalanced class may be addressed by three main ways (see Japkowicz and
Stephen (2002) for a review), resampling, reweighting and one class learning.
In sampling strategies, either the minority class is oversampled or majority class is undersam-
pled or some combination of the two is deployed. Undersampling (Zhang and Mani, 2003) the
majority class would lead to a poor usage of the left-out samples. Sometime one cannot afford
such strategy since we are also facing a small sample size problem even for the majority class.
Informed oversampling, which goes beyond a trivial duplication of minority class samples, re-
quires the estimation of class conditional distributions in order to generate synthetic samples.
Here generative models are required. An alternative, proposed in (Chawla et al., 2002) generate
samples along the line segments joining any/all of the k minority class nearest neighbors. Such
procedure blindly generalizes the minority area without regard to the majority class, which may
be particularly problematic with high-dimensional and potentially skewed class distribution.
Reweighting, also called cost-sensitive learning, works at an algorithmic level by adjusting
the costs of the various classes to counter the class imbalance. Such reweighting can be im-
plemented within SVM (Chang and Lin, 2001) or logistic regression (Friedman et al., 2010)
classifiers. Most classifiers of Scikit learn offer such reweighting possibilities.
The class_weight parameter can be set to "balanced", which uses the values of 𝑦 to
automatically adjust weights inversely proportional to class frequencies in the input data,
as 𝑁/(2𝑁𝑘 ) (see the sketch after the dataset below).
# dataset
X, y = datasets.make_classification(n_samples=500,
n_features=5,
n_informative=2,
n_redundant=0,
n_repeated=0,
n_classes=2,
random_state=1,
shuffle=False)
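A minimal sketch of the reweighting strategy on an imbalanced variant of such a dataset; the 90/10 imbalance and the classifier settings are assumptions.

from sklearn import datasets, metrics
import sklearn.linear_model as lm

# Imbalanced two-class dataset: ~90% / 10%
X_imb, y_imb = datasets.make_classification(n_samples=500, n_features=5,
                                            n_informative=2, weights=[0.9, 0.1],
                                            random_state=1)

for class_weight in [None, "balanced"]:
    clf = lm.LogisticRegression(class_weight=class_weight).fit(X_imb, y_imb)
    bacc = metrics.balanced_accuracy_score(y_imb, clf.predict(X_imb))
    print(class_weight, "balanced ACC: %.2f" % bacc)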
5.5.16 Exercise
Write a class FisherLinearDiscriminant that implements the Fisher’s linear discriminant
analysis. This class must be compliant with the scikit-learn API by providing two methods:
• fit(X, y) which fits the model and returns the object itself;
• predict(X) which returns a vector of the predicted values.
Apply the object on the dataset presented for the LDA.
Here we focus on non-linear models for classification. Nevertheless, each classification model
has its regression counterpart.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.set_printoptions(precision=2)
pd.set_option('precision', 2)
SVMs are kernel-based methods, which require only a user-specified kernel function 𝐾(𝑥𝑖 , 𝑥𝑗 ),
i.e., a similarity function over pairs of data points (𝑥𝑖 , 𝑥𝑗 ), mapping them into a kernel (dual)
space in which learning algorithms operate linearly, i.e. every operation on points is a linear
combination of 𝐾(𝑥𝑖 , 𝑥𝑗 ).
Outline of the SVM algorithm:
1. Map points 𝑥 into kernel space using a kernel function: 𝑥 → 𝐾(𝑥, .).
2. Learning algorithms operates linearly by dot product into high-kernel space 𝐾(., 𝑥𝑖 ) ·
𝐾(., 𝑥𝑗 ).
• Using the kernel trick (Mercer’s Theorem) replaces dot product in high dimensional
space by a simpler operation such that 𝐾(., 𝑥𝑖 ) · 𝐾(., 𝑥𝑗 ) = 𝐾(𝑥𝑖 , 𝑥𝑗 ). Thus we only
need to compute a similarity measure for each pairs of point and store in a 𝑁 × 𝑁
Gram matrix.
• Finally, the learning process consists of estimating the 𝛼𝑖 of the decision function that
minimizes the hinge loss (of 𝑓 (𝑥)) plus some penalty when applied on all training points:

f(x) = \text{sign}\left( \sum_i^N \alpha_i\, y_i\, K(x_i, x) \right).
K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)    (5.48)
= \exp\left( -\gamma\, \|x_i - x_j\|^2 \right)    (5.49)

Where 𝜎 (or 𝛾) defines the kernel width parameter. Basically, we consider a Gaussian function
centered on each training sample 𝑥𝑖 . It has a ready interpretation as a similarity measure, as it
decreases with the squared Euclidean distance between the two feature vectors.
Non-linear SVMs also exist for regression problems.
Dataset

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.5, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the scaler fitted on the training set
Out:
True
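A minimal sketch of a kernel (RBF) SVM on the scaled dataset above; the C and gamma values are scikit-learn defaults and are stated here only as an assumption.

from sklearn import svm, metrics

# RBF-kernel SVM on the standardized breast cancer data above
svc = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print("Test bACC: %.2f" % metrics.balanced_accuracy_score(y_test, y_pred))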
Decision tree
A tree can be “learned” by splitting the training dataset into subsets based on a feature value
test. Each internal node represents a “test” on a feature resulting in the split of the current
sample. At each step the algorithm selects the feature and a cutoff value that maximises a given
metric. Different metrics exist for regression trees (the target is continuous) or classification
trees (the target is qualitative). This process is repeated on each derived subset in a recursive
manner called recursive partitioning. The recursion is completed when the subset at a node has
all the same value of the target variable, or when splitting no longer adds value to the predictions.
This general principle is implemented by many recursive partitioning tree algorithms.
Decision trees are simple to understand and interpret, however they tend to overfit the training
set: a single decision tree learns from only one pathway of decisions and usually does not make
accurate predictions on new data, as illustrated in the sketch below. Leo Breiman proposed
random forests to deal with this issue.
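A minimal sketch illustrating this overfitting on the (scaled) dataset above, comparing the train and test accuracy of an unconstrained tree:

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# A fully grown tree perfectly fits the training set but generalizes less well
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train ACC: %.2f" % metrics.accuracy_score(y_train, tree.predict(X_train)))
print("Test ACC: %.2f" % metrics.accuracy_score(y_test, tree.predict(X_test)))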
Forest
A random forest is a meta estimator that fits a number of decision tree learners on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting. Random forest models reduce the risk of overfitting by introducing randomness by:
• building each tree on a bootstrap sample of the training set;
• selecting a random subset of the features at each split.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)  # settings illustrative
y_pred = forest.predict(X_test)
y_prob = forest.predict_proba(X_test)[:, 1]
Out:
Gradient boosting is a meta estimator that fits a sequence of weak learners. Each learner
aims to reduce the residuals (errors) produced by the previous learner. The two main hyper-
parameters are:
• The learning rate (lr) controls over-fitting: decreasing the lr limits the capacity of a
learner to overfit the residuals, ie, it slows down the learning speed and thus increases the
regularisation.
• The sub-sampling fraction controls the fraction of samples to be used for fitting the
learners. Values smaller than 1 lead to Stochastic Gradient Boosting. It thus controls
over-fitting, reducing variance and increasing bias.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                subsample=0.5, random_state=0)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
y_prob = gb.predict_proba(X_test)[:, 1]
Out:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Machine learning algorithms overfit training data. Predictive performances MUST be evaluated
on an independent hold-out dataset.
1. Training dataset: Dataset used to fit the model (set the model parameters like weights).
The training error can be easily calculated by applying the statistical learning method to
the observations used in its training. But because of overfitting, the training error rate
can dramatically underestimate the error that would be obtained on new samples.
mod = lm.Ridge(alpha=10)
mod.fit(X_train, y_train)
y_pred_test = mod.predict(X_test)
print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))
Out:
The grid search procedure (GridSearchCV) performs a model selection of the best hyper-
parameters 𝛼 over a grid of possible values. The train set is split (inner split) into
train/validation sets.
Model selection with grid search procedure:
1. Fit the learner (ie. estimate parameters Ω𝑘 ) on training set: X𝑡𝑟𝑎𝑖𝑛 , y𝑡𝑟𝑎𝑖𝑛 → 𝑓𝛼𝑘 ,Ω𝑘 (.)
2. Evaluate the model on the validation set and keep the hyper-parameter(s) that minimises
the error measure 𝛼* = arg min 𝐿(𝑓𝛼𝑘 ,Ω𝑘 (X𝑣𝑎𝑙 ), y𝑣𝑎𝑙 )
3. Refit the learner on all training + validation data, X𝑡𝑟𝑎𝑖𝑛∪𝑣𝑎𝑙 , y𝑡𝑟𝑎𝑖𝑛∪𝑣𝑎𝑙 , using the best
hyper parameters (𝛼* ): → 𝑓𝛼* ,Ω* (.)
Model evaluation: on the test set: 𝐿(𝑓𝛼* ,Ω* (X𝑡𝑒𝑠𝑡 ), y𝑡𝑒𝑠𝑡 )
from sklearn.model_selection import PredefinedSplit, GridSearchCV

# validation_idx: -1 for training samples, 0 for validation samples (assumed defined above)
split_inner = PredefinedSplit(test_fold=validation_idx)
print("Train set size: %i" % X_train[train_idx].shape[0])
print("Validation set size: %i" % X_train[validation_idx].shape[0])
print("Test set size: %i" % X_test.shape[0])

# Grid search over alpha (the Ridge model and the grid are illustrative assumptions)
lm_cv = GridSearchCV(lm.Ridge(), {'alpha': 10. ** np.arange(-3, 3)}, cv=split_inner)
lm_cv.fit(X_train, y_train)

# Predict
y_pred_test = lm_cv.predict(X_test)
print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))
Out:
If sample size is limited, train/validation/test split may not be possible. Cross Validation (CV)
can be used to replace train/validation split and/or train+validation / test split.
Cross-Validation scheme randomly divides the set of observations into K groups, or folds, of
approximately equal size. The first fold is treated as a validation set, and the method 𝑓 () is
fitted on the remaining union of K − 1 folds: 𝑓 (𝑋 −𝐾 , 𝑦 −𝐾 ). The measure of performance
(the score function 𝒮) is an average of a loss (error) or correct-prediction measure, noted ℒ,
between a true target value and the predicted target value, evaluated on the observations of the
held-out fold. For each sample i we consider the model 𝑓 (𝑋 −𝑘(𝑖) , 𝑦 −𝑘(𝑖) ) estimated on the data
set without the group k that contains i, noted -k(i). This procedure is repeated K times; each
time, a different group of observations is treated as a test set. Then we compare the predicted
value 𝑓−𝑘(𝑖) (𝑥𝑖 ) = 𝑦ˆ𝑖 with the true value 𝑦𝑖 using an error or loss function ℒ(𝑦, 𝑦ˆ).
For 10-fold we can either average over 10 values (Macro measure) or concatenate the 10 ex-
periments and compute the micro measures.
Two strategies, micro vs macro estimates (https://stats.stackexchange.com/questions/34611/meanscores-vs-scoreconcatenation-in-cross-validation):
• Micro measure: average(individual scores): compute a score 𝒮 for each sample and
average over all samples. It is similar to average score(concatenation): an averaged
score computed over all concatenated samples.
• Macro measure: mean(CV scores) (the most commonly used method): compute a score
𝒮 on each fold k and average across folds.
These two measures (an average of averages vs. a global average) are generally similar. They
may differ slightly if folds are of different sizes. This validation scheme is known as the K-Fold
CV. Typical choices of K are 5 or 10 [Kohavi 1995]. The extreme case where K = N is known
as leave-one-out cross-validation, LOO-CV.
CV for regression
Usually the scoring function ℒ() is the r-squared score. However other functions (MAE, MSE)
can be used.
CV with explicit loop:
estimator = lm.Ridge(alpha=10)
Out:
Train r2:0.99
Test r2:0.67
Out:
Test r2:0.73
Test r2:0.67
Out:
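The explicit CV loop itself is not reproduced in this extract; a minimal sketch (Ridge estimator, 5-fold KFold, simulated regression data; all settings are assumptions) is:

import numpy as np
from sklearn import datasets, metrics
import sklearn.linear_model as lm
from sklearn.model_selection import KFold

X, y = datasets.make_regression(n_samples=100, n_features=100,
                                n_informative=10, noise=10, random_state=42)

estimator = lm.Ridge(alpha=10)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

r2_train, r2_test = [], []
for train_idx, test_idx in cv.split(X):
    estimator.fit(X[train_idx, :], y[train_idx])
    r2_train.append(metrics.r2_score(y[train_idx], estimator.predict(X[train_idx, :])))
    r2_test.append(metrics.r2_score(y[test_idx], estimator.predict(X[test_idx, :])))

print("Train r2:%.2f" % np.mean(r2_train))
print("Test r2:%.2f" % np.mean(r2_test))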
With classification problems it is essential to sample folds where each set contains approximately
the same percentage of samples of each target class as the complete set. This is called
stratification. In this case, we will use StratifiedKFold, which is a variation of k-fold that returns
stratified folds. Usually the score functions 𝐿() are, at least, the sensitivity and the specificity.
However other functions could be used.
CV with explicit loop:
cv = StratifiedKFold(n_splits=5)
Out:
Out:
Test ACC:0.80
Out:
Test bACC:0.80
Out:
Combine CV and grid search: re-split (inner split) the train set into CV train/validation folds
and build a GridSearchCV out of it:
# Outer split:
X_train, X_test, y_train, y_test =\
train_test_split(X, y, test_size=0.25, shuffle=True, random_state=42)
# Predict
y_pred_test = lm_cv.predict(X_test)
print("Test bACC: %.2f" % metrics.balanced_accuracy_score(y_test, y_pred_test))
Out:
5.7.6 Cross-validation for both model (outer) evaluation and model (inner) selection
Out:
# Inner CV for model selection (the classifier and its grid are illustrative assumptions;
# only n_jobs=-1 and cv=5 were visible in the original call)
mod_cv = GridSearchCV(lm.LogisticRegression(), {'C': 10. ** np.arange(-3, 3)},
                      n_jobs=-1, cv=5)
# Outer CV for model evaluation
scores = cross_val_score(estimator=mod_cv, X=X, y=y, cv=5)
print("Test ACC:%.2f" % scores.mean())
Out:
Regression
Out:
A permutation test is a type of non-parametric randomization test in which the null distribution
of a test statistic is estimated by randomly permuting the observations.
Permutation tests are highly attractive because they make no assumptions other than that the
observations are independent and identically distributed under the null hypothesis.
1. Compute an observed statistic 𝑡𝑜𝑏𝑠 on the data.
2. Use randomization to compute the distribution of 𝑡 under the null hypothesis: perform 𝑁
random permutations of the data. For each permuted sample 𝑖, compute the statistic 𝑡𝑖 .
This procedure provides the distribution of 𝑡 under the null hypothesis 𝐻0 : 𝑃 (𝑡|𝐻0 ).
3. Compute the p-value = 𝑃 (𝑡 > 𝑡𝑜𝑏𝑠 |𝐻0 ) ≈ |{𝑡𝑖 > 𝑡𝑜𝑏𝑠 }|/𝑁 , where the 𝑡𝑖 include the
observed (unpermuted) statistic 𝑡𝑜𝑏𝑠 .
Example Ridge regression
Sample the distributions of r-squared and coefficients of ridge regression under the null hypoth-
esis. Simulated dataset:
# model, nperm, scores_perm and coefs_perm are assumed to be set up beforehand:
# a regression model, the number of permutations, and arrays of shape (nperm + 1, .)
# whose first row holds the statistics of the unpermuted fit.
orig_all = np.arange(X.shape[0])
for perm_i in range(1, nperm + 1):
    model.fit(X, np.random.permutation(y))
    y_pred = model.predict(X).ravel()
    scores_perm[perm_i, :] = metrics.r2_score(y, y_pred)
    coefs_perm[perm_i, :] = model.coef_
Out:
Compute p-values corrected for multiple comparisons using FWER max-T (Westfall and Young,
1993) procedure.
Out:
Plot the distribution of each coefficient under the null hypothesis: coefficients 0 and 1 are
significantly different from 0.

# hist_pvalue: helper plotting function defined in the course material
# (its definition is truncated in this extract)
n_coef = coefs_perm.shape[1]
fig, axes = plt.subplots(n_coef, 1, figsize=(12, 9))
for i in range(n_coef):
    hist_pvalue(coefs_perm[:, i], axes[i], str(i))
Exercise
Given the logistic regression presented above and its validation given a 5 folds CV.
1. Compute the p-value associated with the prediction accuracy measured with 5CV using a
permutation test.
2. Compute the p-value associated with the prediction accuracy using a parametric test.
5.7.10 Bootstrapping
# Bootstrap loop
nboot = 100  # !! Should be at least 1000
scores_names = ["r2"]
scores_boot = np.zeros((nboot, len(scores_names)))
coefs_boot = np.zeros((nboot, X.shape[1]))

orig_all = np.arange(X.shape[0])
for boot_i in range(nboot):
    # Bootstrap sample (with replacement) for training, out-of-bag samples for testing
    boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True)
    boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False)
    Xtr, ytr = X[boot_tr, :], y[boot_tr]
    Xte, yte = X[boot_te, :], y[boot_te]
    model.fit(Xtr, ytr)
    y_pred = model.predict(Xte).ravel()
    scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred)
    coefs_boot[boot_i, :] = model.coef_
coefs_boot = pd.DataFrame(coefs_boot)
coefs_stat = coefs_boot.describe(percentiles=[.975, .5, .025])
print("Coefficients distribution")
print(coefs_stat)
Out:
df = pd.DataFrame(coefs_boot)
staked = pd.melt(df, var_name="Variable", value_name="Coef. distribution")
sns.set_theme(style="whitegrid")
ax = sns.violinplot(x="Variable", y="Coef. distribution", data=staked)
_ = ax.axhline(0, ls='--', lw=2, color="black")
Dataset
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.model_selection import StratifiedKFold
X, y = datasets.make_classification(n_samples=20, n_features=5, n_informative=2,␣
˓→random_state=42)
cv = StratifiedKFold(n_splits=5)
Out:
Sequential computation
If we want to have full control of the operations performed within each fold (to retrieve the
models' parameters, etc.), we would like to parallelize the following sequential code:

# Sequential CV loop collecting the test accuracy and the coefficients of each fold
# (the estimator settings are illustrative assumptions):
estimator = lm.LogisticRegression(C=1)
test_accs, coefs_seq = [], []
for train, test in cv.split(X, y):
    estimator.fit(X[train, :], y[train])
    test_accs.append(metrics.accuracy_score(y[test], estimator.predict(X[test, :])))
    coefs_seq.append(estimator.coef_)

print(np.mean(test_accs), test_accs)

coefs_cv = np.array(coefs_seq)
print(coefs_cv)
print(coefs_cv.mean(axis=0))

print("Std Err of the coef")
print(coefs_cv.std(axis=0) / np.sqrt(coefs_cv.shape[0]))
Out:
from joblib import Parallel, delayed
from sklearn.base import clone

# _split_fit_predict: course-material helper fitting a clone of the estimator on the train split
parallel = Parallel(n_jobs=5)
cv_ret = parallel(
    delayed(_split_fit_predict)(
        clone(estimator), X, y, train, test)
    for train, test in cv.split(X, y))

# test_accs is gathered from cv_ret (the extraction step is not shown in this extract)
print(np.mean(test_accs), test_accs)
Out:
These methods are Ensemble learning techniques. These models are machine learning
paradigms where multiple models (often called “weak learners”) are trained to solve the same
problem and combined to get better results. The main hypothesis is that when weak models
are correctly combined we can obtain more accurate and/or robust models.
To understand these techniques, we will first explore what bootstrapping is and its underlying
hypotheses.
5.8.2 Bagging
In parallel methods we fit the different considered learners independently from each other, so
it is possible to train them concurrently. The most famous such approach is “bagging” (standing
for “bootstrap aggregating”), which aims at producing an ensemble model that is more robust
than the individual models composing it.
When training a model, no matter if we are dealing with a classification or a regression problem,
we obtain a function that takes an input, returns an output and that is defined with respect to
the training dataset.
The idea of bagging is then simple: we want to fit several independent models and “average”
their predictions in order to obtain a model with a lower variance. However, we can’t, in
practice, fit fully independent models because it would require too much data. So, we rely on
the good “approximate properties” of bootstrap samples (representativity and independence)
to fit models that are almost independent.
First, we create multiple bootstrap samples so that each new bootstrap sample will act as
another (almost) independent dataset drawn from true distribution. Then, we can fit a weak
learner for each of these samples and finally aggregate them such that we kind of “aver-
age” their outputs and, so, obtain an ensemble model with less variance that its components.
Roughly speaking, as the bootstrap samples are approximatively independent and identically
distributed (i.i.d.), so are the learned base models. Then, “averaging” weak learners outputs
do not change the expected answer but reduce its variance.
So, assuming that we have L bootstrap samples (approximations of L independent datasets) of
size B, denoted
{z_1^1, ..., z_B^1}, ..., {z_1^L, ..., z_B^L},
we can fit L almost independent weak learners (one on each bootstrap sample)
w_1(.), w_2(.), ..., w_L(.)
and then aggregate them into some kind of averaging process in order to get an ensemble model
with a lower variance. For example, we can define our strong model such that (simple average,
here for a regression problem)
s_L(.) = (1/L) Σ_{l=1..L} w_l(.)
There are several possible ways to aggregate the multiple models fitted in parallel:
• For a regression problem, the outputs of individual models can literally be averaged to obtain
the output of the ensemble model.
• For a classification problem, the class outputted by each model can be seen as a vote and the
class that receives the majority of the votes is returned by the ensemble model (this is called
hard-voting).
• Still for a classification problem, we can also consider the probabilities of each class returned
by all the models, average these probabilities and keep the class with the highest average
probability (this is called soft-voting).
Averages or votes can either be simple or weighted, if any relevant weights can be used. Hard
and soft voting are sketched with scikit-learn just below.
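As an illustration (not part of the original examples), hard and soft voting can be tried directly
with scikit-learn's VotingClassifier on a toy dataset; the dataset and estimator choices below are
arbitrary:
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

Xv, yv = make_classification(n_samples=200, random_state=42)
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('knn', KNeighborsClassifier()),
              ('tree', DecisionTreeClassifier(max_depth=3))]

hard_vote = VotingClassifier(estimators, voting='hard')   # majority class vote
soft_vote = VotingClassifier(estimators, voting='soft')   # average of predicted probabilities
print("Hard voting:", cross_val_score(hard_vote, Xv, yv, cv=5).mean())
print("Soft voting:", cross_val_score(soft_vote, Xv, yv, cv=5).mean())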
Finally, we can mention that one of the big advantages of bagging is that it can be parallelised.
As the different models are fitted independently from each other, intensive parallelisation
techniques can be used if required.
Bagging consists in fitting several base models on different bootstrap samples and building an
ensemble model that "averages" the results of these weak learners.
Question: can you name an algorithm based on the bagging technique? Hint: leaf.
Examples
Here, we try some examples of bagging:
• Bagged Decision Trees for Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)
array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]
# 10-fold CV of a bagging ensemble of decision trees (evaluation setup assumed from the accuracy reported below)
kfold = model_selection.KFold(n_splits=10)
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100)
results = model_selection.cross_val_score(model, x, y, cv=kfold)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)
array = dataframe.values
x = array[:, 0:8]
y = array[:, 8]
# 10-fold CV of a random forest (evaluation setup assumed, as above)
kfold = model_selection.KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=100, max_features=3)
results = model_selection.cross_val_score(model, x, y, cv=kfold)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
Both of these algorithms print Accuracy: 0.77 (+/- 0.07); on this dataset they perform equivalently.
5.8.3 Boosting
In sequential methods the different combined weak models are no longer fitted independently
from each other. The idea is to fit models iteratively such that the training of the model at
a given step depends on the models fitted at the previous steps. “Boosting” is the most famous
of these approaches and it produces an ensemble model that is in general less biased than the
weak learners that compose it.
Boosting methods work in the same spirit as bagging methods: we build a family of models
that are aggregated to obtain a strong learner that performs better.
However, unlike bagging, which mainly aims at reducing variance, boosting is a technique
that consists in fitting multiple weak learners sequentially in a very adaptive way: each
model in the sequence is fitted giving more importance to observations in the dataset that
were badly handled by the previous models in the sequence. Intuitively, each new model
focuses its efforts on the observations that were the most difficult to fit so far, so that we
obtain, at the end of the process, a strong learner with lower bias (even if boosting can
also have the effect of reducing variance).
Boosting, like bagging, can be used for regression as well as for classification problems.
Being mainly focused on reducing bias, the base models that are often considered for boosting
are models with low variance but high bias. For example, if we want to use trees as our base
models, we will most of the time choose shallow decision trees with only a few levels of depth.
Another important reason that motivates the use of low variance but high bias models as weak
learners for boosting is that these models are in general less computationally expensive to fit
(few degrees of freedom when parametrised). Indeed, as computations to fit the different mod-
els can’t be done in parallel (unlike bagging), it could become too expensive to fit sequentially
several complex models.
Once the weak learners have been chosen, we still need to define how they will be sequentially
fitted and how they will be aggregated. We will discuss these questions in the two follow-
ing subsections, describing more especially two important boosting algorithms: adaboost and
gradient boosting.
In a nutshell, these two meta-algorithms differ on how they create and aggregate the weak
learners during the sequential process. Adaptive boosting updates the weights attached to
each of the training dataset observations whereas gradient boosting updates the value of
these observations. This main difference comes from the way both methods try to solve the
optimisation problem of finding the best model that can be written as a weighted sum of weak
learners.
Boosting consists in iteratively fitting a weak learner, aggregating it into the ensemble model and
"updating" the training dataset to better take into account the strengths and weaknesses of the
current ensemble model when fitting the next base model.
1/ Adaptive boosting
In adaptive boosting (often called "adaboost"), we try to define our ensemble model as a
weighted sum of L weak learners
s_L(.) = Σ_{l=1..L} c_l × w_l(.)
where the c_l's are coefficients and the w_l's are weak learners.
Finding the best ensemble model with this form is a difficult optimisation problem. So,
instead of trying to solve it in one single shot (finding all the coefficients and weak learners
that give the best overall additive model), we make use of an iterative optimisation process
that is much more tractable, even if it can lead to a sub-optimal solution. More especially,
we add the weak learners one by one, looking at each iteration for the best possible pair
(coefficient, weak learner) to add to the current ensemble model. In other words, we define
recurrently the s_l's such that
s_l(.) = s_(l-1)(.) + c_l × w_l(.)
where c_l and w_l are chosen such that s_l is the model that best fits the training data and,
so, that is the best possible improvement over s_(l-1). We can then denote
(c_l, w_l) = argmin_{c,w} E(s_(l-1)(.) + c × w(.)) = argmin_{c,w} Σ_n e(y_n, s_(l-1)(x_n) + c × w(x_n))
where E(.) is the fitting error of the given model and e(.,.) is the loss/error function. Thus,
instead of optimising “globally” over all the L models in the sum, we approximate the optimum
by optimising “locally” building and adding the weak learners to the strong model one by one.
More especially, when considering a binary classification, we can show that the adaboost algorithm
can be re-written into a process that proceeds as follows. First, it updates the observation
weights in the dataset and trains a new weak learner with a special focus given to the
observations misclassified by the current ensemble model. Second, it adds the weak learner to the
weighted sum according to an update coefficient that expresses the performance of this weak
model: the better a weak learner performs, the more it contributes to the strong learner.
So, assume that we are facing a binary classification problem, with N observations in our
dataset, and that we want to use the adaboost algorithm with a given family of weak models. At the
very beginning of the algorithm (first model of the sequence), all the observations have
the same weights 1/N. Then, we repeat L times (for the L learners in the sequence) the
following steps:
• fit the best possible weak model with the current observation weights;
• compute the value of the update coefficient, a scalar evaluation metric of the weak learner
that indicates how much this weak learner should be taken into account in the ensemble model;
• update the strong learner by adding the new weak learner multiplied by its update coefficient;
• compute new observation weights that express which observations we would like to focus
on at the next iteration (weights of observations wrongly predicted by the aggregated model
increase and weights of the correctly predicted observations decrease).
Repeating these steps, we then sequentially build our L models and aggregate them into
a simple linear combination weighted by coefficients expressing the performance of each
learner.
Notice that there exist variants of the initial adaboost algorithm, such as LogitBoost (classification)
or L2Boost (regression), that mainly differ by their choice of loss function.
Adaboost updates weights of the observations at each iteration. Weights of well classified obser-
vations decrease relatively to weights of misclassified observations. Models that perform better
have higher weights in the final ensemble model.
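A minimal numpy/scikit-learn sketch of this procedure (discrete AdaBoost with decision stumps,
assuming binary labels coded as -1/+1; this is an illustration, not scikit-learn's
AdaBoostClassifier implementation):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, L=50):
    """Discrete AdaBoost: y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1 / n)                      # all observations start with weight 1/N
    learners, alphas = [], []
    for _ in range(L):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # fit the weak model with the current weights
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # update coefficient of this learner
        w = w * np.exp(-alpha * y * pred)      # increase weights of misclassified observations
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # weighted vote of the weak learners
    return np.sign(sum(a * clf.predict(X) for a, clf in zip(alphas, learners)))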
2/ Gradient boosting
In gradient boosting, the ensemble model we try to build is also a weighted sum of weak
learners
s_L(.) = Σ_{l=1..L} c_l × w_l(.)
Just as we mentioned for adaboost, finding the optimal model under this form is too difficult
and an iterative approach is required. The main difference with adaptive boosting is in
the definition of the sequential optimisation process. Indeed, gradient boosting casts the
problem into a gradient descent one: at each iteration we fit a weak learner to the opposite
of the gradient of the current fitting error with respect to the current ensemble model.
Let's try to clarify this last point. First, the theoretical gradient descent process over the ensemble
model can be written
s_l(.) = s_(l-1)(.) − c_l × ∇_{s_(l-1)} E(s_(l-1))(.)
where E(.) is the fitting error of the given model, c_l is a coefficient corresponding to the step
size and
− ∇_{s_(l-1)} E(s_(l-1))(.)
This entity is the opposite of the gradient of the fitting error with respect to the ensemble
model at step l-1. This opposite of the gradient is a function that can, in practice, only be
evaluated for observations in the training dataset (for which we know inputs and outputs):
these evaluations are called pseudo-residuals attached to each observation. Moreover, even if
we know for the observations the values of these pseudo-residuals, we don’t want to add to
our ensemble model any kind of function: we only want to add a new instance of weak model.
So, the natural thing to do is to fit a weak learner to the pseudo-residuals computed for each
observation. Finally, the coefficient c_l is computed following a one dimensional optimisation
process (line-search to obtain the best step size c_l).
So, assume that we want to use the gradient boosting technique with a given family of weak models.
At the very beginning of the algorithm (first model of the sequence), the pseudo-residuals are
set equal to the observation values. Then, we repeat L times (for the L models of the sequence)
the following steps:
• fit the best possible weak model to the pseudo-residuals (approximate the opposite of the
gradient with respect to the current strong learner);
• compute the value of the optimal step size that defines by how much we update the ensemble
model in the direction of the new weak learner;
• update the ensemble model by adding the new weak learner multiplied by the step size (make
a step of gradient descent);
• compute new pseudo-residuals that indicate, for each observation, in which direction we
would like to update the ensemble model predictions next.
Repeating these steps, we then sequentially build our L models and aggregate them following
a gradient descent approach. Notice that, while adaptive boosting tries to solve at
each iteration exactly the "local" optimisation problem (find the best weak learner and its
coefficient to add to the strong model), gradient boosting uses instead a gradient descent
approach and can more easily be adapted to a large number of loss functions. Thus, gradient
boosting can be considered as a generalization of adaboost to arbitrary differentiable
loss functions.
Note: XGBoost (Extreme Gradient Boosting) is an algorithm that gained huge popularity through
Kaggle competitions. It is a gradient boosting algorithm with more flexibility (varying number
of terminal nodes and leaf weights) to avoid correlations between sub-learners. Thanks to these
qualities, XGBoost is one of the most used algorithms in data science. LightGBM is a more recent
implementation of the same idea published by Microsoft; with equivalent parameters it gives
comparable scores but typically runs faster than classic XGBoost.
Gradient boosting updates values of the observations at each iteration. Weak learners are
trained to fit the pseudo-residuals that indicate in which direction to correct the current en-
semble model predictions to lower the error.
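A minimal sketch of gradient boosting for regression with the squared loss, where each shallow
tree is fitted to the pseudo-residuals (an illustration, not scikit-learn's
GradientBoostingRegressor implementation):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, L=100, nu=0.1):
    f0 = np.mean(y)                      # initial constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(L):
        residuals = y - pred             # pseudo-residuals for the squared loss
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)           # weak learner fitted to the pseudo-residuals
        pred += nu * tree.predict(X)     # gradient descent step in function space
        trees.append(tree)
    return f0, trees

def gradient_boosting_predict(X, f0, trees, nu=0.1):
    return f0 + nu * sum(tree.predict(X) for tree in trees)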
Examples
Here, we try an example of boosting and compare it to bagging. Both algorithms use the same
weak learners to build the ensemble model.
• Adaboost Classifier
# Imports and train/test split (assumed setup for this example)
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
# Transforming string Target to an int
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))
train_x, test_x, train_y, test_y = train_test_split(x, binary_encoded_y, random_state=1)
clf_boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200
)
clf_boosting.fit(train_x, train_y)
predictions = clf_boosting.predict(test_x)
print("For Boosting : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions), 2), round(accuracy_score(test_y, predictions), 2)))
breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
# Transforming string Target to an int
encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))
# Assumed completion: bagging counterpart built from the same weak learners (depth-1 trees)
from sklearn.ensemble import RandomForestClassifier
clf_bagging = RandomForestClassifier(n_estimators=200, max_depth=1)
clf_bagging.fit(train_x, train_y)
predictions = clf_bagging.predict(test_x)
print("For Bagging : F1 Score {}, Accuracy {}".format(round(f1_score(test_y, predictions), 2), round(accuracy_score(test_y, predictions), 2)))
Comparison
5.8.4 Stacking
Stacking mainly differs from bagging and boosting on two points:
• First, stacking often considers heterogeneous weak learners (different learning algorithms are
combined), whereas bagging and boosting consider mainly homogeneous weak learners.
• Second, stacking learns to combine the base models using a meta-model, whereas bagging and
boosting combine weak learners following deterministic algorithms.
As we already mentioned, the idea of stacking is to learn several different weak learners and
combine them by training a meta-model to output predictions based on the multiple predic-
tions returned by these weak models. So, we need to define two things in order to build our
stacking model: the L learners we want to fit and the meta-model that combines them.
For example, for a classification problem, we can choose as weak learners a KNN classifier, a
logistic regression and an SVM, and decide to learn a neural network as meta-model. Then, the
neural network will take as inputs the outputs of our three weak learners and will learn to
return final predictions based on them.
So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we have
to follow the steps thereafter:
• split the training data in two folds
• choose L weak learners and fit them to data of the first fold
• for each of the L weak learners, make predictions for observations in the second fold
• fit the meta-model on the second fold, using predictions made by the weak learners
as inputs
In the previous steps, we split the dataset in two folds because predictions on data that have
been used for the training of the weak learners are not relevant for the training of the meta-
model.
Stacking consists in training a meta-model to produce outputs based on the outputs returned
by some lower layer weak learners.
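scikit-learn provides this scheme directly through StackingClassifier, which handles the
fold-splitting internally via its cv argument. A minimal sketch on a toy dataset (the estimator
choices here are illustrative, not taken from the example further below):
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

Xs, ys = make_classification(n_samples=300, random_state=42)
stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('svc', SVC(probability=True)),
                ('lr', LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-model
    cv=5)                                  # internal folds used to produce the meta-features
print("Stacking CV accuracy:", cross_val_score(stack, Xs, ys, cv=5).mean())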
A possible extension of stacking is multi-level stacking: it consists in doing stacking with
multiple layers.
Multi-level stacking considers several layers of stacking: some meta-models are trained on
outputs returned by lower-layer meta-models, and so on; one can, for example, build a 3-layer
stacking model.
Examples
Here, we try an example of stacking and compare it to bagging and boosting. Note that many
other applications (datasets) would show more difference between these techniques.
breast_cancer = load_breast_cancer()
x = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
boosting_clf_ada_boost= AdaBoostClassifier(
DecisionTreeClassifier(max_depth=1),
n_estimators=3
)
bagging_clf_rf = RandomForestClassifier(n_estimators=200, max_depth=1, random_state=2020)
clf_logistic_reg = LogisticRegression(solver='liblinear',random_state=2020)
class numberOfClassifierException(Exception):
    """Raised when fewer than two classifiers are given to the stacking ensemble."""
    pass

class Stacking:
    """Stack the first classifiers' predicted probabilities as extra features for the
    last classifier, which acts as the meta-model."""

    def __init__(self, classifiers):
        if len(classifiers) < 2:
            raise numberOfClassifierException(
                "You must fit your classifier with 2 classifiers at least")
        else:
            self._classifiers = classifiers

    def fit(self, data_x, data_y):
        stacked_data_x = data_x.copy()
        for classifier in self._classifiers[:-1]:
            classifier.fit(data_x, data_y)
            stacked_data_x = np.column_stack((stacked_data_x, classifier.predict_proba(data_x)))
        last_classifier = self._classifiers[-1]
        last_classifier.fit(stacked_data_x, data_y)

    def predict(self, data_x):
        stacked_data_x = data_x.copy()
        for classifier in self._classifiers[:-1]:
            prob_predictions = classifier.predict_proba(data_x)
            stacked_data_x = np.column_stack((stacked_data_x, prob_predictions))
        last_classifier = self._classifiers[-1]
        return last_classifier.predict(stacked_data_x)
bagging_clf_rf.fit(train_x, train_y)
boosting_clf_ada_boost.fit(train_x, train_y)
classifiers_list = [bagging_clf_rf, boosting_clf_ada_boost, clf_logistic_reg]
clf_stacking = Stacking(classifiers_list)
clf_stacking.fit(train_x, train_y)
predictions_bagging = bagging_clf_rf.predict(test_x)
predictions_boosting = boosting_clf_ada_boost.predict(test_x)
predictions_stacking = clf_stacking.predict(test_x)
Comparison
5.9.1 Introduction
Consider the 3-dimensional graph below in the context of a cost function. Our goal is to move
from the mountain in the top right corner (high cost) to the dark blue sea in the bottom
left (low cost). The arrows represent the direction of steepest descent (negative gradient)
from any given point, i.e. the direction that decreases the cost function as quickly as possible.
Gradient descent intuition.
Starting at the top of the mountain, we take our first step downhill in the direction specified by
the negative gradient. Next we recalculate the negative gradient (passing in the coordinates
of our new point) and take another step in the direction it specifies. We continue this process
iteratively until we get to the bottom of our graph, or to a point where we can no longer
move downhill, i.e. a local minimum.
Learning rate
The size of these steps is called the learning rate. With a high learning rate we can cover
more ground each step, but we risk overshooting the lowest point since the slope of the hill is
constantly changing. With a very low learning rate, we can confidently move in the direction
of the negative gradient since we are recalculating it so frequently. A low learning rate is
more precise, but calculating the gradient is time-consuming, so it will take us a very long
time to get to the bottom.
Cost function
A Loss Function (Error function) tells us “how good” our model is at making predictions for
a given set of parameters. The cost function has its own curve and its own gradients. The
slope of this curve tells us how to update our parameters to make the model more accurate.
To solve for the gradient, we iterate through our data points using our 𝛽1 and 𝛽0 values and
compute the partial derivatives. This new gradient tells us the slope of our cost function at our
current position (current parameter values) and the direction we should move to update our
parameters. The size of our update is controlled by the learning rate.
Pseudocode of this algorithm:
m = 1
b = 1
data_length = length(X)
loop it = 1 --> number_iterations:
    m_deriv = 0
    b_deriv = 0
    loop i = 1 --> data_length:
        m_deriv = m_deriv + X[i] * ((m*X[i] + b) - Y[i])
        b_deriv = b_deriv + ((m*X[i] + b) - Y[i])
    m = m - (m_deriv / data_length) * learning_rate
    b = b - (b_deriv / data_length) * learning_rate
return m, b
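A runnable Python version of this pseudocode (a minimal sketch; the factor 2 comes from
differentiating the mean squared error, and the toy data below are made up for illustration):
import numpy as np

def gradient_descent(X, Y, learning_rate=0.5, n_iterations=1000):
    m, b = 0.0, 0.0
    N = len(X)
    for _ in range(n_iterations):
        error = (m * X + b) - Y
        m_deriv = (2 / N) * np.sum(X * error)   # dJ/dm for the mean squared error
        b_deriv = (2 / N) * np.sum(error)       # dJ/db
        m -= learning_rate * m_deriv
        b -= learning_rate * b_deriv
    return m, b

# Toy check: recover slope ~2 and intercept ~1 from noisy data
rng = np.random.RandomState(0)
x_toy = rng.rand(100)
y_toy = 2 * x_toy + 1 + 0.1 * rng.randn(100)
print(gradient_descent(x_toy, y_toy))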
There are three variants of gradient descent, which differ in how much data we use to
compute the gradient of the objective function. Depending on the amount of data, we make
a trade-off between the accuracy of the parameter update and the time it takes to perform
an update.
Batch gradient descent, known also as Vanilla gradient descent, computes the gradient of the
cost function with respect to the parameters 𝜃 for the entire training dataset :
𝜃 = 𝜃 − 𝜂 · ∇𝜃 𝐽(𝜃)
As we need to calculate the gradients for the whole dataset to perform just one update, batch
gradient descent can be very slow and is intractable for datasets that don’t fit in memory.
Batch gradient descent also doesn’t allow us to update our model online.
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training
example 𝑥(𝑖) and label 𝑦(𝑖):
• Choose an initial vector of parameters 𝜃 and learning rate 𝜂.
• Repeat until an approximate minimum is obtained:
– Randomly shuffle examples in the training set.
– For 𝑖 ∈ 1, . . . , 𝑛: 𝜃 = 𝜃 − 𝜂 · ∇𝜃 𝐽(𝜃; 𝑥(𝑖) ; 𝑦(𝑖) )
Batch gradient descent performs redundant computations for large datasets, as it recom-
putes gradients for similar examples before each parameter update. SGD does away with
this redundancy by performing one update at a time. It is therefore usually much faster
and can also be used to learn online. SGD performs frequent updates with a high variance
that cause the objective function to fluctuate heavily as in the image below.
SGD fluctuation.
While batch gradient descent converges to the minimum of the basin the parameters are
placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially
better local minima. On the other hand, this ultimately complicates convergence to the
exact minimum, as SGD will keep overshooting. However, it has been shown that when we
slowly decrease the learning rate, SGD shows the same convergence behaviour as batch
gradient descent, almost certainly converging to a local or the global minimum for non-
convex and convex optimization respectively.
Mini-batch gradient descent finally takes the best of both worlds and performs an update for
every mini-batch of 𝑛 training examples:
𝜃 = 𝜃 − 𝜂 · ∇𝜃 𝐽(𝜃; 𝑥(𝑖:𝑖+𝑛) ; 𝑦(𝑖:𝑖+𝑛) )
This way, it :
• reduces the variance of the parameter updates, which can lead to more stable con-
vergence.
• can make use of highly optimized matrix optimizations common to state-of-the-art deep
learning libraries that make computing the gradient very efficient. Common mini-batch
sizes range between 50 and 256, but can vary for different applications.
Mini-batch gradient descent is typically the algorithm of choice when training a neural
network.
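Continuing the toy linear-regression sketch above, mini-batch SGD shuffles the data each epoch
and updates the parameters on small batches (batch size and learning rate here are arbitrary):
import numpy as np

def minibatch_sgd(x, y, learning_rate=0.1, n_epochs=100, batch_size=32):
    m, b = 0.0, 0.0
    N = len(x)
    rng = np.random.RandomState(42)
    for _ in range(n_epochs):
        idx = rng.permutation(N)                  # shuffle examples at each epoch
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            error = (m * x[batch] + b) - y[batch]
            m -= learning_rate * 2 * np.mean(x[batch] * error)   # gradient on the mini-batch only
            b -= learning_rate * 2 * np.mean(error)
    return m, b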
Vanilla mini-batch gradient descent, however, does not guarantee good convergence and raises
a few challenges that need to be addressed:
• Choosing a proper learning rate can be difficult. A learning rate that is too small leads
to painfully slow convergence, while a learning rate that is too large can hinder conver-
gence and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules try to adjust the learning rate during training by e.g. an-
nealing, i.e. reducing the learning rate according to a pre-defined schedule or when the
change in objective between epochs falls below a threshold. These schedules and thresh-
olds, however, have to be defined in advance and are thus unable to adapt to a dataset’s
characteristics.
• Additionally, the same learning rate applies to all parameter updates. If our data is sparse
and our features have very different frequencies, we might not want to update all of
them to the same extent, but perform a larger update for rarely occurring features.
• Another key challenge of minimizing highly non-convex error functions common for
neural networks is avoiding getting trapped in their numerous suboptimal local minima.
These saddle points (local minima) are usually surrounded by a plateau of the
same error, which makes it notoriously hard for SGD to escape, as the gradient is close
to zero in all dimensions.
In the following, we will outline some algorithms that are widely used by the deep learning
community to deal with the aforementioned challenges.
Momentum
SGD has trouble navigating ravines (areas where the surface curves much more steeply in
one dimension than in another), which are common around local optima. In these scenarios,
SGD oscillates across the slopes of the ravine while only making hesitant progress along
the bottom towards the local optimum as in the image below.
Momentum is a method that helps accelerate SGD in the relevant direction and dampens
oscillations. It does this by adding a fraction 𝜌 of the update vector of the past time step to the
current update vector:
𝑣𝑡 = 𝜌𝑣𝑡−1 + ∇𝜃 𝐽(𝜃)
(5.50)
𝜃 = 𝜃 − 𝑣𝑡
# Pseudo-Python: gradient(J, x) returns dJ/dx; rho and learning_rate are hyperparameters
vx = 0
while True:
    dx = gradient(J, x)
    vx = rho * vx + dx           # accumulate a velocity in the gradient direction
    x -= learning_rate * vx      # step along the velocity
Note: The momentum term 𝜌 is usually set to 0.9 or a similar value.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum
as it rolls downhill, becoming faster and faster on the way (until it reaches its
terminal velocity if there is air resistance, i.e. 𝜌 < 1).
The same thing happens to our parameter updates: The momentum term increases for dimen-
sions whose gradients point in the same directions and reduces updates for dimensions
whose gradients change directions. As a result, we gain faster convergence and reduced
oscillation.
AdaGrad
• Added element-wise scaling of the gradient based on the historical sum of squares in each
dimension.
• "Per-parameter learning rates" or "adaptive learning rates".
grad_squared = 0
while True:
    dx = gradient(J, x)
    grad_squared += dx * dx                                    # historical sum of squares
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # per-parameter scaling
RMSProp
grad_squared = 0
while True:
    dx = gradient(J, x)
    # leaky accumulation of squared gradients (decay_rate is typically 0.9)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d
like to have a smarter ball, a ball that has a notion of where it is going so that it knows to
slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) is a way to
give our momentum term this kind of prescience. We know that we will use our momentum
term 𝛾𝑣𝑡−1 to move the parameters 𝜃.
Computing 𝜃 − 𝛾𝑣𝑡−1 thus gives us an approximation of the next position of the parameters
(the gradient is missing for the full update), a rough idea of where our parameters are going to
be. We can now effectively look ahead by calculating the gradient not w.r.t. our current
parameters 𝜃 but w.r.t. the approximate future position of our parameters:
𝑣𝑡 = 𝛾𝑣𝑡−1 + 𝜂 ∇𝜃 𝐽(𝜃 − 𝛾𝑣𝑡−1)
𝜃 = 𝜃 − 𝑣𝑡
Again, we set the momentum term 𝛾 to a value of around 0.9. While Momentum first com-
putes the current gradient and then takes a big jump in the direction of the updated
accumulated gradient , NAG first makes a big jump in the direction of the previous ac-
cumulated gradient, measures the gradient and then makes a correction, which results
in the complete NAG update. This anticipatory update prevents us from going too fast and
results in increased responsiveness, which has significantly increased the performance of
RNNs on a number of tasks
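In the same pseudo-Python style as the momentum snippet above (gradient(J, .), rho and
learning_rate are assumed to be defined), the look-ahead version reads:
vx = 0
while True:
    dx_ahead = gradient(J, x + rho * vx)   # gradient at the approximate future position
    vx = rho * vx - learning_rate * dx_ahead
    x += vx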
Adam
Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for
each parameter. In addition to storing an exponentially decaying average of past squared
gradients 𝑣𝑡, Adam also keeps an exponentially decaying average of past gradients 𝑚𝑡,
similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam
behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
We compute the decaying averages of past and past squared gradients 𝑚𝑡 and 𝑣𝑡 respectively
as follows:
𝑚𝑡 = 𝛽1 𝑚𝑡−1 + (1 − 𝛽1 ) 𝑔𝑡
𝑣𝑡 = 𝛽2 𝑣𝑡−1 + (1 − 𝛽2 ) 𝑔𝑡²
where 𝑔𝑡 is the gradient at step 𝑡. 𝑚𝑡 and 𝑣𝑡 are estimates of the first moment (the mean) and
the second moment (the uncentered variance) of the gradients respectively, hence the name of
the method. Adam without the bias correction described below ("Adam (almost)"):
first_moment = 0
second_moment = 0
while True:
dx = gradient(J, x)
# Momentum:
first_moment = beta1 * first_moment + (1 - beta1) * dx
# AdaGrad/RMSProp
second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
As 𝑚𝑡 and 𝑣𝑡 are initialized as vectors of 0’s, the authors of Adam observe that they are biased
towards zero, especially during the initial time steps, and especially when the decay rates are
small (i.e. 𝛽1 and 𝛽2 are close to 1). They counteract these biases by computing bias-corrected
first and second moment estimates:
𝑚̂𝑡 = 𝑚𝑡 / (1 − 𝛽1^𝑡)    (5.53)
𝑣̂𝑡 = 𝑣𝑡 / (1 − 𝛽2^𝑡)    (5.54)
They then use these to update the parameters (Adam update rule):
𝜃𝑡+1 = 𝜃𝑡 − 𝜂 𝑚̂𝑡 / (√𝑣̂𝑡 + 𝜖)
• 𝑚̂𝑡 : accumulated gradient (velocity).
• 𝑣̂𝑡 : element-wise scaling of the gradient based on the historical sum of squares in each
dimension.
• Choose Adam as the default optimizer.
• Default values: 0.9 for 𝛽1, 0.999 for 𝛽2, and 10^-7 for 𝜖.
• Learning rate in a range between 1e-3 and 5e-4.
A full Adam update with bias correction is sketched below.
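Putting the pieces together, a sketch of the full Adam update with bias correction, in the same
pseudo-Python style as above (gradient(J, .), beta1, beta2, learning_rate and num_iterations are
assumed to be defined):
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = gradient(J, x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # m_t (momentum)
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # v_t (RMSProp-like)
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias-corrected m_t
    second_unbias = second_moment / (1 - beta2 ** t)                # bias-corrected v_t
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)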
This lab is inspired by a scikit-learn lab: Faces recognition example using eigenfaces and SVMs.
It uses scikit-learn and pytorch models using skorch (slides).
• skorch provides a scikit-learn compatible neural network library that wraps PyTorch.
• skorch abstracts away the training loop, making a lot of boilerplate code obsolete. A
simple net.fit(X, y) is enough.
Note that more sophisticated models can be used; see the referenced overview.
Models:
• Eigenfaces: unsupervised exploratory analysis.
• LogisticRegression with L2 regularization (includes model selection with 5-fold CV).
• SVM-RBF (includes model selection with 5-fold CV).
• MLP using sklearn (includes model selection with 5-fold CV).
• MLP using skorch classifier.
• Basic ConvNet using skorch.
• Pretrained ResNet18 using skorch.
Pipelines:
• Univariate feature filtering (Anova) with Logistic-L2
• PCA with LogisticRegression with L2 regularization
import numpy as np
from time import time
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Preprocesing
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# Dataset
from sklearn.datasets import fetch_lfw_people
# Models
from sklearn.decomposition import PCA
import sklearn.manifold as manifold
import sklearn.linear_model as lm
import sklearn.svm as svm
from sklearn.neural_network import MLPClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import GradientBoostingClassifier
# Pytorch Models
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier
import skorch
5.10.1 Utils
# for machine learning we use the 2 data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
Out:
Out:
{'Ariel Sharon': 0.06, 'Colin Powell': 0.18, 'Donald Rumsfeld': 0.09, 'George W Bush': 0.41, 'Gerhard Schroeder': 0.08, 'Hugo Chavez': 0.05, 'Tony Blair': 0.11}
single_faces[::5, :, :] = mean_faces
titles = [n for name in target_names for n in [name] * 5]
plot_gallery(single_faces, titles, h, w, n_row=n_classes, n_col=5)
5.10.4 Eigenfaces
Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled dataset): unsupervised
feature extraction / dimensionality reduction
n_components = 150
Out:
T-SNE
Out:
Plot eigenfaces:
Our goal is to obtain a good balanced accuracy, i.e. the macro average (macro avg) of the classes'
recalls. In this perspective, the good practices are:
• Scale input features using either StandardScaler() or MinMaxScaler(); "it doesn't harm".
• Re-balance classes' contributions with class_weight='balanced'.
• Do not include an intercept (fit_intercept=False) in the model. This should reduce the
global accuracy (weighted avg). But remember that we decided to maximize the balanced
accuracy.
lrl2_cv = make_pipeline(
preprocessing.StandardScaler(),
# preprocessing.MinMaxScaler(), # Would have done the job either
GridSearchCV(lm.LogisticRegression(max_iter=1000, class_weight='balanced',
t0 = time()
lrl2_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
print(lrl2_cv.steps[-1][1].best_params_)
y_pred = lrl2_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
Out:
done in 5.383s
Best params found by grid search:
{'C': 1.0}
precision recall f1-score support
[[ 17 0 1 0 0 1 0]
[ 2 49 3 3 0 0 2]
[ 3 0 24 1 0 1 1]
[ 7 3 4 107 5 3 4]
[ 0 0 1 0 21 1 4]
[ 0 2 0 3 2 10 1]
[ 0 0 1 3 2 0 30]]
Coefficients
coefs = lrl2_cv.steps[-1][1].best_estimator_.coef_
coefs = coefs.reshape(-1, h, w)
plot_gallery(coefs, target_names, h, w)
Remarks:
• RBF generally requires a "large" C (>1)
• Poly generally requires a "small" C (<1)
svm_cv = make_pipeline(
# preprocessing.StandardScaler(),
preprocessing.MinMaxScaler(),
GridSearchCV(svm.SVC(class_weight='balanced'),
{'kernel': ['poly', 'rbf'], 'C': 10. ** np.arange(-2, 3)},
# {'kernel': ['rbf'], 'C': 10. ** np.arange(-1, 4)},
cv=5, n_jobs=5))
t0 = time()
svm_cv.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
print(svm_cv.steps[-1][1].best_params_)
y_pred = svm_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 23.788s
Best params found by grid search:
{'C': 0.1, 'kernel': 'poly'}
precision recall f1-score support
mlp_param_grid = {"hidden_layer_sizes":
# Configurations with 1 hidden layer:
[(100, ), (50, ), (25, ), (10, ), (5, ),
# Configurations with 2 hidden layers:
(100, 50, ), (50, 25, ), (25, 10, ), (10, 5, ),
# Configurations with 3 hidden layers:
(100, 50, 25, ), (50, 25, 10, ), (25, 10, 5, )],
"activation": ["relu"], "solver": ["adam"], 'alpha': [0.0001]}
mlp_cv = make_pipeline(
# preprocessing.StandardScaler(),
t0 = time()
mlp_cv.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
print(mlp_cv.steps[-1][1].best_params_)
y_pred = mlp_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 128.638s
Best params found by grid search:
{'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'solver': 'adam'}
class SimpleMLPClassifierPytorch(nn.Module):
"""Simple (one hidden layer) MLP Classifier with Pytorch."""
def __init__(self):
super(SimpleMLPClassifierPytorch, self).__init__()
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
t0 = time()
mlp.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = mlp.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 3.142s
precision recall f1-score support
anova_l2lr = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('anova', SelectKBest(f_classif)),
('l2lr', lm.LogisticRegression(max_iter=1000, class_weight='balanced',
fit_intercept=False))
])
t0 = time()
anova_l2lr_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = anova_l2lr_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 18.828s
Best params found by grid search:
{'anova__k': 1850, 'l2lr__C': 100.0}
precision recall f1-score support
pca_lrl2_cv = make_pipeline(
PCA(n_components=150, svd_solver='randomized', whiten=True),
GridSearchCV(lm.LogisticRegression(max_iter=1000, class_weight='balanced',
fit_intercept=False),
{'C': 10. ** np.arange(-3, 3)},
cv=5, n_jobs=5))
t0 = time()
pca_lrl2_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = pca_lrl2_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
Out:
done in 0.333s
Best params found by grid search:
{'C': 0.1}
precision recall f1-score support
[[17 0 1 0 0 1 0]
[ 4 46 2 3 1 0 3]
[ 3 1 23 0 0 2 1]
[ 8 5 8 94 4 12 2]
[ 1 0 0 1 20 0 5]
[ 0 1 0 1 3 12 1]
[ 1 1 1 1 5 0 27]]
Note that, to simplify, we do not use a pipeline (scaler + CNN) here. But it would have been
simple to do so, since the pytorch model is wrapped in a skorch object that is compatible with
sklearn.
Sources:
• ConvNet on MNIST
• NeuralNetClassifier
class Cnn(nn.Module):
"""Basic ConvNet Conv(1, 32, 64) -> FC(100, 7) -> softmax."""
x = torch.relu(self.fc1_drop(self.fc1(x)))
x = torch.softmax(self.fc2(x), dim=-1)
return x
torch.manual_seed(0)
cnn = NeuralNetClassifier(
Cnn,
max_epochs=100,
lr=0.001,
optimizer=torch.optim.Adam,
device=device,
train_split=skorch.dataset.CVSplit(cv=5, stratified=True),
verbose=0)
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train).reshape(-1, 1, h, w)
X_test_s = scaler.transform(X_test).reshape(-1, 1, h, w)
t0 = time()
cnn.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = cnn.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 39.086s
precision recall f1-score support
class Resnet18(nn.Module):
"""ResNet 18, pretrained, with one input chanel and 7 outputs."""
# self.model = torchvision.models.resnet18()
self.model = torchvision.models.resnet18(pretrained=True)
# Last layer
num_ftrs = self.model.fc.in_features
self.model.fc = nn.Linear(num_ftrs, n_outputs)
torch.manual_seed(0)
resnet = NeuralNetClassifier(
Resnet18,
# `CrossEntropyLoss` combines `LogSoftmax and `NLLLoss`
criterion=nn.CrossEntropyLoss,
max_epochs=50,
batch_size=128, # default value
optimizer=torch.optim.Adam,
# optimizer=torch.optim.SGD,
optimizer__lr=0.001,
optimizer__betas=(0.9, 0.999),
optimizer__eps=1e-4,
optimizer__weight_decay=0.0001, # L2 regularization
# Shuffle training data on each epoch
# iterator_train__shuffle=True,
train_split=skorch.dataset.CVSplit(cv=5, stratified=True),
device=device,
verbose=0)
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train).reshape(-1, 1, h, w)
X_test_s = scaler.transform(X_test).reshape(-1, 1, h, w)
t0 = time()
resnet.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = resnet.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
Out:
done in 116.626s
precision recall f1-score support
SIX
DEEP LEARNING
6.1 Backpropagation
%matplotlib inline
Y = max(X W^(1), 0) W^(2)
A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from
x using Euclidean error.
Chain rule
𝑥 → 𝑧 (1) = 𝑥𝑇 𝑤(1) → ℎ(1) = max(𝑧 (1) , 0) → 𝑧 (2) = ℎ(1)𝑇 𝑤(2) → 𝐿(𝑧 (2) , 𝑦) = (𝑧 (2) − 𝑦)2
𝑤(1) ↗ 𝑤(2) ↗
∂z^(1)/∂w^(1) = x,    ∂h^(1)/∂z^(1) = {1 if z^(1) > 0, else 0},    ∂z^(2)/∂w^(2) = h^(1),    ∂L/∂z^(2) = 2(z^(2) − y)
∂z^(1)/∂x = w^(1),    ∂z^(2)/∂h^(1) = w^(2)
Backward: compute gradient of the loss given each parameters vectors applying chaine rule
from the loss downstream to the parameters:
For 𝑤(2) :
∂L/∂w^(2) = (∂L/∂z^(2)) (∂z^(2)/∂w^(2))    (6.1)
          = 2(z^(2) − y) h^(1)    (6.2)
For 𝑤(1) :
∂L/∂w^(1) = (∂L/∂z^(2)) (∂z^(2)/∂h^(1)) (∂h^(1)/∂z^(1)) (∂z^(1)/∂w^(1))    (6.3)
          = 2(z^(2) − y) w^(2) 1{z^(1) > 0} x    (6.4)
Given a function 𝑧 = 𝑥 𝑤 with 𝑧 the output, 𝑥 the input and 𝑤 the coefficients:
• Scalar to Scalar: 𝑥 ∈ R, 𝑧 ∈ R, 𝑤 ∈ R
Regular derivative:
∂z/∂w = x ∈ R
If 𝑤 changes by a small amount, how much will 𝑧 change?
• Vector to Scalar: 𝑥 ∈ R𝑁 , 𝑧 ∈ R, 𝑤 ∈ R𝑁
Derivative is the gradient of partial derivatives: ∂z/∂w ∈ R^N
∂z/∂w = ∇_w z = [∂z/∂w_1, ..., ∂z/∂w_i, ..., ∂z/∂w_N]^T    (6.5)
For each element 𝑤𝑖 of 𝑤, if it changes by a small amount then how much will y change?
• Vector to Vector: 𝑤 ∈ R𝑁 , 𝑧 ∈ R𝑀
Derivative is the Jacobian of partial derivatives:
∂z/∂w ∈ R^(N×M)
TO COMPLETE
Backpropagation summary
Backpropagation algorithm in a graph:
1. Forward pass: for each node, compute the local partial derivatives of the output given the inputs.
2. Backward pass: apply the chain rule from the end back to each parameter:
• update the parameter with gradient descent using the current upstream gradient and the
current local gradient;
• compute the upstream gradient for the backward nodes.
Think locally and remember that at each node:
• For the loss, the gradient is the error.
• At each step, the upstream gradient is obtained by multiplying the upstream gradient (an
error) with the current parameters (vector or matrix).
• At each step, the current local gradient equals the input, therefore the current update is the
current upstream gradient times the input.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection
iris = sns.load_dataset("iris")
#g = sns.pairplot(iris, hue="species")
df = iris[iris.species != "setosa"]
g = sns.pairplot(df, hue="species")
df['species_n'] = iris.species.map({'versicolor':1, 'virginica':2})
# Extract features and target (assumed from the variable names used below)
X_iris = df.loc[:, ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
Y_iris = df.loc[:, ['species_n']].values.astype(float)
# Scale
from sklearn.preprocessing import StandardScaler
scalerx, scalery = StandardScaler(), StandardScaler()
X_iris = scalerx.fit_transform(X_iris)
Y_iris = StandardScaler().fit_transform(Y_iris)
/home/edouard/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
This implementation uses numpy to manually compute the forward pass, loss, and backward
pass.
lr, nite = 1e-4, 50   # learning rate and number of iterations used for this run
W1 = np.random.randn(D_in, H)
W2 = np.random.randn(H, D_out)
losses_tr, losses_val = list(), list()
learning_rate = lr
for t in range(nite):
    # Forward pass: compute predicted y
    z1 = X.dot(W1)
    h1 = np.maximum(z1, 0)
    Y_pred = h1.dot(W2)
    # Training loss (sum of squared errors)
    loss = np.square(Y_pred - Y).sum()
    # Backward pass (reconstructed following the derivation above): gradients w.r.t. W2 and W1
    grad_y_pred = 2.0 * (Y_pred - Y)
    grad_w2 = h1.T.dot(grad_y_pred)
    grad_h1 = grad_y_pred.dot(W2.T)
    grad_z1 = grad_h1.copy()
    grad_z1[z1 < 0] = 0                      # gradient of the ReLU
    grad_w1 = X.T.dot(grad_z1)
    # Update weights
    W1 -= learning_rate * grad_w1
    W2 -= learning_rate * grad_w2
    # Validation loss
    loss_val = np.square(np.maximum(X_val.dot(W1), 0).dot(W2) - Y_val).sum()
    losses_tr.append(loss)
    losses_val.append(loss_val)
    if t % 10 == 0:
        print(t, loss, loss_val)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b", np.arange(len(losses_val)), losses_val, "-r")
0 15126.224825529907 2910.260853330454
10 71.5381374591153 104.97056197642135
20 50.756938353833334 80.02800827986354
30 46.546510744624236 72.85211241738614
40 44.41413064447564 69.31127324764276
[<matplotlib.lines.Line2D at 0x7f960cf5e9b0>,
<matplotlib.lines.Line2D at 0x7f960cf5eb00>]
source
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical compu-
tations. For modern deep neural networks, GPUs often provide speedups of 50x or greater,
so unfortunately numpy won’t be enough for modern deep learning. Here we introduce the
most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a
numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for op-
erating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph
and gradients, but they’re also useful as a generic tool for scientific computing. Also unlike
numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a
PyTorch Tensor on GPU, you simply need to cast it to a new datatype. Here we use PyTorch
Tensors to fit a two-layer network to random data. Like the numpy example above we need to
manually implement the forward and backward passes through the network:
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y
z1 = X.mm(W1)
h1 = z1.clamp(min=0)
y_pred = h1.mm(W2)
losses_tr.append(loss)
losses_val.append(loss_val)
if t % 10 == 0:
print(t, loss, loss_val)
lr=1e-4, nite=50)
0 8086.1591796875 5429.57275390625
10 225.77589416503906 331.83734130859375
20 86.46501159667969 117.72447204589844
30 52.375606536865234 73.84156036376953
40 43.16458511352539 64.0667495727539
[<matplotlib.lines.Line2D at 0x7f960033c470>,
<matplotlib.lines.Line2D at 0x7f960033c5c0>]
source
A fully-connected ReLU network with one hidden layer and no biases, trained to predict y
from x by minimizing squared Euclidean distance. This implementation computes the forward
pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients.
A PyTorch Tensor represents a node in a computational graph. If x is a Tensor that has x.
requires_grad=True then x.grad is another Tensor holding the gradient of x with respect to
some scalar value.
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y using operations on Tensors; these
# are exactly the same operations we used to compute the forward pass␣
˓→using
# Use autograd to compute the backward pass. This call will compute the
# gradient of loss with respect to all Tensors with requires_grad=True.
# After this call w1.grad and w2.grad will be Tensors holding the gradient
# of the loss with respect to w1 and w2 respectively.
loss.backward()
y_pred = X_val.mm(W1).clamp(min=0).mm(W2)
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
lr=1e-4, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b", np.arange(len(losses_val)), losses_val, "-r")
0 8307.1806640625 2357.994873046875
[<matplotlib.lines.Line2D at 0x7f95ff2ad978>,
<matplotlib.lines.Line2D at 0x7f95ff2adac8>]
source
This implementation uses the nn package from PyTorch to build the network. PyTorch autograd
makes it easy to define computational graphs and take gradients, but raw autograd can be a bit
too low-level for defining complex neural networks; this is where the nn package can help. The
nn package defines a set of Modules, which you can think of as neural network layers that
produce output from input and may have some trainable weights.
import torch
X = torch.from_numpy(X)
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y by passing x to the model. Module␣
˓→objects
# override the __call__ operator so you can call them like functions. When
# doing so you pass a Tensor of input data to the Module and it produces
# a Tensor of output data.
y_pred = model(X)
# Compute and print loss. We pass Tensors containing the predicted and␣
˓→ true
# values of y, and the loss function returns a Tensor containing the
# loss.
loss = loss_fn(y_pred, Y)
# Backward pass: compute gradient of the loss with respect to all the␣
˓→learnable
# parameters of the model. Internally, the parameters of each Module are␣
˓→stored
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
lr=1e-4, nite=50)
0 82.32025146484375 91.3389892578125
10 50.322200775146484 63.563087463378906
20 40.825225830078125 57.13555145263672
30 37.53572082519531 55.74506378173828
40 36.191200256347656 55.499732971191406
[<matplotlib.lines.Line2D at 0x7f95ff296668>,
<matplotlib.lines.Line2D at 0x7f95ff2967b8>]
This implementation uses the nn package from PyTorch to build the network. Rather than man-
ually updating the weights of the model as we have been doing, we use the optim package to
define an Optimizer that will update the weights for us. The optim package defines many op-
timization algorithms that are commonly used for deep learning, including SGD+momentum,
RMSProp, Adam, etc.
import torch
X = torch.from_numpy(X)
Y = torch.from_numpy(Y)
X_val = torch.from_numpy(X_val)
Y_val = torch.from_numpy(Y_val)
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many␣
˓→other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = lr
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(nite):
# Forward pass: compute predicted y by passing x to the model.
y_pred = model(X)
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because by default, gradients are
# accumulated in buffers( i.e, not overwritten) whenever .backward()
with torch.no_grad():
y_pred = model(X_val)
loss_val = loss_fn(y_pred, Y_val)
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
lr=1e-3, nite=50)
plt.plot(np.arange(len(losses_tr)), losses_tr, "-b", np.arange(len(losses_val)), losses_val, "-r")
0 92.271240234375 83.96189880371094
10 64.25907135009766 59.872535705566406
20 47.6252555847168 50.228126525878906
30 40.33802032470703 50.60377502441406
40 38.19448471069336 54.03163528442383
[<matplotlib.lines.Line2D at 0x7f95ff200080>,
<matplotlib.lines.Line2D at 0x7f95ff2001d0>]
%matplotlib inline
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
from torchvision import transforms
from torchvision import datasets
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt
# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device = 'cpu' # Force CPU
print(device)
cpu
Hyperparameters
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))  # Mean and Std of the MNIST dataset
])),
batch_size=batch_size_train, shuffle=True)
val_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=False, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # assumed: same transform as the training set
])),
batch_size=batch_size_test, shuffle=True)
return train_loader, val_loader
So one test data batch is a tensor of shape torch.Size([1000, 1, 28, 28]): 1000 examples of
28x28 pixels in grayscale (i.e. no rgb channels, hence the one). We can plot some of them using
matplotlib, as sketched below.
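A sketch of such a plot, assuming one batch drawn from val_loader (the example_data /
example_targets names match their use further below):
import matplotlib.pyplot as plt

example_data, example_targets = next(iter(val_loader))
plt.figure()
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title("Ground truth: {}".format(example_targets[i]))
    plt.xticks([]); plt.yticks([])
plt.show()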
𝑓 (𝑥) = 𝜎(𝑥𝑇 𝑤)
𝑓 (𝑥) = softmax(𝑥𝑇 𝑊 + 𝑏)
X_train = train_loader.dataset.data.numpy()
#print(X_train.shape)
X_train = X_train.reshape((X_train.shape[0], -1))
y_train = train_loader.dataset.targets.numpy()
X_test = val_loader.dataset.data.numpy()
X_test = X_test.reshape((X_test.shape[0], -1))
y_test = val_loader.dataset.targets.numpy()
print(X_train.shape, y_train.shape)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
l1_plot = plt.subplot(2, 5, i + 1)
l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
l1_plot.set_xticks(())
l1_plot.set_yticks(())
l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')
plt.show()
mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
plt.show()
/home/ed203246/anaconda3/lib/python3.7/site-packages/sklearn/neural_network/_
˓→multilayer_perceptron.py:585: ConvergenceWarning: Stochastic Optimizer: Maximum␣
% self.max_iter, ConvergenceWarning)
class TwoLayerMLP(nn.Module):
# %load train_val_model.py
import numpy as np
import torch
import time
import copy
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
running_loss = 0.0
running_corrects = 0
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
#nsamples = dataloaders[phase].dataset.data.shape[0]
epoch_loss = running_loss / nsamples
epoch_acc = running_corrects.double() / nsamples
losses[phase].append(epoch_loss)
accuracies[phase].append(epoch_acc)
if log_interval is not None and epoch % log_interval == 0:
print('{} Loss: {:.4f} Acc: {:.2f}%'.format(
num_epochs=1, log_interval=1)
print(next(model.parameters()).is_cuda)
torch.save(model.state_dict(), 'models/mod-%s.pth' % model.__class__.__name__)
False
torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
Total number of parameters = 39760
Epoch 0/0
----------
train Loss: 0.4431 Acc: 87.93%
Training complete in 0m 7s
Best val Acc: 91.21%
False
Use the model to make new predictions. Consider the device, i.e. load the data on the device
with example_data.to(device) before prediction, then move them back to cpu with
example_data.cpu().
with torch.no_grad():
output = model(example_data).cpu()
example_data = example_data.cpu()
# print(output.is_cuda)
# Softmax predictions
preds = output.argmax(dim=1)
show_data_label_prediction(data=example_data, y_true=example_targets, y_pred=preds, shape=(3, 4))
err_idx = np.where(errors)[0]
show_data_label_prediction(data=example_data[err_idx], y_true=example_targets[err_
˓→idx],
Continue training from checkpoints: reload the model and run 10 more epochs
num_epochs=10, log_interval=2)
Epoch 0/9
----------
train Loss: 0.3096 Acc: 91.11%
val Loss: 0.2897 Acc: 91.65%
Epoch 2/9
----------
train Loss: 0.2853 Acc: 92.03%
val Loss: 0.2833 Acc: 92.04%
Epoch 4/9
----------
train Loss: 0.2749 Acc: 92.36%
Epoch 6/9
----------
train Loss: 0.2692 Acc: 92.51%
val Loss: 0.2741 Acc: 92.29%
Epoch 8/9
----------
train Loss: 0.2651 Acc: 92.61%
val Loss: 0.2715 Acc: 92.32%
• Define a MultiLayerMLP([D_in, 512, 256, 128, 64, D_out]) class that takes the sizes of
the layers as parameters of the constructor.
• Add some non-linearity with the relu activation function.
class MLP(nn.Module):
self.linears = nn.ModuleList(layer_list)
num_epochs=10, log_interval=2)
Epoch 0/9
----------
train Loss: 1.1216 Acc: 66.19%
val Loss: 0.3347 Acc: 90.71%
Epoch 2/9
----------
train Loss: 0.1744 Acc: 94.94%
val Loss: 0.1461 Acc: 95.52%
Epoch 4/9
----------
train Loss: 0.0979 Acc: 97.14%
val Loss: 0.1089 Acc: 96.49%
Epoch 6/9
----------
train Loss: 0.0635 Acc: 98.16%
val Loss: 0.0795 Acc: 97.68%
Epoch 8/9
----------
train Loss: 0.0422 Acc: 98.77%
val Loss: 0.0796 Acc: 97.54%
Reduce the size of the training dataset by considering only 10 minibatches of size 16.
train_size = 10 * 16
# Stratified sub-sampling
targets = train_loader.dataset.targets.numpy()
nclasses = len(set(targets))
Train size= 160 Train label count= {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
num_epochs=100, log_interval=20)
Epoch 0/99
----------
train Loss: 2.3050 Acc: 10.00%
val Loss: 2.3058 Acc: 8.92%
Epoch 20/99
----------
train Loss: 2.2389 Acc: 42.50%
val Loss: 2.2534 Acc: 29.90%
Epoch 40/99
----------
train Loss: 0.9381 Acc: 83.75%
val Loss: 1.1041 Acc: 68.36%
Epoch 60/99
----------
train Loss: 0.0533 Acc: 100.00%
val Loss: 0.7823 Acc: 76.69%
Epoch 80/99
----------
train Loss: 0.0138 Acc: 100.00%
val Loss: 0.8884 Acc: 76.88%
num_epochs=100, log_interval=20)
Epoch 0/99
----------
train Loss: 2.2706 Acc: 23.75%
val Loss: 2.1079 Acc: 44.98%
Epoch 20/99
----------
train Loss: 0.0012 Acc: 100.00%
val Loss: 1.0338 Acc: 78.23%
Epoch 40/99
----------
train Loss: 0.0003 Acc: 100.00%
val Loss: 1.1383 Acc: 78.24%
Epoch 60/99
----------
train Loss: 0.0002 Acc: 100.00%
val Loss: 1.2075 Acc: 78.17%
Epoch 80/99
----------
train Loss: 0.0001 Acc: 100.00%
val Loss: 1.2571 Acc: 78.26%
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images
per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images.
The test batch contains exactly 1000 randomly-selected images from each class. The training
batches contain the remaining images in random order, but some training batches may contain
more images from one class than another. Between them, the training batches contain exactly
5000 images from each class.
Here are the classes in the dataset, as well as 10 random images from each: - airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck
import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Hyper-parameters
num_epochs = 5
learning_rate = 0.001
# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
train=True,
transform=transform,
download=True)
val_dataset = torchvision.datasets.CIFAR10(root='data/',
train=False,
transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=100,
shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
num_epochs=50, log_interval=10)
---------------------------------------------------------------------------
<ipython-input-36-13724f7cb709> in <module>
----> 1 model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device)
      2 optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
      3 criterion = nn.NLLLoss()
      4
      5 model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
    425
--> 426     return self._apply(convert)
    427
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    226         if should_use_set_data:
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
    422
    423     def convert(t):
--> 424         return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    425
    426     return self._apply(convert)
6.3.1 Outline
2. Architectures
3. Train and test functions
4. CNN models
5. MNIST
6. CIFAR-10
Sources:
Deep learning - cs231n.stanford.edu
CNN - Stanford cs231n
Pytorch - WWW tutorials - github tutorials - github examples
MNIST and pytorch: - MNIST nextjournal.com/gkoehler/pytorch-mnist - MNIST
github/pytorch/examples - MNIST kaggle
6.3.2 Architectures
Sources:
• cv-tricks.com
• zhenye-na.github.io: https://zhenye-na.github.io/2018/12/01/cnn-deep-leearning-ai-week2.html
LeNet
Fig. 1: LeNet
AlexNet
Fig. 2: AlexNet
• Deeper and bigger.
• Featured convolutional layers stacked on top of each other (previously it was common to
have only a single CONV layer, always immediately followed by a POOL layer).
• ReLU (Rectified Linear Unit) for the non-linear part, instead of tanh or sigmoid.
The advantage of ReLU over the sigmoid is that it trains much faster, because the derivative
of the sigmoid becomes very small in its saturating regions, so the weight updates almost
vanish. This is called the vanishing gradient problem.
• Dropout: reduces over-fitting by inserting a Dropout layer after every FC layer. A dropout
layer has a probability p associated with it and is applied to every neuron of the response
map separately: it randomly switches off each activation with probability p.
Why does DropOut work?
Fig. 4: Dropout
The idea behind dropout is similar to model ensembles. Because of the dropout layer, a
different set of neurons is switched off at each pass, and each such subset represents a
different architecture; all these architectures are trained in parallel, with a weight given
to each subset and the weights summing to one. For n neurons attached to dropout, the number
of possible subset architectures is 2^n, so the prediction amounts to an average over this
ensemble of models. This provides a structured model regularization which helps to avoid
over-fitting. Another view of why dropout helps is that, since the active neurons are chosen
at random, they tend not to develop co-adaptations among themselves, which pushes each of
them to learn meaningful features independently of the others.
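As a minimal illustration (not the network described above), nn.Dropout placed after a fully connected layer behaves differently in train and eval modes:

import torch
import torch.nn as nn

fc_block = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.5))

x = torch.randn(4, 256)
fc_block.train()              # dropout active: each activation is zeroed with probability p
out_train = fc_block(x)
fc_block.eval()               # dropout disabled: identity at evaluation time
out_eval = fc_block(x)
print(out_train.shape, out_eval.shape)  # torch.Size([4, 128]) in both cases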
• Data augmentation is carried out to reduce over-fitting: the images are mirrored and
cropped to increase the variation in the training dataset.
GoogLeNet (Szegedy et al., Google, 2014) was a convolutional network whose main contribution
was the development of an
• Inception Module that dramatically reduced the number of parameters in the network
(4M, compared to AlexNet with 60M).
There are also several follow-up versions of GoogLeNet, most recently Inception-v4.
VGGNet (Karen Simonyan and Andrew Zisserman, 2014)
• 16 CONV/FC layers and, appealingly, an extremely homogeneous architecture.
• Only performs 3x3 convolutions and 2x2 pooling from the beginning to the end, replacing
the large kernel-sized filters (11 and 5 in the first and second convolutional layers,
respectively) with stacks of 3x3 filters applied one after another.
For a given receptive field (the effective area of the input image on which an output
depends), stacking several small kernels is better than using a single large kernel: the
multiple non-linear layers increase the depth of the network, which enables it to learn more
complex features, and at a lower cost. For example, three 3x3 convolutions stacked with
stride 1 have a 7x7 receptive field, but involve only 3 x (9 C^2) parameters, compared with
49 C^2 for a single 7x7 kernel, where C is the number of channels.
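A quick check of this parameter count with PyTorch, using an arbitrary channel width C = 64 for the illustration:

import torch.nn as nn

C = 64
stacked = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
                          for _ in range(3)])                  # three stacked 3x3 convolutions
single = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)  # one 7x7 convolution

n_stacked = sum(p.numel() for p in stacked.parameters())  # 3 * 9 * C**2 = 110592
n_single = sum(p.numel() for p in single.parameters())    # 49 * C**2   = 200704
print(n_stacked, n_single)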
Fig. 6: VGGNet
Fig. 9: ResNet 18
%matplotlib inline
import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
import torchvision.transforms as transforms
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = 'cpu' # Force CPU
# %load train_val_model.py
import numpy as np
import torch
import time
import copy
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
running_loss = 0.0
running_corrects = 0
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
#nsamples = dataloaders[phase].dataset.data.shape[0]
epoch_loss = running_loss / nsamples
epoch_acc = running_corrects.double() / nsamples
losses[phase].append(epoch_loss)
accuracies[phase].append(epoch_acc)
if log_interval is not None and epoch % log_interval == 0:
print('{} Loss: {:.4f} Acc: {:.2f}%'.format(
phase, epoch_loss, 100 * epoch_acc))
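The body of train_val_model is only partially visible above. The following sketch of a training/validation loop is consistent with those fragments (restoring the best validation weights at the end, and inferring the device from the model, are assumptions):

import copy
import torch

def train_val_model(model, criterion, optimizer, dataloaders,
                    num_epochs=25, log_interval=None):
    device = next(model.parameters()).device
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    losses = {'train': [], 'val': []}
    accuracies = {'train': [], 'val': []}

    for epoch in range(num_epochs):
        if log_interval is not None and epoch % log_interval == 0:
            print('Epoch {}/{}'.format(epoch, num_epochs - 1))
            print('-' * 10)
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
            else:
                model.eval()
            running_loss, running_corrects, nsamples = 0.0, 0, 0
            for inputs, labels in dataloaders[phase]:
                inputs, labels = inputs.to(device), labels.to(device)
                nsamples += inputs.size(0)
                optimizer.zero_grad()
                # forward; track history only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            epoch_loss = running_loss / nsamples
            epoch_acc = running_corrects.double() / nsamples
            losses[phase].append(epoch_loss)
            accuracies[phase].append(epoch_acc)
            if log_interval is not None and epoch % log_interval == 0:
                print('{} Loss: {:.4f} Acc: {:.2f}%'.format(
                    phase, epoch_loss, 100 * epoch_acc))
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_model_wts)  # load the best validation weights
    return model, losses, accuracies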
LeNet-5
import torch.nn as nn
import torch.nn.functional as F
class LeNet5(nn.Module):
"""
layers: (nb channels in input layer,
nb channels in 1st conv,
nb channels in 2nd conv,
x = x.view(-1, self.layers[3])
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return F.log_softmax(x, dim=1)
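Since the class definition above is split across pages, here is a self-contained sketch of a LeNet-5 style network, consistent with the printed module, the debug output (FC size = 400 on 28x28 MNIST images) and the reported parameter count; the original implementation may differ in details:

import numpy as np
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    """layers: (nb channels in input layer, nb channels in 1st conv, nb channels in 2nd conv,
    nb units entering the 1st FC, nb units of the 1st FC, nb units of the 2nd FC, nb classes)"""
    def __init__(self, layers=(1, 6, 16, 400, 120, 84, 10), debug=False):
        super(LeNet5, self).__init__()
        self.layers = layers
        self.debug = debug
        self.conv1 = nn.Conv2d(layers[0], layers[1], kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(layers[1], layers[2], kernel_size=5)
        self.fc1 = nn.Linear(layers[3], layers[4])
        self.fc2 = nn.Linear(layers[4], layers[5])
        self.fc3 = nn.Linear(layers[5], layers[6])

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        if self.debug:
            print("### DEBUG: Shape of last convnet=", x.shape[1:],
                  ". FC size=", np.prod(x.shape[1:]))
        x = x.view(-1, self.layers[3])
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)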
class MiniVGGNet(torch.nn.Module):
def __init__(self, layers=(1, 16, 32, 1024, 120, 84, 10), debug=False):
super(MiniVGGNet, self).__init__()
self.layers = layers
self.debug = debug
# Conv block 1
        self.conv11 = nn.Conv2d(in_channels=layers[0], out_channels=layers[1], kernel_size=3,
x = F.relu(self.conv21(x))
x = F.relu(self.conv22(x))
x = F.max_pool2d(x, 2)
if self.debug:
print("### DEBUG: Shape of last convnet=", x.shape[1:], ". FC size=",␣
˓→np.prod(x.shape[1:]))
x = x.view(-1, self.layers[3])
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
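The MiniVGGNet definition above is likewise split across pages; a self-contained sketch consistent with the printed module and the debug outputs (FC size = 800 on MNIST, 1152 on CIFAR-10) follows. layers[3] is the FC input size, set after a debug dry run; the original may differ in details:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniVGGNet(torch.nn.Module):
    def __init__(self, layers=(1, 16, 32, 1024, 120, 84, 10), debug=False):
        super(MiniVGGNet, self).__init__()
        self.layers = layers
        self.debug = debug
        # Conv block 1
        self.conv11 = nn.Conv2d(in_channels=layers[0], out_channels=layers[1], kernel_size=3)
        self.conv12 = nn.Conv2d(in_channels=layers[1], out_channels=layers[1], kernel_size=3)
        # Conv block 2
        self.conv21 = nn.Conv2d(in_channels=layers[1], out_channels=layers[2], kernel_size=3)
        self.conv22 = nn.Conv2d(in_channels=layers[2], out_channels=layers[2], kernel_size=3,
                                padding=1)
        # Fully connected layers
        self.fc1 = nn.Linear(layers[3], layers[4])
        self.fc2 = nn.Linear(layers[4], layers[5])
        self.fc3 = nn.Linear(layers[5], layers[6])

    def forward(self, x):
        # Conv block 1
        x = F.relu(self.conv11(x))
        x = F.relu(self.conv12(x))
        x = F.max_pool2d(x, 2)
        # Conv block 2
        x = F.relu(self.conv21(x))
        x = F.relu(self.conv22(x))
        x = F.max_pool2d(x, 2)
        if self.debug:
            print("### DEBUG: Shape of last convnet=", x.shape[1:],
                  ". FC size=", np.prod(x.shape[1:]))
        x = x.view(-1, self.layers[3])
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)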
ResNet-like Model:
# ---------------------------------------------------------------------------- #
# An implementation of https://arxiv.org/pdf/1512.03385.pdf #
# See section 4.2 for the model architecture on CIFAR-10 #
# Some part of the code was referenced from below #
# https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py #
# ---------------------------------------------------------------------------- #
import torch.nn as nn
# 3x3 convolution
def conv3x3(in_channels, out_channels, stride=1):
return nn.Conv2d(in_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
# Residual block
class ResidualBlock(nn.Module):
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(ResidualBlock, self).__init__()
self.conv1 = conv3x3(in_channels, out_channels, stride)
self.bn1 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(out_channels, out_channels)
self.bn2 = nn.BatchNorm2d(out_channels)
self.downsample = downsample
# ResNet
class ResNet(nn.Module):
def __init__(self, block, layers, num_classes=10):
super(ResNet, self).__init__()
self.in_channels = 16
self.conv = conv3x3(3, 16)
self.bn = nn.BatchNorm2d(16)
self.relu = nn.ReLU(inplace=True)
self.layer1 = self.make_layer(block, 16, layers[0])
self.layer2 = self.make_layer(block, 32, layers[1], 2)
self.layer3 = self.make_layer(block, 64, layers[2], 2)
self.avg_pool = nn.AvgPool2d(8)
self.fc = nn.Linear(64, num_classes)
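The forward passes and the make_layer method are not visible above. The following self-contained sketch restates the visible parts and fills in the missing methods following the referenced pytorch/vision example; it returns raw class scores, to be paired with nn.CrossEntropyLoss (the original may instead return log-probabilities for nn.NLLLoss):

import torch.nn as nn

def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3,
                     stride=stride, padding=1, bias=False)

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample:
            residual = self.downsample(x)   # match shapes when stride > 1 or channels change
        return self.relu(out + residual)    # skip connection

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = conv3x3(3, 16)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0])
        self.layer2 = self.make_layer(block, 32, layers[1], 2)
        self.layer3 = self.make_layer(block, 64, layers[2], 2)
        self.avg_pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(64, num_classes)

    def make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        layer_blocks = [block(self.in_channels, out_channels, stride, downsample)]
        self.in_channels = out_channels
        for _ in range(1, blocks):
            layer_blocks.append(block(out_channels, out_channels))
        return nn.Sequential(*layer_blocks)

    def forward(self, x):
        out = self.relu(self.bn(self.conv(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        return self.fc(out)

# e.g. a small ResNet for 32x32 CIFAR-10 images: ResNet(ResidualBlock, [2, 2, 2])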
ResNet9
• DAWNBench on cifar10
• ResNet9: train to 94% CIFAR10 accuracy in 100 seconds
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=batch_size_train, shuffle=True)
test_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=False, transform=transforms.Compose([
transforms.ToTensor(),
LeNet
Dry run in debug mode to get the shape of the last convnet layer.
LeNet5(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 5, 5]) . FC size= 400
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=5, log_interval=2)
torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters = 61706
Epoch 0/4
----------
train Loss: 0.7807 Acc: 75.65%
val Loss: 0.1586 Acc: 94.96%
Epoch 2/4
----------
train Loss: 0.0875 Acc: 97.33%
val Loss: 0.0776 Acc: 97.47%
Epoch 4/4
----------
train Loss: 0.0592 Acc: 98.16%
val Loss: 0.0533 Acc: 98.30%
MiniVGGNet
print(model)
_ = model(data_example)
MiniVGGNet(
(conv11): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 5, 5]) . FC size= 800
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=5, log_interval=2)
torch.Size([16, 1, 3, 3])
torch.Size([16])
torch.Size([16, 16, 3, 3])
torch.Size([16])
torch.Size([32, 16, 3, 3])
torch.Size([32])
torch.Size([32, 32, 3, 3])
torch.Size([32])
torch.Size([120, 800])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters = 123502
Epoch 0/4
----------
train Loss: 1.4180 Acc: 48.27%
val Loss: 0.2277 Acc: 92.68%
Epoch 2/4
----------
train Loss: 0.0838 Acc: 97.41%
val Loss: 0.0587 Acc: 98.14%
Epoch 4/4
----------
train Loss: 0.0495 Acc: 98.43%
val Loss: 0.0407 Acc: 98.63%
Reduce the size of the training dataset to only 10 mini-batches of size 16, as before.
train_size = 10 * 16
# Stratified sub-sampling
targets = train_loader.dataset.targets.numpy()
nclasses = len(set(targets))
Train size= 160 Train label count= {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000, 28, 28])}
LeNet5
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)
Epoch 0/99
----------
train Loss: 2.3086 Acc: 11.88%
val Loss: 2.3068 Acc: 14.12%
Epoch 20/99
----------
train Loss: 0.8060 Acc: 76.25%
val Loss: 0.8522 Acc: 72.84%
Epoch 40/99
----------
train Loss: 0.0596 Acc: 99.38%
val Loss: 0.6188 Acc: 82.67%
Epoch 60/99
----------
train Loss: 0.0072 Acc: 100.00%
val Loss: 0.6888 Acc: 83.08%
Epoch 80/99
----------
train Loss: 0.0033 Acc: 100.00%
val Loss: 0.7546 Acc: 82.96%
MiniVGGNet
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=100, log_interval=20)
Epoch 0/99
----------
train Loss: 2.3040 Acc: 10.00%
val Loss: 2.3025 Acc: 10.32%
Epoch 20/99
----------
train Loss: 2.2963 Acc: 10.00%
val Loss: 2.2969 Acc: 10.35%
Epoch 40/99
----------
train Loss: 2.1158 Acc: 37.50%
val Loss: 2.0764 Acc: 38.06%
Epoch 60/99
----------
train Loss: 0.0875 Acc: 97.50%
val Loss: 0.7315 Acc: 80.50%
Epoch 80/99
----------
train Loss: 0.0023 Acc: 100.00%
val Loss: 1.0397 Acc: 81.69%
import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Hyper-parameters
num_epochs = 5
learning_rate = 0.001
# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
train=True,
transform=transform,
download=True)
val_dataset = torchvision.datasets.CIFAR10(root='data/',
train=False,
transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=100,
shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
batch_size=100,
shuffle=False)
LeNet
LeNet5(
(conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 6, 6]) . FC size= 576
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
torch.Size([6, 3, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 576])
torch.Size([120])
torch.Size([84, 120])
Epoch 5/24
----------
train Loss: 2.2991 Acc: 11.18%
val Loss: 2.2983 Acc: 11.00%
Epoch 10/24
----------
train Loss: 2.2860 Acc: 10.36%
val Loss: 2.2823 Acc: 10.60%
Epoch 15/24
----------
train Loss: 2.1759 Acc: 18.83%
val Loss: 2.1351 Acc: 20.74%
Epoch 20/24
----------
train Loss: 2.0159 Acc: 25.35%
val Loss: 1.9878 Acc: 26.90%
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
Epoch 0/24
----------
train Loss: 2.0963 Acc: 21.65%
val Loss: 1.8211 Acc: 33.49%
Epoch 5/24
----------
train Loss: 1.3500 Acc: 51.34%
val Loss: 1.2278 Acc: 56.40%
Epoch 10/24
----------
train Loss: 1.1569 Acc: 58.79%
val Loss: 1.0933 Acc: 60.95%
Epoch 15/24
----------
train Loss: 1.0724 Acc: 62.12%
val Loss: 0.9863 Acc: 65.34%
Epoch 20/24
----------
train Loss: 1.0131 Acc: 64.41%
val Loss: 0.9720 Acc: 66.14%
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
Epoch 0/24
----------
train Loss: 1.8411 Acc: 30.21%
val Loss: 1.5768 Acc: 41.22%
Epoch 5/24
----------
train Loss: 1.3185 Acc: 52.17%
val Loss: 1.2181 Acc: 55.71%
Epoch 10/24
----------
train Loss: 1.1724 Acc: 57.89%
val Loss: 1.1244 Acc: 59.17%
Epoch 15/24
----------
train Loss: 1.0987 Acc: 60.98%
val Loss: 1.0153 Acc: 63.82%
Epoch 20/24
----------
train Loss: 1.0355 Acc: 63.01%
val Loss: 0.9901 Acc: 64.90%
MiniVGGNet
MiniVGGNet(
(conv11): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 6, 6]) . FC size= 1152
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
Epoch 0/24
----------
train Loss: 2.3027 Acc: 10.14%
val Loss: 2.3010 Acc: 10.00%
Epoch 5/24
----------
train Loss: 1.4829 Acc: 46.08%
val Loss: 1.3860 Acc: 50.39%
Epoch 10/24
----------
train Loss: 1.0899 Acc: 61.43%
val Loss: 1.0121 Acc: 64.59%
Epoch 15/24
----------
train Loss: 0.8825 Acc: 69.02%
val Loss: 0.7788 Acc: 72.73%
Epoch 20/24
----------
train Loss: 0.7805 Acc: 72.73%
val Loss: 0.7222 Acc: 74.72%
Adam
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
Epoch 0/24
----------
train Loss: 1.8591 Acc: 30.74%
val Loss: 1.5424 Acc: 43.46%
Epoch 5/24
----------
train Loss: 1.1562 Acc: 58.46%
val Loss: 1.0811 Acc: 61.87%
Epoch 10/24
----------
train Loss: 0.9630 Acc: 65.69%
val Loss: 0.8669 Acc: 68.94%
Epoch 15/24
----------
train Loss: 0.8634 Acc: 69.38%
val Loss: 0.7933 Acc: 72.33%
Epoch 20/24
----------
train Loss: 0.8033 Acc: 71.75%
val Loss: 0.7737 Acc: 73.57%
ResNet
model, losses, accuracies = train_val_model(model, criterion, optimizer, dataloaders,
                                            num_epochs=25, log_interval=5)
Epoch 0/24
----------
train Loss: 1.4169 Acc: 48.11%
val Loss: 1.5213 Acc: 48.08%
Epoch 5/24
----------
Epoch 10/24
----------
train Loss: 0.4772 Acc: 83.57%
val Loss: 0.5314 Acc: 82.09%
Epoch 15/24
----------
train Loss: 0.4010 Acc: 86.09%
val Loss: 0.6457 Acc: 79.03%
Epoch 20/24
----------
train Loss: 0.3435 Acc: 88.07%
val Loss: 0.4887 Acc: 84.34%
Sources:
• cs231n @ Stanford
• Sasank Chilamkurthy
Quote cs231n @ Stanford:
In practice, very few people train an entire Convolutional Network from scratch (with random
initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is
common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2
million images with 1000 categories), and then use the ConvNet either as an initialization or a
fixed feature extractor for the task of interest.
These two major transfer learning scenarios look as follows:
• ConvNet as fixed feature extractor:
– Take a ConvNet pretrained on ImageNet,
– Remove the last fully-connected layer (this layer’s outputs are the 1000 class scores
for a different task like ImageNet)
– Treat the rest of the ConvNet as a fixed feature extractor for the new dataset.
In practice:
– Freeze the weights for all of the network except that of the final fully connected layer.
This last fully connected layer is replaced with a new one with random weights and
only this layer is trained.
• Finetuning the ConvNet:
Fine-tune the weights of the pretrained network by continuing the backpropagation; it is
possible to fine-tune all the layers of the ConvNet. Instead of random initialization, we
initialize the network with a pretrained network, such as one trained on the ImageNet
1000-class dataset. The rest of the training looks as usual.
%matplotlib inline
import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
import torchvision.transforms as transforms
from torchvision import models
#
from pathlib import Path
import matplotlib.pyplot as plt
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# %load train_val_model.py
import numpy as np
import torch
import time
import copy
since = time.time()
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
running_loss = 0.0
running_corrects = 0
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
#nsamples = dataloaders[phase].dataset.data.shape[0]
epoch_loss = running_loss / nsamples
epoch_acc = running_corrects.double() / nsamples
losses[phase].append(epoch_loss)
accuracies[phase].append(epoch_acc)
if log_interval is not None and epoch % log_interval == 0:
print('{} Loss: {:.4f} Acc: {:.4f}'.format(
phase, epoch_loss, epoch_acc))
# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
train=True,
transform=transform,
download=True)
test_dataset = torchvision.datasets.CIFAR10(root='data/',
train=False,
transform=transforms.ToTensor())
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=100,
shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=test_dataset,
batch_size=100,
shuffle=False)
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# Here the size of each output sample is set to 10.
model_ft.fc = nn.Linear(num_ftrs, D_out)
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
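# NOTE: the optimizer and the call to train_val_model are missing from the text
# above. A typical completion (the choice of SGD with momentum is an assumption;
# `dataloaders` is assumed to be the dict of train/val loaders built above):
optimizer_ft = torch.optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
model, losses, accuracies = train_val_model(model_ft, criterion, optimizer_ft, dataloaders,
                                            num_epochs=25, log_interval=5)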
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')
Epoch 0/24
----------
train Loss: 1.2476 Acc: 0.5593
val Loss: 0.9043 Acc: 0.6818
Epoch 5/24
----------
train Loss: 0.5791 Acc: 0.7978
val Loss: 0.5725 Acc: 0.8035
Epoch 15/24
----------
train Loss: 0.4581 Acc: 0.8388
val Loss: 0.5220 Acc: 0.8226
Epoch 20/24
----------
train Loss: 0.4575 Acc: 0.8394
val Loss: 0.5218 Acc: 0.8236
Adam optimizer
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# Here the size of each output sample is set to 10.
model_ft.fc = nn.Linear(num_ftrs, D_out)
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
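# NOTE: the optimizer (Adam, per the heading above) and the training call are
# missing from the text; a sketch (lr=0.001 is an assumption):
optimizer_ft = torch.optim.Adam(model_ft.parameters(), lr=0.001)
model, losses, accuracies = train_val_model(model_ft, criterion, optimizer_ft, dataloaders,
                                            num_epochs=25, log_interval=5)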
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')
Epoch 0/24
----------
train Loss: 1.0622 Acc: 0.6341
val Loss: 0.8539 Acc: 0.7066
Epoch 5/24
----------
train Loss: 0.5674 Acc: 0.8073
val Loss: 0.5792 Acc: 0.8019
Epoch 10/24
----------
train Loss: 0.3416 Acc: 0.8803
val Loss: 0.4313 Acc: 0.8577
Epoch 15/24
----------
train Loss: 0.2898 Acc: 0.8980
val Loss: 0.4491 Acc: 0.8608
Epoch 20/24
----------
train Loss: 0.2792 Acc: 0.9014
val Loss: 0.4352 Acc: 0.8631
Freeze all of the network except the final layer: set requires_grad = False on the parameters
so that the gradients are not computed in backward().
model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
param.requires_grad = False
model_conv = model_conv.to(device)
criterion = nn.CrossEntropyLoss()
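# NOTE: the replacement of the final layer, the optimizer, and the training call
# are missing from the text above. A typical completion (assumption): replace the
# last FC layer (its fresh parameters have requires_grad=True) and optimize only
# those parameters.
model_conv.fc = nn.Linear(model_conv.fc.in_features, D_out).to(device)
optimizer_conv = torch.optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)
model, losses, accuracies = train_val_model(model_conv, criterion, optimizer_conv, dataloaders,
                                            num_epochs=25, log_interval=5)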
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')
Epoch 0/24
Epoch 5/24
----------
train Loss: 1.6686 Acc: 0.4170
val Loss: 1.6981 Acc: 0.4146
Epoch 10/24
----------
train Loss: 1.6462 Acc: 0.4267
val Loss: 1.6768 Acc: 0.4210
Epoch 15/24
----------
train Loss: 1.6388 Acc: 0.4296
val Loss: 1.6752 Acc: 0.4226
Epoch 20/24
----------
train Loss: 1.6368 Acc: 0.4325
val Loss: 1.6720 Acc: 0.4240
Adam optimizer
model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False
model_conv = model_conv.to(device)
criterion = nn.CrossEntropyLoss()
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['val'], '--r')
(This cell raised an error in the source run: the call to train_val_model(model_conv, criterion, optimizer_conv, ...) failed.)