Statistics Machine Learning Python
Statistics Machine Learning Python
Python
Release 0.8
2 Python language 13
2.1 Import libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Basic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Execution control statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 List comprehensions, iterators, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Regular expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 System programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Scripts and argument parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.10 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.11 Object Oriented Programming (OOP) . . . . . . . . . . . . . . . . . . . . . . . . 44
2.12 Style guide for Python programming . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.13 Documenting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.14 Modules and packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.15 Unit testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.16 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Statistics 137
5.1 Univariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.2 Hands-On: Brain volumes study . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
i
5.3 Linear Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.4 Multivariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.5 Resampling and Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . 228
ii
Statistics and Machine Learning in Python, Release 0.8
• Github
• Latest pdf
• Official deposit for citation.
• Web page
TABLE OF CONTENTS 1
Statistics and Machine Learning in Python, Release 0.8
2 TABLE OF CONTENTS
CHAPTER
ONE
3
Statistics and Machine Learning in Python, Release 0.8
a = 1
b = 2
print("Hello world")
2. Run with python interpreter. On the dos/unix command line execute wholes file:
python file.py
Interactive mode
1. python interpreter:
python
ipython
import numpy as np
X = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
np.dot(X, v)
X - X.mean(axis=0)
Scipy: General scientific libraries with advanced matrix operation and solver
import scipy
import scipy.linalg
scipy.linalg.svd(X, full_matrices=False)
import pandas as pd
data = pandas.read_excel("datasets/iris.xls")
print(data.head())
Out[8]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
import numpy as np
import matplotlib.pyplot as plt
#%matplotlib qt
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()
A typical Python development environment consists of several key components, which work
together to facilitate coding, testing, and debugging. Here are the main components:
1. Python Interpreter. The core of any Python development environment is the Python inter-
preter (e.g., Python 3.x). It runs Python code and converts it into machine-readable form.
You can download it from python.org.
2. Text Editor or Integrated Development Environment (IDE) or jupyter-notebook.
• Text Editors: Lightweight editors like Sublime Text, Atom, or VS Code offer basic text
editing with syntax highlighting and extensions for Python development.
• IDEs: Full-featured IDEs like PyCharm, VS Code (with Python extensions), or Spy-
der offer advanced features like code completion, debugging, project management,
version control, and testing integrations.
3. Package Manager & Dependency Management
• pip: The default Python package manager, which allows you to install, upgrade, and
manage external libraries and dependencies.
• Conda: An alternative package and environment manager, often used in data science
for managing dependencies and virtual environments.
• Pixi is a fast software package manager built on top of the existing conda ecosystem.
• Conda & Pixi: provide the Python Interpreter.
4. Virtual Environment Manager
• Virtual environments allow you to create isolated environments for different projects,
preventing conflicts between different project dependencies. Tools include:
• venv (python module): Built-in module to create virtual environments.
• virtualenv: Another popular tool for creating isolated environments.
• Conda & Pixi: Manages both packages and environments.
5. Version Control System
• Git: Essential for source control, collaboration, and version management. Platforms
like GitHub, GitLab, and Bitbucket integrate Git for remote repository management.
IDEs often have built-in Git support or plugins that make using Git seamless.
6. Debugger
• Python has a built-in debugger called pdb.
• Most IDEs, like PyCharm or VS Code, offer graphical debugging tools with features
like breakpoints, variable inspection, and step-through execution.
7. Testing Framework
• Tools like unittest (built-in), pytest, or nose2 help automate testing and ensure code
quality.
• IDEs often integrate testing frameworks to run and debug tests efficiently.
8. Documentation Tools
• Tools like Sphinx or pdoc help generate documentation from your code, making it
easier for other developers (and your future self) to understand.
9. Containers (Optional)
• Docker: Used to create isolated, reproducible development environments and ensure
consistency between development and production environments.
Pixi is a modern package management tool designed to enhance the experience of managing
Python environments particularly for data science and machine learning workflows. It aims
to improve upon the existing tools like Conda by offering faster and more efficient package
management:
• 7 Reasons to Switch from Conda to Pixi.
• Transitioning from the conda or mamba to pixi
• Tutorial for python.
Installation
Linux & macOS
Windows
Uninstall
Creating an Environment, then add python, and packages
pixi install
pixi shell
pixi list
Deactivating an environment
exit
Install/uninstall a package
Anaconda is a python distribution that ships most of python tools and libraries.
Installation
1. Download anaconda
2. Install it, on Linux
bash Anaconda3-2.4.1-Linux-x86_64.sh
3. Add anaconda path in your PATH variable (For Linux in your .bashrc file), example:
export PATH="${HOME}/anaconda3/bin:$PATH"
Conda environments
• A Conda environments contains a specific collection of conda packages that you have
installed.
• Control packages environment for a specific purpose: collaborating with someone else,
delivering an application to your client,
• Switch between environments
Creating an environment. Example, environment_student.yml:
name: pystatsml
channels:
- conda-forge
dependencies:
- ipython
- scipy
- numpy
- pandas>=2.0.3
- jupyter
- matplotlib
- scikit-learn>=1.3.0
(continues on next page)
Updating an environment (additional or better package, remove packages). Update the contents
of your environment.yml file accordingly and then run the following command:
List all packages or search for a specific package in the current environment:
conda list
conda list numpy
Delete an environment:
Miniconda
Anaconda without the collection of (>700) packages. With Miniconda you download only the
packages you want with the conda command: conda install PACKAGENAME
1. Download Miniconda
2. Install it, on Linux:
bash Miniconda3-latest-Linux-x86_64.sh
export PATH=${HOME}/miniconda3/bin:$PATH
1.4.3 Pip
Example:
Integrated Development Environment (IDE) are software development environment that pro-
vide:
• Source-code editor (auto-completion, etc.).
• Execution facilities (interactive, etc.).
• Debugger.
Setup
• Installation.
• Tuto for Linux.pen the Command Palette (Ctrl+Shift+P)
• Useful settings for python: VS Code for python
• Extensions for data-science in python: Python, Jupyter, Python Extension Pack, Python
Pylance, Path Intellisense
Set Python environment: Open the Command Palette (Ctrl+Shift+P) search >Python: Select
interpreter.
Execution, three possibilities:
1. Run Python file
2. Interactive execution in python interpreter, type: Shift/Enter
1.5.2 Spyder
JupyterLab allows data scientists to create and share document, ie, Jupyter Notebook. A Note-
book is that is a document .ipynb including:
• Python code, text, figures (plots), equations, and other multimedia resources.
• The Notebook allows interactive execution of blocs of codes or text.
• Notebook is edited using a Web browsers and it is executed by (possibly remote) IPython
kernel.
jupyter notebook
New/kernel
Advantages:
• Rapid and one shot data analysis
• Share all-in-one data analysis documents: inluding code, text and figures
Drawbacks (source):
• Difficult to maintain and keep in sync when collaboratively working on code.
• Difficult to operationalize your code when using Jupyter notebooks as they don’t feature
any built-in integration or tools for operationalizing your machine learning models.
• Difficult to scale: Jupyter notebooks are designed for single-node data science. If your
data is too big to fit in your computer’s memory, using Jupyter notebooks becomes signif-
icantly more difficult.
TWO
PYTHON LANGUAGE
import math
math.sqrt(25)
5.0
import a function
5.0
import nltk
import numpy as np
np.sqrt(9)
np.float64(3.0)
content = dir(math)
13
Statistics and Machine Learning in Python, Release 0.8
Numbers
Boolean operations
comparisons (these return True)
5 > 3
5 >= 3
5 != 3
5 == 5
True
True
True
float(2)
int(2.9)
str(2.9)
'2.9'
bool(0)
bool(None)
bool('') # empty string
bool([]) # empty list
bool({}) # empty dictionary
False
bool(2)
bool('two')
bool([2])
True
2.3.1 Lists
Different objects categorized along a certain ordered sequence, lists are ordered, iterable, mu-
table (adding or removing objects changes the list size), can contain multiple data types.
Creation
Empty list (two ways)
empty_list = []
empty_list = list()
Examine a list
Insert
simpsons.insert(0, 'maggie')
# searches for first instance and removes it
Remove
simpsons.remove('bart')
simpsons.pop(0) # removes element 0 and returns it
# removes element 0 (does not return it)
del simpsons[0]
simpsons[0] = 'krusty' # replace element 0
Replicate
'lisa' in simpsons
simpsons.count('lisa') # counts the number of instances
simpsons.index('itchy') # returns index of first instance
Reverse list
Sort list
Sort a list in place (modifies but does not return the list)
simpsons.sort()
simpsons.sort(reverse=True) # sort in reverse
simpsons.sort(key=len) # sort by a key
Return a sorted list (but does not modify the original list)
sorted(simpsons)
sorted(simpsons, reverse=True)
sorted(simpsons, key=len)
2.3.2 Tuples
Like lists, but their size cannot change: ordered, iterable, immutable, can contain multiple data
types
# create a tuple
digits = (0, 1, 'two') # create a tuple directly
digits = tuple([0, 1, 'two']) # create a tuple from a list
# trailing comma is required to indicate it's a tuple
zero = (0,)
# examine a tuple
digits[2] # returns 'two'
len(digits) # returns 3
digits.count(0) # counts the number of instances of that value (1)
digits.index(1) # returns the index of the first instance of that value (1)
# concatenate tuples
digits = digits + (3, 4)
# create a single tuple with elements repeated (also works with lists)
(3, 4) * 2 # returns (3, 4, 3, 4)
# tuple unpacking
bart = ('male', 10, 'simpson') # create a tuple
2.3.3 Strings
# create a string
s = str(42) # convert another data type into a string
s = 'I like you'
# examine a string
s[0] # returns 'I'
len(s) # returns 10
# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4 # returns 'The meaning of life is 42'
s3 + ' ' + str(42) # same thing
Strings formatting
# String formatting
# See: https://realpython.com/python-formatted-output/
# Old method
print('6 %s' % 'bananas')
print('%d %s cost $%.1f' % (6, 'bananas', 3.14159))
6 bananas
6 bananas cost $3.1
6 bananas cost $3.1
Strings encoding
Normal strings allow for escaped characters. The default strings use unicode string (u string)
first line
second line
first line
second line
True
Sequence of bytes are not strings, should be decoded before some operations
2.3.4 Dictionaries
Dictionary is the must-known data structure. Dictionaries are structures which can contain
multiple data types, and is ordered with key-value pairs: for each (unique) key, the dictionary
outputs one value. Keys can be strings, numbers, or tuples, while the corresponding values can
be any Python object. Dictionaries are: unordered, iterable, mutable
Creation
print(simpsons_roles_dict)
Access
# examine a dictionary
simpsons_roles_dict['Homer'] # 'father'
len(simpsons_roles_dict) # 5
simpsons_roles_dict.keys() # list: ['Homer', 'Marge', ...]
simpsons_roles_dict.values() # list:['father', 'mother', ...]
simpsons_roles_dict.items() # list of tuples: [('Homer', 'father') ...]
'Homer' in simpsons_roles_dict # returns True
'John' in simpsons_roles_dict # returns False (only checks keys)
try:
simpsons_roles_dict['John'] # throws an error
except KeyError as e:
print("Error", e)
simpsons_roles_dict.get('John') # None
# returns 'not found' (the default)
simpsons_roles_dict.get('John', 'not found')
Error 'John'
'not found'
l = list()
for n in inter:
l.append([n, simpsons_ages_dict[n], simpsons_roles_dict[n]])
String substitution using a dictionary: syntax %(key)format, where format is the formatting
character e.g. s for string.
2.3.5 Sets
Like dictionaries, but with unique keys only (no corresponding values). They are: unordered, it-
erable, mutable, can contain multiple data types made up of unique elements (strings, numbers,
or tuples)
Creation
# create a set
languages = {'python', 'r', 'java'} # create a set directly
snakes = set(['cobra', 'viper', 'python']) # create a set from a list
Examine a set
len(languages) # 3
'python' in languages # True
True
Set operations
try:
languages.remove('c') # remove a non-existing element: throws an error
except KeyError as e:
print("Error", e)
Error 'c'
[0, 1, 2, 9]
if statement
x = 3
if x > 0:
print('positive')
positive
if/else statement
if x > 0:
print('positive')
else:
print('zero or negative')
positive
positive
if/elif/else statement
if x > 0:
print('positive')
elif x == 0:
print('zero')
else:
print('negative')
positive
2.4.2 Loops
Loops are a set of instructions which repeat until termination conditions are met. This can
include iterating through all values in an object, go through a range of values, etc
range(0, 5, 2)
APPLE
BANANA
CHERRY
for i in range(len(fruits)):
print(fruits[i].lower())
apple
banana
cherry
0 APPLE
1 BANANA
2 CHERRY
List comprehensions provides an elegant syntax for the most common processing pattern:
1. iterate over a list,
2. apply some operation
3. store the result in a new list
Classical iteration over a list
nums = [1, 2, 3, 4, 5]
cubes = []
for num in nums:
cubes.append(num ** 3)
Classical iteration over a list with if condition: create a list of cubes of even numbers
cubes_of_even = []
for num in nums:
if num % 2 == 0:
cubes_of_even.append(num**3)
Classical iteration over a list with if else condition: for loop to cube even numbers and square
odd numbers
cubes_and_squares = []
for num in nums:
if num % 2 == 0:
cubes_and_squares.append(num**3)
else:
cubes_and_squares.append(num**2)
Equivalent list comprehension (using a ternary expression) for loop to cube even numbers
and square odd numbers syntax: [true_condition if condition else false_condition for
variable in iterable]
print(items)
[1, 2, 3, 4]
{5, 6}
Combine two dictionaries sharing key. Example, a function that joins two dictionaries (inter-
secting keys) into a dictionary of lists
import itertools
[['a', 1], ['a', 2], ['b', 1], ['b', 2], ['c', 1], ['c', 2]]
2.5.5 Example, use loop, dictionary and set to count words in a sentence
quote = """Tick-tow
our incomes are like our shoes; if too small they gall and pinch us
but if too large they cause us to stumble and to trip
"""
words = quote.split()
len(words)
print(count)
import numpy as np
freq_veq = np.array(list(count.values())) / len(words)
key = 'c'
try:
dct[key]
except:
print("Key %s is missing. Add it with empty value" % key)
dct['c'] = []
print(dct)
2.6 Functions
Functions are sets of instructions launched when called upon, they can have multiple input
values and a return value
Function with no arguments and no return values
def print_text():
print('this is text')
this is text
def print_this(x):
print(x)
3
3
None
Dynamic typing
Important remarque: Python is a dynamically typed language, meaning that the Python in-
terpreter does type checking at runtime (as opposed to compiled language that are statically
typed). As a consequence, the function behavior, decided, at execution time, will be different
and specific to parameters type. Python function are polymorphic.
Default arguments
4 8
Docstring to describe the effect of a function IDE, ipython (type: ?power_this) to provide
function documentation.
Args:
x (float): the number
power (int, optional): the power. Defaults to 2.
"""
return x ** power
def min_max(nums):
return min(nums), max(nums)
# return values can be assigned into multiple variables using tuple unpacking
min_num, max_num = min_max([1, 2, 3]) # min_num = 1, max_num = 3
References are used to access objects in memory, here lists. A single object may have multiple
references. Modifying the content of the one reference will change the content of all other
references.
Modify a a reference of a list
num = [1, 2, 3]
same_num = num # create a second reference to the same list
same_num[0] = 0 # modifies both 'num' and 'same_num'
print(num, same_num)
[0, 2, 3] [0, 2, 3]
Copies are references to different objects. Modifying the content of the one reference, will not
affect the others.
2.6. Functions 29
Statistics and Machine Learning in Python, Release 0.8
new_num = num.copy()
new_num = num[:]
new_num = list(num)
new_num[0] = -1 # modifies 'new_num' but not 'num'
print(num, new_num)
[0, 2, 3] [-1, 2, 3]
Examine objects
False
Functions’ arguments are references to objects. Thus functions can modify their arguments with
possible side effect.
l = [0, 1, 2]
change(x=l, index=1, newval=33)
print(l)
[0, 33, 2]
print("Roles:", simpsons_roles_dict)
print("Ages:", simpsons_ages_dict)
(continues on next page)
Regular Expression (RE, or RegEx) allow to search and patterns in strings. See this page for the
syntax of the RE patterns.
import re
Usual patterns
• . period symbol matches any single character (except newline ‘n’).
• pattern``+`` plus symbol matches one or more occurrences of the pattern.
• [] square brackets specifies a set of characters you wish to match
• [abc] matches a, b or c
• [a-c] matches a to z
• [0-9] matches 0 to 9
• [a-zA-Z0-9]+ matches words, at least one alphanumeric character (digits and alphabets)
• [\w]+ matches words, at least one alphanumeric character including underscore.
• \s Matches where a string contains any whitespace character, equivalent to [ tnrfv].
• [^\s] Caret ^ symbol (the start of a square-bracket) inverts the pattern selection .
# regex = re.compile("^.+(firstname:.+)_(lastname:.+)_(mod-.+)")
# regex = re.compile("(firstname:.+)_(lastname:.+)_(mod-.+)")
Compile (re.compile(string)) regular expression with a pattern that captures the pattern
firstname:<subject_id>_lastname:<session_id>
pattern = re.compile("firstname:[\w]+_lastname:[\w]+")
/home/ed203246/git/pystatsml/python_lang/python_lang.py:936: SyntaxWarning:␣
˓→invalid escape sequence '\w'
pattern = re.compile("firstname:[\w]+_lastname:[\w]+")
Match (re.match(string)) to be used in test, loop, etc. Determine if the RE matches at the
beginning of the string.
Match (re.search(string)) to be used in test, loop, etc. Determine if the RE matches at any
location in the string.
Find (re.findall(string)) all substrings where the RE matches, and returns them as a list.
# Find words
print(re.compile("[a-zA-Z0-9]+").findall("firstname:John_lastname:Doe"))
/home/ed203246/git/pystatsml/python_lang/python_lang.py:963: SyntaxWarning:␣
˓→invalid escape sequence '\w'
pattern = re.compile("firstname:[\w]+_lastname:[\w]+")
/home/ed203246/git/pystatsml/python_lang/python_lang.py:970: SyntaxWarning:␣
˓→invalid escape sequence '\w'
print(re.compile("[\w]+").findall("firstname:John_lastname:Doe"))
['firstname:John_lastname:Doe']
['firstname', 'John', 'lastname', 'Doe']
['firstname', 'John_lastname', 'Doe']
Extract specific parts of the RE: use parenthesis (part of pattern to be matched) Extract John
and Doe, such as John is suffixed with firstname: and Doe is suffixed with lastname:
pattern = re.compile("firstname:([\w]+)_lastname:([\w]+)")
print(pattern.findall("firstname:John_lastname:Doe \
firstname:Bart_lastname:Simpson"))
/home/ed203246/git/pystatsml/python_lang/python_lang.py:978: SyntaxWarning:␣
˓→invalid escape sequence '\w'
Split (re.split(string)) splits the string where there is a match and returns a list of strings
where the splits have occurred. Example, match any non alphanumeric character (digits and
alphabets) [^a-zA-Z0-9] to split the string.
print(re.compile("[^a-zA-Z0-9]").split("firstname:John_lastname:Doe"))
/home/ed203246/git/pystatsml/python_lang/python_lang.py:995: SyntaxWarning:␣
˓→invalid escape sequence '\s'
/home/ed203246/git/pystatsml/python_lang/python_lang.py:1001: SyntaxWarning:␣
˓→invalid escape sequence '\s'
'Hello World'
import os
/home/ed203246/git/pystatsml/python_lang
Temporary directory
import tempfile
tmpdir = tempfile.gettempdir()
print(tmpdir)
/tmp
Join paths
Create a directory
# list containing the names of the entries in the directory given by path.
os.listdir(mytmpdir)
['myfile.txt', 'plop']
/tmp/foobar/myfile.txt
fd = open(filename, "w")
fd.write(lines[0] + "\n")
fd.write(lines[1] + "\n")
fd.close()
Read read one line at a time (entire file does not have to fit into memory)
f = open(filename, "r")
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()
# use list comprehension to duplicate readlines without reading entire file at␣
˓→once
f = open(filename, 'r')
[line for line in f]
f.close()
WD = os.path.join(tmpdir, "foobar")
import glob
filenames = glob.glob(os.path.join(tmpdir, "*", "*.txt"))
print(filenames)
def split_filename_inparts(filename):
dirname_ = os.path.dirname(filename)
filename_noext_, ext_ = os.path.splitext(filename)
basename_ = os.path.basename(filename_noext_)
return dirname_, basename_, ext_
(continues on next page)
import shutil
Copy
try:
print("Copy tree %s under %s" % (src, dst))
# Note that by default (dirs_exist_ok=True), meaning that copy will fail
# if destination exists.
shutil.copytree(src, dst, dirs_exist_ok=True)
For more advanced use cases, the underlying Popen interface can be used directly.
import subprocess
subprocess.run([command, args*])
• Run the command described by args.
• Wait for command to complete
• return a CompletedProcess instance.
• Does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or
stderr arguments.
p = subprocess.run(["ls", "-l"])
print(p.returncode)
Capture output
out = subprocess.run(
["ls", "-a", "/"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# out.stdout is a sequence of bytes that should be decoded into a utf-8 string
print(out.stdout.decode('utf-8').split("\n")[:5])
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
# Toy dataset
X, y = make_classification(n_features=1000, n_samples=5000, n_informative=20,
random_state=1, n_clusters_per_class=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8,
random_state=42)
Random forest algorithm: (i) In parallel, fit decision trees on bootstrapped data samples. Make
predictions. (ii) Majority vote on predictions
1. In parallel, fit decision trees on bootstrapped data sample. Make predictions.
for i in range(5):
y_test_boot = boot_decision_tree(X_train, X_test, y_train)
print("%.2f" % balanced_accuracy_score(y_test, y_test_boot))
0.66
0.61
0.58
0.65
0.61
def vote(predictions):
maj = np.apply_along_axis(
lambda x: np.argmax(np.bincount(x)),
axis=1,
arr=predictions
)
return maj
Sequential execution
Sequentially fit decision tree on bootstrapped samples, then apply majority vote
nboot = 2
start = time.time()
y_test_boot = np.dstack([boot_decision_tree(X_train, X_test, y_train)
for i in range(nboot)]).squeeze()
y_test_vote = vote(y_test_boot)
print("Balanced Accuracy: %.2f" % balanced_accuracy_score(y_test, y_test_vote))
print("Sequential execution, elapsed time:", time.time() - start)
Multithreading
predictions_list = list()
thread1 = Thread(target=boot_decision_tree,
args=(X_train, X_test, y_train, predictions_list))
thread2 = Thread(target=boot_decision_tree,
args=(X_train, X_test, y_train, predictions_list))
Multiprocessing
Concurrent (parallel) execution of the function with processes (jobs) executed in different ad-
dress (memory) space. Process-based parallelism
Process() for parallel execution and Manager() for data sharing
Sharing data between process with Managers Therefore, sharing data requires specific mech-
anism using . Managers provide a way to create data which can be shared between different
processes, including sharing over a network between processes running on different machines.
A manager object controls a server process which manages shared objects.
predictions_list = Manager().list()
p1 = Process(target=boot_decision_tree,
args=(X_train, X_test, y_train, predictions_list))
p2 = Process(target=boot_decision_tree,
args=(X_train, X_test, y_train, predictions_list))
Pool() of workers (processes or Jobs) for concurrent (parallel) execution of multiples tasks.
Pool can be used when N independent tasks need to be executed in parallel, when there are
more tasks than cores on the computer.
1. Initialize a Pool(), map(), apply_async(), of P workers (Process, or Jobs), where P <
number of cores in the computer. Use cpu_count to get the number of logical cores in the
current system, See: Number of CPUs and Cores in Python.
2. Map N tasks to the P workers, here we use the function Pool.apply_async() that runs the
jobs asynchronously. Asynchronous means that calling pool.apply_async does not block
the execution of the caller that carry on, i.e., it returns immediately with a AsyncResult
object for the task.
that the caller (than runs the sub-processes) is not blocked by the to the process pool does not
block, allowing the caller that issued the task to carry on.# 3. Wait for all jobs to complete
pool.join() 4. Collect the results
pool = Pool(njobs)
# Run multiple tasks each with multiple arguments
async_results = [pool.apply_async(boot_decision_tree,
args=(X_train, X_test, y_train))
for i in range(ntasks)]
# Close the process pool & wait for all jobs to complete
pool.close()
pool.join()
y_test_vote = vote(y_test_boot)
print("Balanced Accuracy: %.2f" % balanced_accuracy_score(y_test, y_test_vote))
print("Concurrent execution with processes, elapsed time:", time.time() - start)
import os
import os.path
import argparse
import re
import pandas as pd
if __name__ == "__main__":
# parse command line options
output = "word_count.csv"
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input',
help='list of input files.',
nargs='+', type=str)
parser.add_argument('-o', '--output',
help='output csv file (default %s)' % output,
type=str, default=output)
options = parser.parse_args()
if options.input is None :
parser.print_help()
raise SystemExit("Error: input files are missing")
else:
filenames = [f for f in options.input if os.path.isfile(f)]
# Match words
regex = re.compile("[a-zA-Z]+")
count = dict()
for filename in filenames:
fd = open(filename, "r")
for line in fd:
for word in regex.findall(line.lower()):
if not word in count:
count[word] = 1
(continues on next page)
fd = open(options.output, "w")
# Pandas
df = pd.DataFrame([[k, count[k]] for k in count], columns=["word", "count"])
df.to_csv(options.output, index=False)
2.10 Networking
# TODO
2.10.1 FTP
import ftplib
ftp = ftplib.FTP("ftp.cea.fr")
ftp.login()
ftp.cwd('/pub/unati/people/educhesnay/pystatml')
ftp.retrlines('LIST')
'221 Goodbye.'
import urllib
ftp_url = 'ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/README.md'
urllib.request.urlretrieve(ftp_url, os.path.join(tmpdir, "README2.md"))
2.10. Networking 43
Statistics and Machine Learning in Python, Release 0.8
2.10.2 HTTP
# TODO
2.10.3 Sockets
# TODO
2.10.4 xmlrpc
# TODO
Sources
• http://python-textbok.readthedocs.org/en/latest/Object_Oriented_Programming.html
Principles
• Encapsulate data (attributes) and code (methods) into objects.
• Class = template or blueprint that can be used to create objects.
• An object is a specific instance of a class.
• Inheritance: OOP allows classes to inherit commonly used state and behavior from other
classes. Reduce code duplication
• Polymorphism: (usually obtained through polymorphism) calling code is agnostic as to
whether an object belongs to a parent class or one of its descendants (abstraction, modu-
larity). The same method called on 2 objects of 2 different classes will behave differently.
class Shape2D:
def area(self):
raise NotImplementedError()
# Inheritance + Encapsulation
class Square(Shape2D):
def __init__(self, width):
self.width = width
def area(self):
return self.width ** 2
def area(self):
return math.pi * self.radius ** 2
Object creation
square = Square(2)
square.area()
# Polymorphism
print([s.area() for s in shapes])
s = Shape2D()
try:
s.area()
except NotImplementedError as e:
print("NotImplementedError", e)
[4, 28.274333882308138]
NotImplementedError
See PEP 8
• Spaces (four) are the preferred indentation method.
• Two blank lines for top level function or classes definition.
• One blank line to indicate logical sections.
• Never use: from lib import *
• Bad: Capitalized_Words_With_Underscores
• Function and Variable Names: lower_case_with_underscores
• Class Names: CapitalizedWords (aka: CamelCase)
2.13 Documenting
Parameters
----------
a : float
First operand.
b : float, optional
Second operand. The default is 2.
Returns
-------
Sum of operands.
Example
-------
>>> my_function(3)
5
"""
# Add a with b (this is a comment)
return a + b
print(help(my_function))
my_function(a, b=2)
This function ...
Parameters
----------
a : float
First operand.
b : float, optional
Second operand. The default is 2.
Returns
-------
(continues on next page)
Example
-------
>>> my_function(3)
5
None
"""
Created on Thu Nov 14 12:08:41 CET 2019
@author: firstname.lastname@email.com
Some description
"""
Python packages and modules structure python code into modular “libraries” to be shared.
2.14.1 Package
Packages are a way of structuring Python’s module namespace by using “dotted module names”.
A package is a directory (here, stat_pkg) containing a __init__.py file.
Example, package
stat_pkg/
__init__.py
datasets_mod.py
The __init__.py can be empty. Or it can be used to define the package API, i.e., the modules
(*.py files) that are exported and those that remain internal.
Example, file stat_pkg/__init__.py
2.14.2 Module
import numpy as np
def make_regression(n_samples=10, n_features=2, add_intercept=False):
...
return X, y, coef
Usage
X, y, coef = pkg.make_regression()
print(X.shape)
(10, 2)
import sys
sys.path.append("/home/ed203246/git/pystatsml/python_lang")
When developing a library (e.g., a python package) that is bound to evolve and being corrected,
we want to ensure that: (i) The code correctly implements some expected functionalities; (ii)
the modifications and additions don’t break those functionalities;
Unit testing is a framework to asses to those two points. See sources:
• Unit testing reference doc
• Getting Started With Testing in Python
import unittest
import numpy as np
from stat_pkg import make_regression
class TestDatasets(unittest.TestCase):
def test_make_regression(self):
X, y, coefs = make_regression(n_samples=10, n_features=3,
add_intercept=True)
self.assertTrue(np.allclose(X.shape, (10, 4)))
self.assertTrue(np.allclose(y.shape, (10, )))
self.assertTrue(np.allclose(coefs.shape, (4, )))
if __name__ == '__main__':
unittest.main()
python tests/test_datasets_mod.py
Unittest test discovery: (-m unittest discover) within (-s) tests directory, with verbose (-v)
outputs.
Doctest is an inbuilt test framework that comes bundled with Python by default. The doctest
module searches for code fragments that resemble interactive Python sessions and runs those
sessions to confirm they operate as shown. It promotes Test-driven (TDD) methodology.
1) Add doc test in the docstrings, see python stat_pkg/supervised_models.py:
class LinearRegression:
"""Ordinary least squares Linear Regression.
...
Examples
--------
>>> import numpy as np
>>> from stat_pkg import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.coef_
array([3., 1., 2.0])
>>> reg.predict(np.array([[3, 5]]))
array([16.])
"""
def __init__(self, fit_intercept=True):
self.fit_intercept = fit_intercept
...
2) Add the call to doctest module ad the end of the python file:
if __name__ == "__main__":
import doctest
doctest.testmod()
python stat_pkg/supervised_models.py
**********************************************************************
File ".../supervised_models.py", line 36, in __main__.LinearRegression
Failed example:
reg.coef_
Expected:
array([3., 1., 2.0])
Got:
array([3., 1., 2.])
(continues on next page)
2.16 Exercises
Create a function that acts as a simple calculator taking three parameters: the two operand
and the operation in “+”, “-”, and “*”. As default use “+”. If the operation is misspecified,
return a error message Ex: calc(4,5,"*") returns 20 Ex: calc(3,5) returns 8 Ex: calc(1, 2,
"something") returns error message
Given a list of numbers, return a list where all adjacent duplicate elements have been reduced
to a single element. Ex: [1, 2, 2, 3, 2] returns [1, 2, 3, 2]. You may create a new list or
modify the passed in list.
Remove all duplicate values (adjacent or not) Ex: [1, 2, 2, 3, 2] returns [1, 2, 3]
2.16. Exercises 51
Statistics and Machine Learning in Python, Release 0.8
THREE
NumPy is an extension to the Python programming language, adding support for large, multi-
dimensional (numerical) arrays and matrices, along with a large library of high-level mathe-
matical functions to operate on these arrays.
Numpy functions are executed by compiled in C or Fortran libraries, providing the perfor-
mance of compiled languages.
Sources: Kevin Markham
Computation time:
import numpy as np
import time
start_time = time.time()
l = [v for v in range(10 ** 8)]
s = 0
for v in l: s += v
print("Python code, time ellapsed: %.2fs" % (time.time() - start_time))
start_time = time.time()
arr = np.arange(10 ** 8)
arr.sum()
print("Numpy code, time ellapsed: %.2fs" % (time.time() - start_time))
53
Statistics and Machine Learning in Python, Release 0.8
Create ndarrays from lists. note: every element must be the same type (will be converted if
possible)
[[1 2 3 4]
[5 6 7 8]]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Examining arrays
3.1.2 Selection
np.int64(7)
Slicing
Syntax: start:stop:step with start (default 0) stop (default last) step (default 1)
• : is equivalent to 0:last:1; ie, take all elements, from 0 to the end with step = 1.
• :k is equivalent to 0:k:1; ie, take all elements, from 0 to k with step = 1.
• k: is equivalent to k:end:1; ie, take all elements, from k to the end with step = 1.
• ::-1 is equivalent to 0:end:-1; ie, take all elements, from k to the end in reverse order,
with step = -1.
arr2[0, 0] = 33
print(arr2)
print(arr)
[[2 3 4]
[6 7 8]]
[[33 3 4]
[ 6 7 8]]
[[ 1 33 3 4]
[ 5 6 7 8]]
print(arr[0, ::-1])
[ 4 3 33 1]
[[33 3 4]
[ 6 7 8]]
[[44 3 4]
[ 6 7 8]]
[[ 1 33 3 4]
[ 5 6 7 8]]
print(arr2)
arr2[0] = 44
print(arr2)
print(arr)
[33 6 7 8]
[44 6 7 8]
[[ 1 33 3 4]
[ 5 6 7 8]]
However, In the context of lvalue indexing (left hand side value of an assignment) Fancy autho-
rizes the modification of the original array
arr[arr > 5] = 0
print(arr)
[[1 0 3 4]
[5 0 0 0]]
General rules:
• Slicing always returns a view.
• Fancy indexing (boolean mask, integers) returns copy
• lvalue indexing i.e. the indices are placed in the left hand side value of an assignment,
provides a view.
Reshaping
(2, 5)
[[0. 1.]
[2. 3.]
[4. 5.]
[6. 7.]
[8. 9.]]
Add an axis
a = np.array([0, 1])
print(a)
a_col = a[:, np.newaxis]
print(a_col)
#or
a_col = a[:, None]
[0 1]
[[0]
[1]]
Transpose
print(a_col.T)
[[0 1]]
arr_flt = arr.flatten()
arr_flt[0] = 33
print(arr_flt)
print(arr)
[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[0. 1. 2. 3. 4.]
[5. 6. 7. 8. 9.]]
arr_flt = arr.ravel()
arr_flt[0] = 33
print(arr_flt)
print(arr)
[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[33. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]
a = np.array([0, 1])
b = np.array([2, 3])
Horizontal stacking
np.hstack([a, b])
array([0, 1, 2, 3])
Vertical stacking
np.vstack([a, b])
array([[0, 1],
[2, 3]])
Default Vertical
np.stack([a, b])
array([[0, 1],
[2, 3]])
Numpy internals: By default Numpy use C convention, ie, Row-major language: The matrix is
stored by rows. In C, the last index changes most rapidly as one moves through the array as
stored in memory.
For 2D arrays, sequential move in the memory will:
• iterate over rows (axis 0)
– iterate over columns (axis 1)
x = np.arange(2 * 3 * 4)
print(x)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
x = x.reshape(2, 3, 4)
print(x)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
print(x[0, :, :])
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
print(x[:, 0, :])
[[ 0 1 2 3]
[12 13 14 15]]
print(x[:, :, 0])
[[ 0 4 8]
[12 16 20]]
Ravel
print(x.ravel())
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
nums = np.arange(5)
nums * 10 # multiply each element by 10
nums = np.sqrt(nums) # square root of each element
np.ceil(nums) # also floor, rint (round to nearest int)
np.isnan(nums) # checks for NaN
nums + np.arange(5) # add element-wise
np.maximum(nums, np.array([1, -2, 3, -4, 5])) # compare element-wise
# random numbers
(continues on next page)
array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1])
3.1.6 Broadcasting
Rules
Starting with the trailing axis and working backward, Numpy compares arrays dimensions.
• If two dimensions are equal then continues
• If one of the operand has dimension 1 stretches it to match the largest one
• When one of the shapes runs out of dimensions (because it has less dimensions than
the other shape), Numpy will use 1 in the comparison process until the other shape’s
dimensions run out as well.
a = np.array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.array([0, 1, 2])
print(a + b)
[[ 0 1 2]
[10 11 12]
[20 21 22]
[30 31 32]]
a - a.mean(axis=0)
(a - a.mean(axis=0)) / a.std(axis=0)
Examples
Shapes of operands A, B and result:
A (2d array): 5 x 4
B (1d array): 1
Result (2d array): 5 x 4
A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4
A (3d array): 15 x 3 x 5
B (3d array): 15 x 1 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 1
Result (3d array): 15 x 3 x 5
3.1.7 Exercises
• For each column find the row index of the minimum value.
• Write a function standardize(X) that return an array whose columns are centered and
scaled (by std-dev).
Total running time of the script: (0 minutes 6.367 seconds)
It is often said that 80% of data analysis is spent on the cleaning and small, but important,
aspect of data manipulation and cleaning with Pandas.
Sources:
• Kevin Markham: https://github.com/justmarkham
• Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
• Series is a one-dimensional labeled array capable of holding any data type (inte-
gers, strings, floating point numbers, Python objects, etc.). The axis labels are col-
lectively referred to as the index. The basic method to create a Series is to call
pd.Series([1,3,5,np.nan,6,8])
• DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It
stems from the R data.frame() object.
import pandas as pd
import numpy as np
print(user3)
Concatenate DataFrame
Join DataFrame
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
Reshaping by pivoting
3.2.3 Summarizing
Meta-information
print(users.columns)
0 F
1 M
2 M
3 F
4 F
5 M
Name: gender, dtype: object
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[[0, 1], :] # Two first row
df = users[users.gender == "F"]
print(df)
df = users[:2].copy()
alice 19
eric 22
alice 19
eric 22
for i in range(df.shape[0]):
df.loc[i, "age"] *= 10 # df is modified
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]
3.2.8 Sorting
df = users.copy()
print(df)
print(df.describe())
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
df['job'].value_counts()
df['job'].value_counts(normalize=True).round(2)
job
student 0.50
engineer 0.17
manager 0.17
scientist 0.17
Name: proportion, dtype: float64
df['job'].str.len()
5 8
4 7
3 9
0 7
1 7
2 7
Name: job, dtype: int64
print(df.groupby("job")["age"].mean())
# print(df.groupby("job").describe(include='all'))
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
Groupby in a loop
df = users.copy()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data
df.describe(include='all')
name 0
age 0
gender 0
job 0
height 2
dtype: int64
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
3.2.12 Renaming
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})
Rename values
Assume random variable follows the normal distribution Exclude data outside 3 standard-
deviations: - Probability that a sample lies within 1 sd: 68.27% - Probability that a sample
lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
248.48963819938044
Median absolute deviation (MAD) is based on the median, is a robust non-parametric statistics.
173.80000467192673 178.7023568870694
csv
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv
˓→'
salary = pd.read_csv(url)
Excel
# Write
users.to_excel(xls_filename, sheet_name='users', index=False)
# Read
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
users.to_excel(writer, sheet_name='users', index=False)
df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)
import pandas as pd
import sqlite3
Connect
conn = sqlite3.connect(db_filename)
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv
˓→'
salary = pd.read_csv(url)
46
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
3.2.15 Exercises
Data Frame
Missing data
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fill all missing value of numerical column
with the mean of the current columns.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets:
original, imputed.
Total running time of the script: (0 minutes 0.854 seconds)
Sources:
• Matplotlib - Quick Guide
3.3.1 Parameter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Set style
print(plt.style.available)
plt.style.use('seaborn-v0_8-whitegrid')
˓→colorblind10']
plt.figure(figsize=(9, 3))
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()
plt.figure(figsize=(9, 3))
# Rapid multiplot
plt.figure(figsize=(9, 3))
cosinus = np.cos(x)
plt.plot(x, sinus, "-b", x, sinus, "ob", x, cosinus, "-r", x, cosinus, "or")
plt.xlabel('this is x!')
plt.ylabel('this is y!')
plt.title('My First Plot')
plt.show()
# Step by step
plt.figure(figsize=(9, 3))
plt.plot(x, sinus, label='sinus', color='blue', linestyle='--', linewidth=2)
plt.plot(x, cosinus, label='cosinus', color='red', linestyle='-', linewidth=2)
plt.legend()
plt.show()
Load dataset
import pandas as pd
try:
salary = pd.read_csv("../datasets/salary_table.csv")
except:
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_
˓→table.csv'
salary = pd.read_csv(url)
df = salary
print(df.head())
Legend outside
Linear model
# Prefer vectorial format (SVG: Scalable Vector Graphics) can be edited with
# Inkscape, Adobe Illustrator, Blender, etc.
plt.plot(x, sinus)
plt.savefig("sinus.svg")
plt.close()
# Or pdf
plt.plot(x, sinus)
plt.savefig("sinus.pdf")
plt.close()
Box plots are non-parametric: they display variation in samples of a statistical population with-
out making any assumptions of the underlying statistical distribution.
numpy.histogram can be used to probability density function at the each histogram bin, setting
density=True parameter.
Warning, histogram doesn’t sum to 1. Histogram as PDF estimator should be multiplied by dx’s
to sum to 1.
i = 0
for edu, d in salary.groupby(['education']):
sns.kdeplot(x="salary", hue="management", data=d, fill=True, ax=axes[i],␣
˓→palette="muted")
axes[i].set_title(edu)
i += 1
ax = sns.pairplot(salary, hue="management")
FOUR
import numpy as np
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
#import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
Sources:
• Patrick Walls course of Dept of Mathematics, University of British Columbia.
• Wikipedia
The derivative of a function at is the limit
𝑓 (𝑥 + ℎ) − 𝑓 (𝑥)
𝑓 ′ (𝑥) = lim
ℎ→0 ℎ
For a fixed step size ℎ, the previous formula provides the slope of the function using the forward
difference approximation of the derivative. Equivalently, the slope could be estimated using
backward approximation with positions 𝑥 − ℎ and 𝑥.
The most efficient numerical derivative use the central difference formula with step size is the
average of the forward and backward approximation (known as symmetric difference quotient):
(︂ )︂
′ 1 𝑓 (𝑎 + ℎ) − 𝑓 (𝑎) 𝑓 (𝑎) − 𝑓 (𝑎 − ℎ) 𝑓 (𝑎 + ℎ) − 𝑓 (𝑎 − ℎ)
𝑓 (𝑎) ≈ + =
2 ℎ ℎ 2ℎ
91
Statistics and Machine Learning in Python, Release 0.8
eps = np.finfo(np.float64).eps
print("Machine epsilon: {:e}, Min step size: {:e}".format(eps, np.cbrt(eps)))
2. The error of the central difference approximation is upper bounded by a function in 𝒪(ℎ2 ).
I.e., large step size ℎ = 10−2 leads to large error of 10−4 . Small step size e.g., ℎ = 10−4
provide accurate slope estimation in 10−8 .
Those two points argue for a step size ℎ ∈ [10−3 , 10−6 ]
Example: Numerical differentiation of the function:
7𝑥3 − 5𝑥 + 1
𝑓 (𝑥) = , 𝑥 ∈ [−5, 5]
2𝑥4 + 𝑥2 + 1
Numerical differentiation with Numpy gradient given values y and x (or spacing dx) of a
function.
range_ = [-5, 5]
dx = 1e-3
n = int((range_[1] - range_[0]) / dx)
x = np.linspace(range_[0], range_[1], n)
f = lambda x: (7 * x ** 3 - 5 * x + 1) / (2 * x ** 4 + x ** 2 + 1)
y = f(x) # values
dydx = np.gradient(y, dx) # values
import sympy as sp
from sympy import lambdify
x_s = sp.symbols('x', real=True) # defining the variables
plt.plot(x, y, label="f")
plt.plot(x[1:-1], dydx[1:-1], lw=4, label="f' Num. Approx.")
plt.plot(x, dfdx_sym(x), "--", label="f'")
plt.legend()
plt.show()
import numdifftools as nd
# Example f(x) = x ** 2
Example with 𝑓 = 𝑥3 − 27𝑥 − 1. We have 𝑓 ′ = 3𝑥2 − 27, with roots (−3, 3), and 𝑓 ′′ = 6𝑥, with
root 0.
Second-order derivative, “the rate of change of the rate of change” corresponds to the curvature
or concavity of the function.
x = np.linspace(range_[0], range_[1], n)
plt.plot(x, f(x), color=colors[0], lw=2, label="f")
plt.plot(x, dfdx(x), color=colors[1], label="f'")
plt.plot(x, df2dx2(x), color=colors[2], label="f''")
plt.axvline(x=-3, ls='--', color=colors[1])
plt.axvline(x= 3, ls='--', color=colors[1])
plt.axvline(x= 0, ls='--', color=colors[2])
plt.legend()
plt.show()
# Make data.
x = np.arange(-5, 5, 0.25)
y = np.arange(-5, 5, 0.25)
xx, yy = np.meshgrid(x, y)
zz = f([xx, yy])
# Plot
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
(continues on next page)
The Gradient at a given point x is the vector of partial derivative of 𝑓 at gives the direction of
fastest increase.
⎡ ⎤
𝜕𝑓 /𝜕𝑥1
∇𝑓 (x) = ⎣ ..
⎦,
⎢ ⎥
.
𝜕𝑓 /𝜕𝑥𝑝
f_grad = nd.Gradient(f)
print(f_grad([0, 0]))
print(f_grad([1, 1]))
print(f_grad([-1, 2]))
[0. 0.]
[2. 2.]
[-2. 4.]
The Hessian matrix contains the second-order partial derivatives of 𝑓 . It describes the local
curvature of a function of many variables. It is noted:
⎡ 𝜕2𝑓 𝜕2𝑓
⎤
2 . . .
⎢ 𝜕 𝑥1 . 𝜕𝑥𝑝 𝜕𝑥1 ⎥
𝑓 ′′ (x𝑘 ) = ∇2 𝑓 (x𝑘 ) = H𝑓 (x𝑘 ) = ⎢
⎣
.. ⎥,
⎦
2
𝜕 𝑓 𝜕 𝑓2
𝜕𝑥1 𝜕𝑥𝑝 . . . 𝜕 2 𝑥𝑝
H = nd.Hessian(f)([0, 0])
print(H)
[[2. 0.]
[0. 2.]]
import numpy as np
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
#import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
Methods for integrating functions given fixed samples: [(𝑥1 , 𝑓 (𝑥1 )), ..., (𝑥𝑖 , 𝑓 (𝑥𝑖 )), ...(𝑥𝑁 , 𝑓 (𝑥𝑁 )].
Riemann sums of rectangles to approximate the area.
𝑁
∑︁
𝑓 (𝑥*𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) , 𝑥*𝑖 ∈ [𝑥𝑖−1 , 𝑥𝑖 ]
𝑖=1
The error is in 𝒪( 𝑁1 )
f = lambda x : 1 / (1 + x ** 2)
a, b, N = 0, 5, 10
dx = (b - a) / N
x = np.linspace(a, b, N+1)
y = f(x)
a, b, N = 0, 5, 50
dx = (b - a) / N
x = np.linspace(a, b, N+1)
y = f(x)
print("Integral:", np.sum(f(x[:-1]) * np.diff(x)))
Integral: 1.4214653634808756
Trapezoid Rule sum the trapezoids connecting the points. The error is in 𝒪( 𝑁12 ). Use
scipy.integrate.trapezoid function:
np.float64(1.369466163161004)
np.float64(1.3694791829077122)
(1.3734007669450166, 7.167069904541812e-09)
The return values are the estimated of the integral and the estimate of the absolute integration
error.
Tools:
• Pandas
• Pandas user guide
• Time Series analysis (TSA) from statsmodels
References:
• Basic
• Detailed
• PennState Time Series course
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
2018-01-01 0
2019-01-01 1
2020-01-01 2
2021-01-01 3
2022-01-01 4
Freq: YS-JAN, dtype: int64
Stationarity
A TS is said to be stationary if its statistical properties such as mean, variance remain constant
over time.
• constant mean
• constant variance
• an autocovariance that does not depend on time.
what is making a TS non-stationary. There are 2 major reasons behind non-stationary of a TS:
1. Trend - varying mean over time. For eg, in this case we saw that on average, the number
of passengers was growing over time.
2. Seasonality - variations at specific time-frames. eg people might have a tendency to buy
cars in a particular month because of pay increment or festivals.
Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over time
while learning about trends and seasonality in time series data.
In the Facebook Live code along session on the 4th of January, we checked out Google trends
data of keywords ‘diet’, ‘gym’ and ‘finance’ to see how they vary over time. We asked ourselves
if there could be more searches for these terms in January when we’re all trying to turn over a
new leaf?
In this tutorial, you’ll go through the code that we put together during the session step by step.
You’re not going to do much mathematics but you are going to do the following:
• Read data
• Recode data
• Exploratory Data Analysis
Read data
try:
url = "https://github.com/datacamp/datacamp_facebook_live_ny_resolution/raw/
˓→master/data/multiTimeline.csv"
df = pd.read_csv(url, skiprows=2)
except:
df = pd.read_csv("../datasets/multiTimeline.csv", skiprows=2)
print(df.head())
# Rename columns
df.columns = ['month', 'diet', 'gym', 'finance']
# Describe
print(df.describe())
Recode data
Next, you’ll turn the ‘month’ column into a DateTime data type and make it the index of the
DataFrame.
Note that you do this because you saw in the result of the .info() method that the ‘Month’
column was actually an of data type object. Now, that generic data type encapsulates everything
from strings to integers, etc. That’s not exactly what you want when you want to be looking
at time series data. That’s why you’ll use .to_datetime() to convert the ‘month’ column in your
DataFrame to a DateTime.
Be careful! Make sure to include the in place argument when you’re setting the index of the
DataFrame df so that you actually alter the original index and set it to the ‘month’ column.
df.month = pd.to_datetime(df.month)
df.set_index('month', inplace=True)
df = df[["diet", "gym"]]
print(df.head())
diet gym
month
2004-01-01 100 31
2004-02-01 75 26
2004-03-01 67 24
2004-04-01 70 22
2004-05-01 72 22
df.plot()
plt.xlabel('Year');
Note that this data is relative. As you can read on Google trends:
Numbers represent search interest relative to the highest point on the chart for the given region
and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term
is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the
peak.
diet = df['diet']
diet_resamp_yr = diet.resample('YE').mean()
diet_roll_yr = diet.rolling(12).mean()
x = np.asarray(df[['diet']])
win = 12
win_half = int(win / 2)
diet_smooth = np.array([x[(idx-win_half):(idx+win_half)].mean()
for idx in np.arange(win_half, len(x))])
_ = plt.plot(diet_smooth)
df_trend.plot()
plt.xlabel('Year')
Text(0.5, 0, 'Year')
Text(0.5, 0, 'Year')
First-order approximation using diff method which compute original - shifted data:
df.diff().plot()
plt.xlabel('Year')
Text(0.5, 0, 'Year')
Correlation matrix
print(df.corr())
diet gym
diet 1.000000 -0.100764
gym -0.100764 1.000000
‘diet’ and ‘gym’ are negatively correlated! Remember that you have a seasonal and a trend
component. The correlation is actually capturing both of those. Decomposing into separate
components provides a better insight of the data:
Trends components that are negatively correlated:
df_trend.corr()
print(df_dtrend.corr())
print(df.diff().corr())
diet gym
diet 1.000000 0.600208
gym 0.600208 1.000000
diet gym
diet 1.000000 0.758707
gym 0.758707 1.000000
Seasonal_decompose function of statsmodels. “The results are obtained by first estimating the
trend by applying a using moving averages or a convolution filter to the data. The trend is
then removed from the series and the average of this de-trended series for each period is the
returned seasonal component.”
We use additive (linear) model, i.e., TS = Level + Trend + Seasonality + Noise
• Level: The average value in the series.
axis[0].plot(x, label='Original')
axis[0].legend(loc='best')
axis[1].plot(trend, label='Trend')
axis[1].legend(loc='best')
axis[2].plot(seasonal,label='Seasonality')
axis[2].legend(loc='best')
axis[3].plot(residual, label='Residuals')
axis[3].legend(loc='best')
plt.tight_layout()
A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Autocorrelation Function (ACF): It is a measure of the correlation between the TS with a lagged
version of itself. For instance at lag 5, ACF would compare series at time instant 𝑡 with series at
instant 𝑡 − ℎ.
• The autocorrelation measures the linear relationship between an observation and its pre-
vious observations at different lags (ℎ).
• Represents the overall correlation structure of the time series.
• Used to identify the order of a moving average (MA) process.
# We could have considered the first order differences to capture the seasonality
# x = df["gym"].astype(float).diff().dropna()
plt.plot(x)
plt.show()
• Partial autocorrelation measures the direct linear relationship between an observation and
its previous observations at a specific offset, excluding contributions from intermediate
offsets.
• Highlights direct relationships between observations at specific lags.
• Used to identify the order of an autoregressive (AR) process. The partial autocorrelation
of an AR(p) process equals zero at lags larger than p, so the appropriate maximum lag p
is the one after which the partial autocorrelations are all zero.
plot_pacf(x)
plt.show()
PACF peaks every 12 months, i.e., the signal is correlated with itself shifted by 12 months. Its,
then slowly decrease is due to the trend.
Sources:
• Simple modeling with AutoReg
The autoregressive orders. In general, we can define an AR(p) model with 𝑝 autoregressive
terms as follows:
𝑝
∑︁
𝑥𝑡 = 𝑎𝑖 𝑥𝑡−𝑖 + 𝜀𝑡
𝑖
# We set the frequency for the time series to “MS” (month-start) to avoid␣
˓→warnings when using AutoReg.
x = df_dtrend.gym.dropna().asfreq("MS")
ar1 = AutoReg(x, lags=1).fit()
print(ar1.summary())
axis[0].plot(x, label='Original')
axis[0].plot(ar1.predict(), label='AR(1)')
axis[0].legend(loc='best')
axis[1].plot(x, label='Original')
axis[1].plot(ar12.predict(), label='AR(12)')
_ = axis[1].legend(loc='best')
Automatic model selection using Akaike information criterion (AIC). AIC drops at 𝑝 = 12.
Fourier Analysis
• Fourier analysis is a mathematical method used to decompose functions or signals into
their constituent frequencies, known as sine and cosine components, that form a orthog-
onal basis.
𝑦(𝑡) = 𝐴 cos(2𝜋𝑓 𝑡 + 𝜑)
sr = 200
where
• N is the number of samples
• n ie the current sample
• k ie the current frequency, where 𝑘 ∈ [0, 𝑁 − 1]
• 𝑥𝑛 is the sine value at sample 𝑛
• 𝑋𝑘 are the (frequency terms or DCT) which include information of amplitude. It is called
the spectrum of the signal.
Relation between
• 𝑓𝑠 : Sampling rate or Frequency of Sampling, where
• 𝑇 : Duration
𝑁
𝑓𝑠 =
𝑇
Generate Signal, as an addition of three cosines at different frequencies: 1 Hz, 10 Hz, and 50
Hz:
T = 2. # duration
fs = 100 # Sampling rate/frequency: number of samples per second
ts = 1.0 / fs # sampling interval
t = np.arange(0, T, ts) # time axis
N = len(t)
# Generate Signal
x = 0
x += 3.0 * np.cos(2 * np.pi * 1.00 * t)
x += 1.0 * np.cos(2 * np.pi * 10.0 * t)
x += 1.0 * np.cos(2 * np.pi * 50.0 * t)
# Plot
Cosines Basis
N = len(x)
n = np.arange(N)
k = n.reshape((N, 1))
cosines = np.cos(2 * np.pi * k * n / N)
# Plot
plt.imshow(cosines[:100, :])
plt.xlabel('Time [s]')
plt.ylabel('Freq. [Hz]')
plt.title('Cosines Basis')
plt.show()
Decompose signal on cosine basis (dot product), i.e., DCT without signal normalization
X = np.dot(cosines, x)
# Frequencies = N / T
freqs = np.arange(N) / T
freq val
2 1.0 300.0
20 10.0 100.0
100 50.0 200.0
The amplitude is the and phase of the signal can be calculated as:
TODO FFT
x = 3 * np.sin(2 * np.pi * 1 * t)
x += np.sin(2 * np.pi * 4 * t)
x += 0.5* np.sin(2 * np.pi * 7 * t)
X = fft(x)
# Frequencies
plt.figure()
plt.subplot(121)
(continues on next page)
plt.subplot(122)
plt.plot(t, ifft(X), 'r')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()
freq val
1 1.0 150.0
4 4.0 50.0
7 7.0 25.0
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→matplotlib/cbook.py:1762: ComplexWarning: Casting complex values to real␣
return math.isfinite(val)
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→matplotlib/cbook.py:1398: ComplexWarning: Casting complex values to real␣
x = df['diet']
x -= x.mean()
x.plot()
X = fft(x)
# Frequencies
Xn = abs(X)
print(pd.Series(Xn, index=freqs).describe())
count 1.680000e+02
mean 6.938104e+01
std 7.745030e+01
min 5.115908e-13
25% 3.132051e+01
50% 4.524089e+01
75% 6.551233e+01
max 4.429661e+02
dtype: float64
freq_year freq_month val
2 0.142857 84.0 442.966103
14 1.000000 12.0 422.372698
28 2.000000 6.0 271.070102
42 3.000000 4.0 215.675682
56 4.000000 3.0 248.131014
70 5.000000 2.4 216.030794
84 6.000000 2.0 240.000000
Therefore, to minimize 𝑓 (𝑤𝑘 +𝑡) we just have to move in the opposite direction of the derivative
𝑓 ′ (𝑤𝑘 ):
𝑤𝑘+1 = 𝑤𝑘 − 𝛾𝑓 ′ (𝑤𝑘 )
With a learning rate 𝛾 that determines the step size at each iteration while moving toward a
minimum of a cost function.
In multidimensional problems w𝑘 ∈ R𝑝 , where:
⎡ ⎤
𝑤1
w𝑘 = ⎣ ... ⎦ ,
⎢ ⎥
𝑤𝑝 𝑘
With large learning rate 𝛾 we can cover more ground each step, but we risk overshooting the
lowest point since the slope of the hill is constantly changing.
With a very small learning rate**, we can confidently move in the direction of the negative
gradient since we are recalculating it so frequently. A small learning rate is more precise, but
calculating the gradient is time-consuming, leading too slow convergence
Fig. 3: jeremyjordan
Line search can be used (or more sophisticated Backtracking line search) to find value of 𝛾
such that 𝑓 (w𝑘 − 𝛾∇𝑓 (w𝑘 )) is minimum. However such simple method ignore possible change
of the curvature.
• Benefit of gradient decent: simplicity and versatility, almost any function with a gradient
can be minimized.
• Limitations:
– Local minima (local optimization) for non-convex problem.
– Convergence speed: With fast changing curvature (gradient direction) the estimation
of gradient will rapidly become wrong moving away from 𝑥𝑘 suggesting small step-
size. This also suggest the integration of change of the gradient direction in the
calculus of the step size.
Libraries
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import numdifftools as nd
# Plot
import matplotlib.pyplot as plt
from matplotlib import cm # color map
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * 1.)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
#%matplotlib inline
Parameters
----------
fun : callable
The objective function to be minimized.
x0 : ndarray, shape (n_features,)
Initial guess.
args : tuple, optional
Extra arguments passed to the objective function and its derivatives
(fun, jac and hess functions)
method : string, optional
the solver, by default "first-order" if the basic first-order gradient
descent.
jac : callable, optional
Method for computing the gradient vector, the Jacobian.,
(continues on next page)
Returns
-------
ndarray, shape (n_features,): the solution, intermediate_res dict
"""
# Initialize parameters
weights_k = x0.copy()
# Termination criteria
k, eps = 0, np.inf
# Dict to store intermediate results
intermediate_res = dict(eps=[], weights=[])
if options["intermediate_res"]:
intermediate_res["eps"].append(eps)
intermediate_res["weights"].append(weights_prev)
Minimize:
𝑓 (w) = 𝑓 (𝑥, 𝑦) = 𝑥2 + 𝑦 2 + 𝑥𝑦
[︂ ]︂ [︂ ]︂
𝜕𝑓 /𝜕𝑥 2𝑥 + 1
∇𝑓 (w) = = ,
𝜕𝑓 /𝜕𝑦 2𝑦 + 1
def f(x):
x = np.asarray(x)
x, y = (x[0], x[1]) if x.ndim == 1 else (x[:, 0], x[:, 1])
return x ** 2 + y ** 2 + 1 * x * y
def f_grad(x):
x = np.asarray(x, dtype=float)
x, y = (x[0], x[1]) if x.ndim == 1 else (x[:, 0], x[:, 1])
return np.array([2 * x + 1, 2 * y + 1])
f: [ 7 37]
f: 7 37
Grad f: [3. 3.]
Grad f: [3. 5.]
Grad f: [11. 11.]
x0 = np.array([30., 40.])
lr = 0.1
weights_sol, intermediate_res = \
gradient_descent(fun=f, x0=x0, jac=f_grad,
options=dict(learning_rate=lr,
maxiter=100,
intermediate_res=True))
res_ = pd.DataFrame(intermediate_res)
print(res_.head(5))
print(res_.tail(5))
print("Solution: ", weights_sol)
eps weights
0 102.820000 [30.0, 40.0]
1 65.804800 [23.9, 31.9]
2 42.115072 [19.02, 25.419999999999998]
3 26.953646 [15.116, 20.235999999999997]
(continues on next page)
# Figure
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.set_xlabel('x'); ax.set_ylabel('y')
return ax, (xx, yy, zz)
sols = crop(np.array(intermediate_res["weights"]),
x_range, y_range)
plot_path(sols[:, 0], sols[:, 1], f(sols), colors[0],
'lr:%.02f' % lr, ax)
lr = 0.1
weights_sol, intermediate_res = \
gradient_descent(fun=f, x0=x0, jac=f_grad,
options=dict(learning_rate=lr,
maxiter=10,
intermediate_res=True))
sols = crop(np.array(intermediate_res["weights"]),
x_range, y_range)
plot_path(sols[:, 0], sols[:, 1], f(sols), colors[1],
'lr:%.02f' % lr, ax)
lr = 0.9
weights_sol, intermediate_res = \
gradient_descent(fun=f, x0=x0, jac=f_grad,
options=dict(learning_rate=lr,
maxiter=10,
intermediate_res=True))
sols = crop(np.array(intermediate_res["weights"]),
x_range, y_range)
plot_path(sols[:, 0], sols[:, 1], f(sols), colors[2],
'lr:%.02f' % lr, ax)
lr = 1.
weights_sol, intermediate_res = \
gradient_descent(fun=f, x0=x0, jac=f_grad,
options=dict(learning_rate=lr,
maxiter=10,
intermediate_res=True))
sols = crop(np.array(intermediate_res["weights"]),
x_range, y_range)
plot_path(sols[:, 0], sols[:, 1], f(sols), colors[3],
'lr:%.02f' % lr, ax)
(continues on next page)
plt.legend()
plt.show()
# Numerical approximation
f_grad = nd.Gradient(f)
print(f_grad([1, 1]))
print(f_grad([1, 2]))
print(f_grad([5, 5]))
lr = 0.1
weights_sol, intermediate_res = \
gradient_descent(fun=f, x0=x0, jac=f_grad,
options=dict(learning_rate=lr,
maxiter=10,
intermediate_res=True))
res_ = pd.DataFrame(intermediate_res)
print(res_.head(5))
print(res_.tail(5))
print("Solution: ", weights_sol)
[3. 3.]
(continues on next page)
where:
• 𝑦𝑖 is the predicted output for the i-th sample,
• 𝑤𝑝 are the weights (parameters) of the model,
• 𝑥𝑖𝑝 is the p-th feature of the i-th sample.
The objective in least squares minimization is to minimize the following cost function 𝐽(w):
(︃ )︃2
1 ∑︁ ∑︁
𝐽(w) = 𝑦𝑖 − 𝑤𝑝 𝑥𝑖𝑝
2 𝑝
𝑖
Note that the gradient is also called the Jacobian which is the vector of first-order partial deriva-
tives of a scalar-valued function of several variables.
Parameters
----------
weights: coefficients of the linear model, (n_features) numpy array
X: input variables, (n_samples x n_features) numpy array
y: target variable, (n_samples,) numpy array
Returns
-------
Least Squared Error, scalar
"""
Parameters
----------
weights: coefficients of the linear model, (n_features) numpy array
X: input variables, (n_samples x n_features) numpy array
y: target variable, (n_samples,) numpy array
Returns
-------
Gradient array, shape (n_features,)
"""
y_pred = np.dot(X, weights)
err = y_pred - y
grad = np.dot(err, X)
return grad
import numpy as np
n_sample, n_features = 100, 2
X = np.random.randn(n_sample, n_features)
weights = np.array((3, 2))
y = np.dot(X, weights)
lr = 0.01
weights_sol, intermediate_res = \
gradient_descent(fun=lse, x0=np.zeros(weights.shape), args=(X, y),
jac=gradient_lse_lr,
options=dict(learning_rate=lr,
maxiter=15,
(continues on next page)
import pandas as pd
print(pd.DataFrame(intermediate_res))
print("Solution: ", weights_sol)
eps weights
0 1.827068e+01 [0.0, 0.0]
1 2.533290e+00 [2.9606455406811665, 3.083060577054058]
2 5.688808e-01 [2.9474878311671824, 1.4914837597307526]
3 1.285869e-01 [3.0215058147349247, 2.242084930205261]
4 2.906710e-02 [2.989606930891622, 1.884916456416018]
5 6.570631e-03 [3.004933187739197, 2.0547169361452395]
6 1.485294e-03 [2.997654130404577, 1.9739849976431851]
7 3.357514e-04 [3.0011153188655446, 2.01236877321922]
8 7.589676e-05 [2.9994697233396193, 1.994119296045902]
9 1.715650e-05 [3.000252118741238, 2.0027959668215187]
10 3.878234e-06 [2.999880130737226, 1.9986706641729648]
11 8.766766e-07 [3.0000569915580053, 2.0006320295819435]
12 1.981731e-07 [2.9999729034982328, 1.999699503026759]
13 4.479713e-08 [3.0000128829678214, 2.0001428705768003]
14 1.012641e-08 [2.999993874823351, 1.9999320725214131]
Solution: [3.00000291 2.0000323 ]
Newton’s method integrates the change of the curvature (ie, change of gradient direction) in the
minimization process. Since gradient direction is the change of 𝑓 , i.e., the first order derivative,
thus the change of gradient is second order derivative of 𝑓 . See Visually Explained: Newton’s
Method in Optimization
For univariate functions. Like gradient descent Newton’s method try to locally minimize 𝑓 (𝑤𝑘 +
𝑡) given a current position 𝑤𝑘 . However, while gradient descent use first order local estimation
of 𝑓 , Newton’s method increases this approximation using second-order Taylor expansion of 𝑓
around an iterate 𝑤𝑘 :
1
𝑓 (𝑤𝑘 + 𝑡) ≈ 𝑓 (𝑤𝑘 ) + 𝑓 ′ (𝑤𝑘 )𝑡 + 𝑓 ′′ (𝑤𝑘 )𝑡2
2
Cancelling the derivative of this expression: 𝑑𝑡 𝑑
(𝑓 (𝑤𝑘 ) + 𝑓 ′ (𝑤𝑘 )𝑡 + 12 𝑓 ′′ (𝑤𝑘 )𝑡2 ) = 0, provides
′
𝑓 ′ (𝑤𝑘 ) + 𝑓 ′′ (𝑤𝑘 )𝑡 = 0, and thus 𝑡 = 𝑓𝑓′′(𝑤 𝑘) 1
(𝑤𝑘 ) . The learning rate is 𝛾 = 𝑓 ′′ (𝑤𝑘 ) , and optimization
scheme becomes:
1
𝑤𝑘+1 = 𝑤𝑘 − 𝑓 ′ (𝑤𝑘 ).
𝑓 ′′ (𝑤𝑘 )
𝐻 = ∇2 𝐽(w)
For the least squares cost function, 𝐽(w), the Hessian is calculated as follows:
The Hessian 𝐻 is the matrix of second derivatives of 𝐽(w) with respect to 𝑤𝑝 and 𝑤𝑞 . 𝐻 is a
measure of the curvature of 𝐽: The eigenvectors of 𝐻 point in the directions of the major and
minor axes. The eigenvalues measure the steepness of 𝐽 along the corresponding eigendirec-
tion. Thus, each eigenvalue of 𝐻 is also a measure of the covariance or spread of the inputs
along the corresponding eigendirection.
𝜕 2 𝐽(w)
𝐻𝑝𝑞 =
𝜕𝑤𝑝 𝜕𝑤𝑞
Given the form of the gradient, the second derivative with respect to 𝑤𝑝 and 𝑤𝑞 simplifies to:
∑︁
𝐻𝑝𝑞 = 𝑥𝑖𝑝 𝑥𝑖𝑞
𝑖
𝐻 = X𝑇 X
where 𝑋 is the matrix of input features (each row corresponds to a sample, and each column
corresponds to a feature) with 𝑋𝑖𝑝 = 𝑥𝑖𝑝 .
In this case the Hessian turns out the be the same as the covariance matrix of the inputs. Thus,
each eigenvalue of 𝐻 is also a measure of the covariance or spread of the inputs along the
corresponding eigendirection.
Parameters
----------
weights: coefficients of the linear model, (n_features) numpy array
It is not used, you can safely give None.
X: input variables, (n_samples x n_features) numpy array
y: target variable, (n_samples,) numpy array
Returns
-------
Hessian array, shape (n_features, n_features)
"""
return np.dot(X.T, X)
weights_sol, intermediate_res = \
gradient_descent(fun=lse, x0=np.zeros(weights.shape), args=(X, y),
jac=gradient_lse_lr, hess=hessian_lse_lr,
options=dict(learning_rate=0.01,
maxiter=15,
intermediate_res=True))
print(pd.DataFrame(intermediate_res))
print("Solution: ", weights_sol)
eps weights
0 1.827068e+01 [0.0, 0.0]
1 2.533290e+00 [2.9606455406811665, 3.083060577054058]
2 5.688808e-01 [2.9474878311671824, 1.4914837597307526]
3 1.285869e-01 [3.0215058147349247, 2.242084930205261]
4 2.906710e-02 [2.989606930891622, 1.884916456416018]
5 6.570631e-03 [3.004933187739197, 2.0547169361452395]
6 1.485294e-03 [2.997654130404577, 1.9739849976431851]
7 3.357514e-04 [3.0011153188655446, 2.01236877321922]
8 7.589676e-05 [2.9994697233396193, 1.994119296045902]
9 1.715650e-05 [3.000252118741238, 2.0027959668215187]
10 3.878234e-06 [2.999880130737226, 1.9986706641729648]
11 8.766766e-07 [3.0000569915580053, 2.0006320295819435]
12 1.981731e-07 [2.9999729034982328, 1.999699503026759]
13 4.479713e-08 [3.0000128829678214, 2.0001428705768003]
14 1.012641e-08 [2.999993874823351, 1.9999320725214131]
Solution: [3.00000291 2.0000323 ]
Note, that we provide the function to be minimized (Mean Squared Error) but the expression
of the gradient which is estimated numerically.
There are three variants of gradient descent, which differ on the use of the dataset made of 𝑛
samples of input data x𝑖 ’s, and possibly their corresponding targets 𝑦𝑖 ’s.
Batch gradient descent, known also as Vanilla gradient descent, computes the gradient of the
cost function with respect to the parameters 𝜃 for the entire training dataset :
• Choose an initial vector of parameters w0 and learning rate 𝛾.
• Repeat until an approximate minimum is obtained:
– w𝑘+1 = w𝑘 − 𝛾 𝑛𝑖=1 ∇𝑓 (w𝑘 , x𝑖 , 𝑦𝑖 )
∑︀
Advantages:
• Batch Gradient Descent is suited for convex or relatively smooth error manifolds. Since it
directly towards an optimum solution.
Limitations:
• Fast convergence toward “bad” local minimum (on non-convex functions)
• As we need to calculate the gradients for the whole dataset is intractable for datasets that
don’t fit in memory and doesn’t allow us to update our model online.
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training
example 𝑥(𝑖) and 𝑦 (𝑖) . A complete passes through the training dataset is called an epoch. The
number of epochs is a hyperparameter to be determined observing the convergence.
• Choose an initial vector of parameters w0 and learning rate 𝛾.
• Repeat epochs until an approximate minimum is obtained:
– Randomly shuffle examples in the training set.
– For 𝑖 ∈ 1, . . . , 𝑛
Mini-batch gradient descent finally takes the best of both worlds and performs an update for
every mini-batch (subset of) training samples:
• Divide the training set in subsets of size 𝑚.
• Choose an initial vector of parameters w0 and learning rate 𝛾.
• Repeat epochs until an approximate minimum is obtained:
– Randomly pick a mini-batch.
– For each mini-batch 𝑏
∑︀𝑏+𝑚
* w𝑘+1 = w𝑘 − 𝛾 𝑖=𝑏 ∇𝑓 (w𝑘 , x𝑖 , 𝑦𝑖 )
Advantages:
• Reduces the variance of the parameter updates, which can lead to more stable conver-
gence.
• Make use of highly optimized matrix optimizations common to state-of-the-art deep learn-
ing libraries that make computing the gradient very efficient. Common mini-batch sizes
range between 50 and 256, but can vary for different applications.
Mini-batch gradient descent is typically the algorithm of choice when training a neural network.
SGD has trouble navigating ravines (areas where the surface curves much more steeply in one
dimension than in another), which are common around local optima. In these scenarios, SGD
oscillates across the slopes of the ravine while only making hesitant progress, along the bottom
towards the local optimum as in the image below.
Source
v = 0
while True:
dw = gradient(J, w)
vx = beta * v + dw
w -= learning_rate * v
Note: The momentum term :math:`beta` is usually set to 0.9 or a similar value.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momen-
tum as it rolls downhill, becoming faster and faster on the way, until it reaches its terminal
velocity if there is air resistance, i.e. 𝛽 <1.
The same thing happens to our parameter updates: The momentum term increases for dimen-
sions whose gradients point in the same directions and reduces updates for dimensions whose
gradients change directions. As a result, we gain faster convergence and reduced oscillation.
• Added element-wise scaling of the gradient based on the historical sum of squares in each
dimension.
• “Per-parameter learning rates” or “adaptive learning rates”
grad_squared = 0
while True:
dw = gradient(J, w)
grad_squared += dw * dw
w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
grad_squared = 0
while True:
dw = gradient(J, w)
grad_squared += decay_rate * grad_squared + (1 - decay_rate) * dw * dw
w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d
like to have a smarter ball, a ball that has a notion of where it is going so that it knows to
slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) is a way to
give our momentum term this kind of prescience. We know that we will use our momentum
term 𝛾𝑣𝑡−1 to move the parameters 𝜃.
Computing 𝜃 − 𝛾𝑣𝑡−1 thus gives us an approximation of the next position of the parameters
(the gradient is missing for the full update), a rough idea where our parameters are going to
be. We can now effectively look ahead by calculating the gradient not w.r.t. to our current
parameters 𝜃 but w.r.t. the approximate future position of our parameters:
Again, we set the momentum term 𝛾 to a value of around 0.9. While Momentum first com-
putes the current gradient and then takes a big jump in the direction of the updated
accumulated gradient , NAG first makes a big jump in the direction of the previous ac-
cumulated gradient, measures the gradient and then makes a correction, which results
in the complete NAG update. This anticipatory update prevents us from going too fast and
results in increased responsiveness, which has significantly increased the performance of
RNNs on a number of tasks
Adam
Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for
each parameter. In addition to storing an exponentially decaying average of past squared
gradients :math:`v_t`, Adam also keeps an exponentially decaying average of past gradi-
ents :math:`m_t`, similar to momentum. Whereas momentum can be seen as a ball running
down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima
in the error surface. We compute the decaying averages of past and past squared gradients m𝑡
and v𝑡 respectively as follows:
m𝑡 and v𝑡 are estimates of the first moment (the mean) and the second moment (the uncentered
variance) of the gradients respectively, hence the name of the method. Adam (almost)
first_moment = 0
second_moment = 0
while True:
dx = gradient(J, x)
# Momentum:
first_moment = beta1 * first_moment + (1 - beta1) * dx
# AdaGrad/RMSProp
second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
As m𝑡 and v𝑡 are initialized as vectors of 0’s, the authors of Adam observe that they are biased
towards zero, especially during the initial time steps, and especially when the decay rates are
small (i.e. 𝛽1 and 𝛽2 are close to 1). They counteract these biases by computing bias-corrected
first and second moment estimates:
𝑚𝑡
𝑚
ˆ𝑡 = (4.4)
1 − 𝛽1𝑡
𝑣𝑡
𝑣ˆ𝑡 = (4.5)
1 − 𝛽2𝑡
They then use these to update the parameters (Adam update rule):
𝜂
𝜃𝑡+1 = 𝜃𝑡 − √ 𝑚
ˆ𝑡
𝑣ˆ𝑡 + 𝜖
• 𝑚
ˆ 𝑡 Accumulate gradient: velocity.
• 𝑣ˆ𝑡 Element-wise scaling of the gradient based on the historical sum of squares in each
dimension.
• Choose Adam as default optimizer
• Default values of 0.9 for 𝛽1 , 0.999 for 𝛽2 , and 10−7 for 𝜖.
• learning rate in a range between 1𝑒 − 3 and 5𝑒 − 4
4.4.6 Conclusion
Sources:
• LeCun Y.A., Bottou L., Orr G.B., Müller KR. (2012) Efficient BackProp. In: Montavon
G., Orr G.B., Müller KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in
Computer Science, vol 7700. Springer, Berlin, Heidelberg
• Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning:
Gradient Descent with Momentum, ADAGRAD and ADAM.
Summary:
• Choosing a proper learning rate can be difficult. A learning rate that is too small leads to
painfully slow convergence, while a learning rate that is too large can hinder convergence
and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules try to adjust the learning rate during training by e.g. annealing,
i.e. reducing the learning rate according to a pre-defined schedule or when the change
in objective between epochs falls below a threshold. These schedules and thresholds,
however, have to be defined in advance and are thus unable to adapt to a dataset’s char-
acteristics.
• Additionally, the same learning rate applies to all parameter updates. If our data is sparse
and our features have very different frequencies, we might not want to update all of them
to the same extent, but perform a larger update for rarely occurring features.
• Another key challenge of minimizing highly non-convex error functions common for neu-
ral networks is avoiding getting trapped in their numerous suboptimal local minima.
These saddle points are usually surrounded by a plateau of the same error, which makes
it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
Recommendation:
• Shuffle the examples (SGD)
• Center the input variables by subtracting the mean
• Normalize the input variable to a standard deviation of 1
• Initializing the weight
• Adaptive learning rates (momentum), using separate learning rate for each weight
FIVE
STATISTICS
5.1.1 Libraries
Statistics
• Descriptive statistics and distributions: Numpy
• Distributions and tests: scipy.stats
• Advanced statistics (linear models, tests, time series): Statsmodels, see also Statsmodels
API:
– statsmodels.api: Imported using import statsmodels.api as sm.
– statsmodels.formula.api: A convenience interface for specifying models using for-
mula strings and DataFrames. Canonically imported using import statsmodels.
formula.api as smf
– statsmodels.tsa.api: Time-series models and methods. Canonically imported us-
ing import statsmodels.tsa.api as tsa.
# Manipulate data
import numpy as np
import pandas as pd
# Statistics
import scipy.stats
import statsmodels.api as sm
#import statsmodels.stats.api as sms
(continues on next page)
137
Statistics and Machine Learning in Python, Release 0.8
Plots
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
Datasets
Salary
try:
salary = pd.read_csv("../datasets/salary_table.csv")
except:
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_
˓→table.csv'
salary = pd.read_csv(url)
Iris
Mean
The estimator 𝑥
¯ on a sample of size 𝑛: 𝑥 = 𝑥1 , ..., 𝑥𝑛 is given by
1 ∑︁
𝑥
¯= 𝑥𝑖
𝑛
𝑖
Variance
Note here the subtracted 1 degree of freedom (df) in the divisor. In standard statistical practice,
𝑑𝑓 = 1 provides an unbiased estimator of the variance of a hypothetical infinite population.
With 𝑑𝑓 = 0 it instead provides a maximum likelihood estimate of the variance for normally
distributed variables.
Standard deviation
√︀
Std(𝑋) = Var(𝑋)
√︀
The estimator is simply 𝜎𝑥 = 𝜎𝑥2 .
Covariance
Correlation
Cov(𝑋, 𝑌 )
Cor(𝑋, 𝑌 ) =
Std(𝑋) Std(𝑌 )
The estimator is
𝜎𝑥𝑦
𝜌𝑥𝑦 = .
𝜎𝑥 𝜎𝑦
The standard error (SE) is the standard deviation (of the sampling distribution) of a statistic:
Std(𝑋)
SE(𝑋) = √ .
𝑛
• Generate 2 random samples: 𝑥 ∼ 𝑁 (1.78, 0.1) and 𝑦 ∼ 𝑁 (1.66, 0.1), both of size 10.
¯, 𝜎𝑥 , 𝜎𝑥𝑦 (xbar, xvar, xycov) using only the np.sum() operation.
• Compute 𝑥
Explore the np. module to find out which Numpy functions performs the same computations
and compare them (using assert) with your previous results.
Caution! By default np.var() used the biased estimator (with ddof=0). Set ddof=1 to use
unbiased estimator.
n = 10
np.random.seed(seed=42) # make the example reproducible
x = np.random.normal(loc=1.78, scale=.1, size=n)
y = np.random.normal(loc=1.66, scale=.1, size=n)
xbar = np.mean(x)
assert xbar == np.sum(x) / x.shape[0]
Covariance
xycov = np.cov(x, y)
print(xycov)
ybar = np.sum(y) / n
(continues on next page)
[[ 0.00522741 -0.00060351]
[-0.00060351 0.00570515]]
With Pandas
Columns’ means
SepalLength 5.843333
SepalWidth 3.057333
PetalLength 3.758000
PetalWidth 1.199333
dtype: float64
SepalLength 0.828066
SepalWidth 0.435866
PetalLength 1.765298
PetalWidth 0.762238
dtype: float64
With Numpy
Columns’ std-dev. Numpy normalizes by N by default. Set ddof=1 to normalize by N-1 to get
the unbiased estimator.
X.std(axis=0, ddof=1)
numpy.histogram can be used to probability density function at the each histogram bin, setting
density=True parameter. Warning, histogram doesn’t sum to 1. Histogram as PDF estimator
should be multiplied by dx’s to sum to 1.
x = np.random.normal(size=50000)
hist, bins = np.histogram(x, bins=50, density=True)
dx = np.diff(bins)
print("Sum(Hist)=", np.sum(hist), "Sum(Hist * dx)=", np.sum(hist * dx))
TODO
Normal distribution
The normal distribution, noted 𝒩 (𝜇, 𝜎) with parameters: 𝜇 mean (location) and 𝜎 > 0 std-dev.
Estimators: 𝑥
¯ and 𝜎𝑥 .
The normal distribution, noted 𝒩 , is useful because of the central limit theorem (CLT) which
states that: given certain conditions, the arithmetic mean of a sufficiently large number of iter-
ates of independent random variables, each with a well-defined expected value and well-defined
variance, will be approximately normally distributed, regardless of the underlying distribution.
Documentation:
• numpy.random.normal
• scipy.stats.norm
Random number generator using Numpy
# using numpy:
x = np.random.normal(loc=10, scale=10, size=(3, 2))
# PDF: P(values)
pdf_x_range = scipy.stats.norm.pdf(x_range, loc=mean, scale=sd)
['P(X<-1.96)=2.5%',
'P(X<-1.28)=10.0%',
'P(X<-0.67)=25.0%',
'P(X<0.00)=50.0%',
'P(X<0.67)=75.0%',
'P(X<1.28)=90.0%',
'P(X<1.96)=97.5%']
label="CDF=P(X<{:.02f})={:.01%}".format(x_for_percentile_of_
˓→cdf,
percentile_of_cdf),
color='r')
_ = plt.legend()
The chi-square or 𝜒2𝑛 distribution with 𝑛 degrees of freedom (df) is the distribution of a sum of
the squares of 𝑛 independent standard normal random variables 𝒩 (0, 1). Let 𝑋 ∼ 𝒩 (𝜇, 𝜎 2 ),
then, 𝑍 = (𝑋 − 𝜇)/𝜎 ∼ 𝒩 (0, 1), then:
• The squared standard 𝑍 2 ∼ 𝜒21 (one df).
∑︀𝑛
• The distribution of sum of squares of 𝑛 normal random variables: 𝑖 𝑍𝑖2 ∼ 𝜒2𝑛
The sum of two 𝜒2 RV with 𝑝 and 𝑞 df is a 𝜒2 RV with 𝑝 + 𝑞 df. This is useful when sum-
ming/subtracting sum of squares.
The 𝜒2 -distribution is used to model errors measured as sum of squares or the distribution of
the sample variance.
The chi-squared distribution is a special case of the gamma distribution, with gamma parame-
ters a = df/2, loc = 0 and scale = 2.
Documentation: - numpy.random.chisquare - scipy.stats.chi2
The 𝐹 -distribution, 𝐹𝑛,𝑝 , with 𝑛 and 𝑝 degrees of freedom is the ratio of two independent 𝜒2
variables. Let 𝑋 ∼ 𝜒2𝑛 and 𝑌 ∼ 𝜒2𝑝 then:
𝑋/𝑛
𝐹𝑛,𝑝 =
𝑌 /𝑝
The 𝐹 -distribution plays a central role in hypothesis testing answering the question: Are two
variances equals?, is the ratio or two errors significantly large ?.
Documentation: - scipy.stats.f
Let 𝑀 ∼ 𝒩 (0, 1) and 𝑉 ∼ 𝜒2𝑛 . The 𝑡-distribution, 𝑇𝑛 , with 𝑛 degrees of freedom is the ratio:
𝑀
𝑇𝑛 = √︀
𝑉 /𝑛
The distribution of the difference between an estimated parameter and its true (or assumed)
value divided by the standard deviation of the estimated parameter (standard error) follow a
𝑡-distribution.
Documentation: scipy.stats.t
Let 𝑆𝑛 = 𝑛𝑖 𝑋𝑖 the sum of those RV. Then, the sum of RV converge in distribution to a normal
∑︀
distribution:
∑︀𝑛
𝑋𝑖 √
𝑆𝑛 = 𝑖 → 𝒩 (𝑛𝜇𝑋 , 𝑛𝜎𝑋 )
𝑛
Note that the centered
∑︀𝑛 and scaled sum converge in distribution to a normal distribution of
𝑖√𝑋𝑖 −𝑛𝜇
parameters 0, 1: 𝑛𝜎
→ 𝒩 (0, 1)
𝑋
Examples
n_sample = 1000
n_repeat = 10000
# Xn's
xn_s = np.array([scipy.stats.uniform.rvs(size=n_sample).sum() for i in range(n_
˓→repeat)])
# Normal distribution
x_range = np.linspace(-3, 3, 30)
prob_x_range = scipy.stats.norm.pdf(x_range, loc=0, scale=1)
plt.plot(x_range, prob_x_range, 'r-', label="N(0, 1)")
_ = plt.legend()
n_sample = 1000
n_repeat = 10000
# Xn's
xn_s = np.array([scipy.stats.expon.rvs(size=n_sample).sum() for i in range(n_
˓→repeat)])
# Normal distribution
x_range = np.linspace(-3, 3, 30)
prob_x_range = scipy.stats.norm.pdf(x_range, loc=0, scale=1)
plt.plot(x_range, prob_x_range, 'r-', label="N(0, 1)")
_ = plt.legend()
Central Limit Theorem also apply for the sample mean: Let i.i.d. samples 𝑋𝑖 from from almost
any distribution of parameters 𝜇𝑋 , 𝜎𝑋 . Then the sample mean 𝑋, ¯ for samples of size 30 or
more, is approximately normally distributed:
∑︀𝑛
¯ 𝑋𝑖 𝜎𝑋
𝑋= 𝑖 → 𝒩 (𝜇𝑋 , √ )
𝑛 𝑛
Simple but useful demonstrations:
𝑛
¯ 𝑛 ] =𝐸[ 1 ∑︁
𝐸[𝑋 𝑋𝑖 ]
𝑛
𝑖
𝑛
1 ∑︁
𝐸[𝑋𝑖 ], 𝑋𝑖 are i.i.d., i.e., 𝐸[𝑋𝑖 ] = 𝜇𝑋 ∀𝑖
𝑛
𝑖
1
𝑛𝜇𝑋
𝑛
𝜇𝑋
𝑛
¯ 𝑛 ] =𝑉 𝑎𝑟[ 1
∑︁
𝑉 𝑎𝑟[𝑋 𝑋𝑖 ]
𝑛
𝑖
2
1 ∑︁
( )2 2
, 𝑋𝑖 are i.i.d., i.e., 𝑉 𝑎𝑟[𝑋𝑖 ] = 𝜎𝑋 ∀𝑖
𝑛
𝑋
1
( )2 𝑛𝜎𝑋
2
𝑛
2
𝜎𝑋 /𝑛
√
¯
𝑆𝑑[𝑋𝑛 ] =𝜎𝑋 / 𝑛
Note that the standard deviation of the sample mean is the standard deviation of the parent
√
RV scaled by 𝑛. The larger the sample size, the better the approximation.
The Central Limit Theorem is illustrated for several common population distributions in The
Sampling Distribution of the Sample Mean.
n_sample = 1000
n_repeat = 10000
# Xbar's
xbar_s = np.array([scipy.stats.binom.rvs(n=n, p=p, size=n_sample).mean()
for i in range(n_repeat)])
Inferential statistics involves the use of a sample (1) to estimate some characteristic in a large
population; and (2) to test a research hypothesis about a given population.
Typology of tests
E.g., the height of males and females can be represented by their means, i.e., assuming two
normal distributions. Then fit the model to the data, i.e., estimate the model parameters (fre-
quency, mean, correlation, regression coefficient). E.g., compute the means of females and
males height.
• Formulate the null hypothesis 𝐻0 , i.e., what would be situation under pure chance? E.g.,
if sex has no effect on individuals’ height males and females means height will be equals.
• Derive a test statistic on the data capturing deviation from null hypothesis taking account
the number of samples. For parametric statistics, the test statistic is derived from model
parameters, e.g., the differences of means of males and females height, taking account
the number of samples.
3. Inference
Assess the deviation of the test statistic from its expected value under 𝐻0 (pure chance). Two
possibilities:
P-value based on null hypothesis:
What is the probability that the computed test statistic 𝑋 ¯ would be observed under pure chance?
I.e., What is the probability that the test statistics under 𝐻0 would be more extreme, i.e., “larger”
or “smaller” than 𝑋?¯ * Calculate the distribution of test statistic 𝑋 under 𝐻0 . * Compute the
probability (P-value) to obtain a larger test statistic by chance (under the null hypothesis).
¯ ≤ 𝑋|𝐻0 ) = 𝑃/2, or -
For a symmetric distribution the two sided p-value 𝑃 is defined as: - 𝑃 (𝑋
¯ = 𝑃/2
𝑃 (𝑋 ≥ 𝑋)
𝑋¯ is declared to be significantly different to the null hypothesis if the p-value is less than a
significance level 𝛼 generally considered as 5%.
Confidence interval (CI)
CI is a range of values 𝑥1 , 𝑥2 that is likely (given a confidence level, e.g., 95%) to contain the
¯ Outside this range the value is considered to be unlikely. Note that
true value of the statistic 𝑋.
the confidence level is 1 − 𝛼, the significance level. See Interpreting Confidence Intervals
¯ < 𝑥2 ) = 95%.
The 95% CI (Confidence Interval) is the range of values 𝑥1 , 𝑥2 such 𝑃 (𝑥1 < 𝑋
For a symmetric distribution the two sided 95%(= 1 − 5%) confidence interval, is defined as: -
¯ ≤ 𝑥1 ) = 2.5% = 5%/2 - 𝑥2 such 𝑃 (𝑥2 ≤ 𝑋)
𝑥1 such 𝑃 (𝑋 ¯ = 2.5%
Terminology
¯ − 𝑥1 (for symmetric distribution).
• Margin of error = 𝑋
• Confidence Interval = [𝑥1 , 𝑥2 ].
• Confidence level is 1 - significance level
Simplified example (small sample) of the the binomial test: Three voters are questioned
about their vote. Two voted for candidate A and one for B. How likely this difference Is this a
significant difference,
1. Model the data: Let 𝑥 the number of vote for A. It follows a Binomial distribution. Compute
the model parameters: 𝑁 = 3, and 𝑝ˆ = 2/3 (the frequency of number of vote A over the number
of voters).
2. Compute a test statistic measuring the deviation of the number of vote for A (𝑥 = 2) over
three voters from the expected values under the null hypothesis where 𝑥 would be 1.5. Similarly,
we could consider the deviation of the observed proportion 𝑝ˆ = 2/3 from 𝜋0 = 50%.
3. To make inference, we have to compute the probability to obtain more than two votes for A
by chance. We need the distribution of 𝑥 under 𝐻0 (𝑃 (𝑥|𝐻0)) to sum all the probabilities where
𝑥 is larger or equal to 2, i.e., 𝑃 (𝑥 ≥ 2|𝐻0 ). With such small sample size (𝑛 = 3) this distribution
is obtained by enumerating all configurations that produce a given number of heads 𝑥:
H 1
H 1
H 1
H H 2
H H 2
H H 2
H H H 3
• 𝑃 (𝑥 = 2) = 3/8
• 𝑃 (𝑥 = 3) = 1/8
Plot of the distribution of the number of 𝑥 (A vote over 3 voters) under the null hypothesis:
Finally, we compute the probability (P-value) to observe a value larger or equal that 𝑥 = 2 (or
𝑃 = 2/3) under the null hypothesis? This probability is the 𝑝-value:
P-value = 0.5, meaning that there is 50% of chance to get 𝑥 = 2 or larger by chance.
large sample example: 100 voters are questioned about their vote. 60 declared they voted for
candidate A and 40 for candidate B. Is this a significant difference?
1. Model the data: Let 𝑥 the number of vote for A. 𝑥 follows a binomial distribution. Compute
model parameters: 𝑛 = 100, 𝑝ˆ = 60/100. Where 𝑝ˆ is the observed proportion of A.
2. Compute a test statistic that measure the deviation of 𝑥 = 60 (vote for A) from the expected
value: 𝑛𝜋0 = 50 under the null hypothesis, i.e., where 𝜋0 = 50%. The distribution of the number
of vote for A (𝑥) follow the Binomial distribution of parameters 𝑁 = 100, 𝑃 = 0.5 approximated
with normal distribution when 𝑛 is large enough.
For large sample, the most usual (and easiest) approximation is through the standard normal
distribution, in which a z-test is performed of the test statistic 𝑍, given by:
𝑥 − 𝑛𝜋0
𝑍 = √︀
𝑛𝑝0 (1 − 𝜋0 )
one may rearrange and write the z-test above as deviation of 𝑝ˆ from 𝜋0 = 50%
𝑝ˆ − 𝜋0 √
𝑍 = √︀ 𝑛
𝜋0 (1 − 𝜋0 )
Large statistic is obtained with large deviation with large sample size.
5. Inference
Compute the p-value using Scipy to compute the CDF of the binomial distribution:
n = 100
z = (0.6 - 0.5) / (0.5 * (1 - 0.5)) * np.sqrt(n)
#z = (60 - n * 0.5 + 1/2) / (n * 0.5 * (1 - 0.5)) * np.sqrt(n)
scipy.stats.norm.sf(z, loc=0) * 2
np.float64(6.334248366623996e-05)
Plot of the binomial distribution and the probability to observe more than 60 vote for A by
chance:
pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals, stat_probs,
stat_obs=60, stat_h0=50,
thresh_low=40, thresh_high=60)
The one sample t-test is used to determine whether a sample comes from a population with a
specific mean. For example you want to test if the average height of a population is 1.75 𝑚.
This test is used when we have two measurements for each individual at two different times or
under two conditions: for each individual, we calculate the difference between the two condi-
tions and test the positivity (increase) or negativity (decrease) of the mean of the differences.
Example: is the arterial hypertension of 50 patients measured before and after some medication
has been reduced by the treatment?
Example: Monthly revenue figures of for 100 stores before and after a marketing campaigns.
We compute after − 𝑥before ) for each store 𝑖. If the average difference
∑︀ the difference (𝑥𝑖 = 𝑥𝑖 𝑖
¯ = 1/𝑛 𝑖 𝑥𝑖 is significantly positive (resp. negative), then the marketing campaigns will be
𝑥
considered as efficient (resp. detrimental).
3. Inference
P-value (null hypothesis) is computed using Scipy to compute the CDF of the student distribu-
tion.
pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals, stat_probs,
stat_obs=tval, stat_h0=0,
thresh_low=-tval, thresh_high=tval)
print("Estimate: {:.2f}, t-val: {:.2f}, p-val: {:e}, df: {}, CI: [{:.5f}, {:.5f}]
(continues on next page)
Estimate: 2.26, t-val: 2.36, p-val: 2.500022e-02, df: 29, CI: [0.30515, 4.22272]
print("Estimate: {:.2f}, t-val: {:.2f}, p-val: {:e}, df: {}, CI: [{:.5f}, {:.5f}]
˓→".\
Estimate: 2.26, t-val: 2.36, p-val: 2.500022e-02, df: 29, CI: [0.30515, 4.22272]
Test the correlation coefficient of two quantitative variables. The test calculates a Pearson cor-
relation coefficient and the 𝑝-value for testing non-correlation.
Let 𝑥 and 𝑦 two quantitative variables, where 𝑛 samples were obeserved. The linear correlation
coeficient is defined as :
∑︀𝑛
(𝑥𝑖 − 𝑥¯)(𝑦𝑖 − 𝑦¯)
𝑟 = √︀∑︀𝑛 𝑖=1 √︀∑︀𝑛 .
𝑖=1 (𝑥𝑖 − 𝑥¯)2 ¯)2
𝑖=1 (𝑦𝑖 − 𝑦
√ 𝑟
Under 𝐻0 , the test statistic 𝑡 = 𝑛 − 2 √1−𝑟 2
follow Student distribution with 𝑛 − 2 degrees of
freedom.
n = 50
x = np.random.normal(size=n)
y = 2 * x + np.random.normal(size=n)
0.8838265556020786 1.8786054559764617e-17
The two-sample 𝑡-test (Snedecor and Cochran, 1989) is used to determine if two population
means are equal. There are several variations on this test. If data are paired (e.g. 2 measures,
before and after treatment for each individual) use the one-sample 𝑡-test of the difference. The
variances of the two samples may be assumed to be equal (a.k.a. homoscedasticity) or unequal
(a.k.a. heteroscedasticity).
1. Model the data
Assumptions:
• Independence of residuals (𝜀𝑖 ). This assumptions must be satisfied.
• Normality of residuals. Approximately normally distributed can be accepted.
• Homosedasticity use T-test, Heterosedasticity use Welch t-test.
Assume that the two random variables are normally distributed: 𝑦1 ∼ 𝒩 (𝜇1 , 𝜎1 ), 𝑦2 ∼
𝒩 (𝜇2 , 𝜎2 ).
Fit: estimate the model parameters, means and variances: 𝑦¯1 , 𝑠2𝑦1 , 𝑦¯2 , 𝑠2𝑦2 .
2. t-test
The general principle is
𝑠2𝑦1 𝑠2𝑦
𝑠2𝑦¯1 −𝑦¯2 = 𝑠2𝑦¯1 + 𝑠2𝑦¯2 = + 2 (5.11)
𝑛1 𝑛2
thus (5.12)
√︃
𝑠2𝑦1 𝑠2𝑦
𝑠𝑦¯1 −𝑦¯2 = + 2 (5.13)
𝑛1 𝑛2
To compute the 𝑝-value one needs the degrees of freedom associated with this variance estimate.
It is approximated using the Welch–Satterthwaite equation:
(︂ 2 )︂2
𝑠𝑦1 𝑠2𝑦2
𝑛1 + 𝑛2
𝜈≈ 𝑠4𝑦1 𝑠4𝑦2
.
𝑛2 (𝑛 −1)
+ 𝑛2 (𝑛 −1)
1 1 2 2
If we assume equal variance (ie, 𝑠2𝑦1 = 𝑠2𝑦1 = 𝑠2 ), where 𝑠2 is an estimator of the common
variance of the two samples:
then
√︃ √︂
𝑠2 𝑠2 1 1
𝑠𝑦¯1 −𝑦¯2 = + =𝑠 +
𝑛1 𝑛2 𝑛1 𝑛2
Therefore, the 𝑡 statistic, that is used to test whether the means are different is:
𝑦¯ − 𝑦¯2
𝑡= √︁1 ,
𝑠 · 𝑛11 + 1
𝑛2
𝑦¯1 − 𝑦¯2 √
𝑡= · 𝑛 (5.16)
𝑠
difference of means √ √
≈ · 𝑛= ≈ effect size · 𝑛 (5.17)
standard deviation of the noise
Example
Given the following two samples, test whether their means are equal using the standard t-test,
assuming equal variance.
height = np.array([1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87,
1.66, 1.71, 1.73, 1.64, 1.70, 1.60, 1.79, 1.73, 1.62, 1.77])
ANOVA F-test: Quantitative as a function of Categorical Factor with Three Levels or More
Analysis of variance (ANOVA) provides a statistical test of whether or not the means of several
(k) groups are equal, and therefore generalizes the 𝑡-test to more than two groups. ANOVAs
are useful for comparing (testing) three or more means (groups or variables) for statistical
significance. It is conceptually similar to multiple two-sample 𝑡-tests, but is less conservative.
Here we will consider one-way ANOVA with one independent variable, ie one-way anova.
Wikipedia:
• Test if any group is on average superior, or inferior, to the others versus the null hypothesis
that all four strategies yield the same mean response
• Detect any of several possible differences.
• The advantage of the ANOVA 𝐹 -test is that we do not need to pre-specify which strategies
are to be compared, and we do not need to adjust for making multiple comparisons.
• The disadvantage of the ANOVA 𝐹 -test is that if we reject the null hypothesis, we do not
know which strategies can be said to be significantly different from the others.
1. Model the data
Assumptions
• The samples are randomly selected in an independent manner from the k populations.
• All k populations have distributions that are approximately normal. Check by plotting
groups distribution.
• The k population variances are equal. Check by plotting groups distribution.
The question is: Is there a difference in Petal Width in species from iris dataset? Let 𝑦1 , 𝑦2 and
𝑦3 be Petal Width in three species.
Here we assume (see assumptions) that the three populations were sampled from three random
variables that are normally distributed. I.e., 𝑌1 ∼ 𝑁 (𝜇1 , 𝜎1 ), 𝑌2 ∼ 𝑁 (𝜇2 , 𝜎2 ) and 𝑌3 ∼ 𝑁 (𝜇3 , 𝜎3 ).
2. Fit: estimate the model parameters
Estimate means and variances: 𝑦¯𝑖 , 𝜎𝑖 , ∀𝑖 ∈ {1, 2, 3}.
3. :math:`F`-test
The formula for the one-way ANOVA F-test statistic is
Explained variance
𝐹 = (5.18)
Unexplained variance
Between-group variability 𝑠2
= = 2𝐵 . (5.19)
Within-group variability 𝑠𝑊
where 𝑦¯𝑖· denotes the sample mean in the 𝑖th group, 𝑛𝑖 is the number of observations in the 𝑖th
group, 𝑦¯ denotes the overall mean of the data, and 𝐾 denotes the number of groups.
The “unexplained variance”, or “within-group variability” is
∑︁
𝑠2𝑊 = (𝑦𝑖𝑗 − 𝑦¯𝑖· )2 /(𝑁 − 𝐾),
𝑖𝑗
where 𝑦𝑖𝑗 is the 𝑗th observation in the 𝑖th out of 𝐾 groups and 𝑁 is the overall sample size.
This 𝐹 -statistic follows the 𝐹 -distribution with 𝐾 − 1 and 𝑁 − 𝐾 degrees of freedom under the
null hypothesis. The statistic will be large if the between-group variability is large relative to
the within-group variability, which is unlikely to happen if the population means of the groups
all have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, 𝐹 = 𝑡2 where 𝑡 is the
Student’s 𝑡 statistic.
# Group means
means = iris.groupby("Species").mean().reset_index()
print(means)
# Plot groups
ax = sns.violinplot(x="Species", y="SepalLength", data=iris)
ax = sns.swarmplot(x="Species", y="SepalLength", data=iris,
color="white")
ax = sns.swarmplot(x="Species", y="SepalLength", color="black",
data=means, size=10)
# ANOVA
lm = smf.ols('SepalLength ~ Species', data=iris).fit()
sm.stats.anova_lm(lm, typ=2) # Type 2 ANOVA DataFrame
Computes the chi-square, 𝜒2 , statistic and 𝑝-value for the hypothesis test of independence of
frequencies in the observed contingency table (cross-table). The observed frequencies are tested
against an expected contingency table obtained by computing expected frequencies based on
the marginal sums under the assumption of independence.
Example
# Dataset:
# 15 samples:
# 10 first exposed
exposed = np.array([1] * 10 + [0] * 10)
# 8 first with cancer, 10 without, the last two with.
cancer = np.array([1] * 8 + [0] * 10 + [1] * 2)
Observed table:
---------------
cancer 0 1
exposed
0 8 2
1 2 8
Statistics:
-----------
Chi2 = 5.000000, pval = 0.025347
Expected table:
---------------
[[5. 5.]
[5. 5.]]
cancer_marg = crosstab.sum(axis=1)
cancer_freq = cancer_marg / cancer_marg.sum()
print('Expected frequencies:')
print(np.outer(exposed_freq, cancer_freq))
np.random.seed(3)
sns.regplot(x=age, y=sbp)
# Non-Parametric Spearman
cor, pval = scipy.stats.spearmanr(age, sbp)
print("Non-Parametric Spearman cor test, cor: %.4f, pval: %.4f" % (cor, pval))
Wikipedia: The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used
when comparing two related samples, matched samples, or repeated measurements on a single
sample to assess whether their population mean ranks differ (i.e. it is a paired difference test).
It is equivalent to one-sample test of the difference of paired samples.
It can be used as an alternative to the paired Student’s 𝑡-test, 𝑡-test for matched pairs, or the 𝑡-
test for dependent samples when the population cannot be assumed to be normally distributed.
When to use it? Observe the data distribution: - presence of outliers - the distribution of the
residuals is not Gaussian
It has a lower sensitivity compared to 𝑡-test. May be problematic to use when the sample size is
small.
Null hypothesis 𝐻0 : difference between the pairs follows a symmetric distribution around zero.
n = 20
# Buisness Volume time 0
bv0 = np.random.normal(loc=3, scale=.1, size=n)
(continues on next page)
# create an outlier
bv1[0] -= 10
# Paired t-test
print(scipy.stats.ttest_rel(bv0, bv1))
# Wilcoxon
print(scipy.stats.wilcoxon(bv0, bv1))
TtestResult(statistic=np.float64(0.7766377807752968), pvalue=np.float64(0.
˓→44693401731548044), df=np.int64(19))
WilcoxonResult(statistic=np.float64(23.0), pvalue=np.float64(0.
˓→001209259033203125))
n = 20
# Buismess Volume group 0
bv0 = np.random.normal(loc=1, scale=.1, size=n)
# create an outlier
bv1[0] -= 10
# Two-samples t-test
print(scipy.stats.ttest_ind(bv0, bv1))
# Wilcoxon
print(scipy.stats.mannwhitneyu(bv0, bv1))
TtestResult(statistic=np.float64(0.6104564820307219), pvalue=np.float64(0.
˓→5451934484051324), df=np.float64(38.0))
MannwhitneyuResult(statistic=np.float64(41.0), pvalue=np.float64(1.
˓→8074477738835562e-05))
Given 𝑛 random samples (𝑦𝑖 , 𝑥1𝑖 , . . . , 𝑥𝑝𝑖 ), 𝑖 = 1, . . . , 𝑛, the linear regression models the relation
between the observations 𝑦𝑖 and the independent variables 𝑥𝑝𝑖 is formulated as
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + · · · + 𝛽𝑝 𝑥𝑝𝑖 + 𝜀𝑖 𝑖 = 1, . . . , 𝑛
• The 𝛽’s are the model parameters, ie, the regression coeficients.
• 𝛽0 is the intercept or the bias.
• 𝜀𝑖 are the residuals.
• An independent variable (IV). It is a variable that stands alone and isn’t changed by
the other variables you are trying to measure. For example, someone’s age might be an
independent variable. Other factors (such as what they eat, how much they go to school,
how much television they watch) aren’t going to change a person’s age. In fact, when
you are looking for some kind of relationship between variables you are trying to see if
the independent variable causes some kind of change in the other variables, or dependent
variables. In Machine Learning, these variables are also called the predictors.
• A dependent variable. It is something that depends on other factors. For example, a test
score could be a dependent variable because it could change depending on several factors
such as how much you studied, how much sleep you got the night before you took the
test, or even how hungry you were when you took it. Usually when you are looking for
a relationship between two things you are trying to find out what makes the dependent
variable change the way it does. In Machine Learning this variable is called a target
variable.
Assumptions
Using the dataset “salary”, explore the association between the dependant variable (e.g. Salary)
and the independent variable (e.g.: Experience is quantitative), considering only non-managers.
df = salary[salary.management == 'N']
salary𝑖 = 𝛽0 + 𝛽 experience𝑖 + 𝜖𝑖 ,
more generally
𝑦𝑖 = 𝛽0 + 𝛽 𝑥𝑖 + 𝜖𝑖
This can be rewritten in the matrix form using the design matrix made of values of independant
variable and the intercept:
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
𝑦1 1 𝑥1 𝜖1
⎢𝑦2 ⎥ ⎢1 𝑥2 ⎥ [︂ ]︂ ⎢𝜖2 ⎥
⎢𝑦3 ⎥ = ⎢1 𝑥3 ⎥ 𝛽0 + ⎢𝜖3 ⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥ 𝛽1 ⎢ ⎥
⎣𝑦4 ⎦ ⎣1 𝑥4 ⎦ ⎣𝜖4 ⎦
𝑦5 1 𝑥5 𝜖5
Recall from calculus that an extreme point can be found by computing where the derivative is
zero, i.e. to find the intercept, we perform the steps:
𝜕𝑆𝑆𝐸 ∑︁
= (𝑦𝑖 − 𝛽 𝑥𝑖 − 𝛽0 ) = 0
𝜕𝛽0
𝑖
∑︁ ∑︁
𝑦𝑖 = 𝛽 𝑥𝑖 + 𝑛 𝛽0
𝑖 𝑖
𝑛 𝑦¯ = 𝑛 𝛽 𝑥
¯ + 𝑛 𝛽0
𝛽0 = 𝑦¯ − 𝛽 𝑥
¯
Plug in 𝛽0 :
∑︁
𝑥𝑖 (𝑦𝑖 − 𝛽 𝑥𝑖 − 𝑦¯ + 𝛽 𝑥
¯) = 0
𝑖
∑︁ ∑︁ ∑︁
𝑥𝑖 𝑦𝑖 − 𝑦¯ 𝑥𝑖 = 𝛽 (𝑥𝑖 − 𝑥
¯)
𝑖 𝑖 𝑖
y, x = df.salary, df.experience
beta, beta0, r_value, p_value, std_err = scipy.stats.linregress(x,y)
print("y = %f x + %f, r: %f, r-squared: %f,\np-value: %f, std_err: %f"
% (beta, beta0, r_value, r_value**2, p_value, std_err))
Multiple Regression
Theory
Multiple Linear Regression is the most basic supervised learning algorithm.
Given: a set of training data {𝑥1 , ..., 𝑥𝑁 } with corresponding targets {𝑦1 , ..., 𝑦𝑁 }.
In linear regression, we assume that the model that generates the data involves only a linear
combination of the input variables, i.e.
or, simplified
𝑃 −1
𝛽𝑗 𝑥𝑗𝑖 + 𝜀𝑖 .
∑︁
𝑦𝑖 = 𝛽0 +
𝑗=1
Extending each sample with an intercept, 𝑥𝑖 := [1, 𝑥𝑖 ] ∈ 𝑅𝑃 +1 allows us to use a more general
notation based on linear algebra and write it as a simple dot product:
𝑦𝑖 = x𝑇𝑖 𝛽 + 𝜀𝑖 ,
where 𝛽 ∈ 𝑅𝑃 +1 is a vector of weights that define the 𝑃 + 1 parameters of the model. From
now we have 𝑃 regressors + the intercept.
Using the matrix notation:
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
𝑦1 1 𝑥11 ... 𝑥1𝑃 ⎡ 𝛽 ⎤ 𝜖1
0
⎢𝑦2 ⎥ ⎢1 𝑥21 ... 𝑥2𝑃 ⎥ ⎢ 𝛽 ⎥ ⎢𝜖2 ⎥
⎥ ⎢
⎢ ⎥ ⎢
⎢𝑦3 ⎥ = ⎢1 ⎢ 1⎥ ⎢ ⎥
𝑥31 ... 𝑥3𝑃 ⎥
⎥⎢ .. ⎥ + ⎢𝜖3 ⎥
𝑥4𝑃 ⎦ .
⎢ ⎥ ⎢ ⎦ ⎣ ⎥
⎣𝑦4 ⎦ ⎣1 𝑥41 ...
⎣
𝜖4 ⎦
𝑦5 1 𝑥5 ... 𝑥5 𝛽 𝑃 𝜖5
Let 𝑋 = [𝑥𝑇0 , ..., 𝑥𝑇𝑁 ] be the (𝑁 × 𝑃 + 1) design matrix of 𝑁 samples of 𝑃 input features with
one column of one and let be 𝑦 = [𝑦1 , ..., 𝑦𝑁 ] be a vector of the 𝑁 targets.
𝑦 = 𝑋𝛽 + 𝜀
Using the matrix notation, the mean squared error (MSE) loss can be rewritten:
1
𝑀 𝑆𝐸(𝛽) = ||𝑦 − 𝑋𝛽||22 .
𝑁
The 𝛽 that minimizes the MSE can be found by:
(︂ )︂
1
∇𝛽 ||𝑦 − 𝑋𝛽||22 =0 (5.20)
𝑁
1
∇𝛽 (𝑦 − 𝑋𝛽)𝑇 (𝑦 − 𝑋𝛽) = 0 (5.21)
𝑁
1
∇𝛽 (𝑦 𝑇 𝑦 − 2𝛽 𝑇 𝑋 𝑇 𝑦 + 𝛽 𝑇 𝑋 𝑇 𝑋𝛽) = 0 (5.22)
𝑁
−2𝑋 𝑇 𝑦 + 2𝑋 𝑇 𝑋𝛽 = 0 (5.23)
𝑋 𝑇 𝑋𝛽 = 𝑋 𝑇 𝑦 (5.24)
𝑇 −1 𝑇
𝛽 = (𝑋 𝑋) 𝑋 𝑦, (5.25)
# Dataset
N, P = 50, 4
X = np.random.normal(size= N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])
Xpinv = linalg.pinv(X)
betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)
Estimated beta:
[10.14742501 0.57938106 0.51654653 0.17862194]
Statmodels examples
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
df = pd.DataFrame(np.column_stack([X, y]),
columns=['inter', 'x1','x2', 'x3', 'y'])
print(df.columns, df.shape)
# Build a model excluding the intercept, it is implicit
model = smf.ols("y~x1 + x2 + x3", df).fit()
print(model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Analysis of covariance (ANCOVA) is a linear model that blends ANOVA and linear regression.
ANCOVA evaluates whether population means of a dependent variable (DV) are equal across
levels of a categorical independent variable (IV) often called a treatment, while statistically
controlling for the effects of other quantitative or continuous variables that are not of primary
interest, known as covariates (CV).
df = salary.copy()
Normality assumption of the residuals can be rejected (p-value < 0.05). There is an efect of the
“management” factor, take it into account.
One-way AN(C)OVA
sum_sq df F PR(>F)
management 5.755739e+08 1.0 183.593466 4.054116e-17
experience 3.334992e+08 1.0 106.377768 3.349662e-13
Residual 1.348070e+08 43.0 NaN NaN
Jarque-Bera normality test p-value 0.004
Distribution of residuals is still not normal but closer to normality. Both management and
experience are significantly associated with salary.
Two-way AN(C)OVA
df["residuals"] = twoway.resid
sns.displot(df, x='residuals', kind="kde", fill=True,
aspect=1, height=fig_h*0.7)
print(sm.stats.anova_lm(twoway, typ=2))
sum_sq df F PR(>F)
education 9.152624e+07 2.0 43.351589 7.672450e-11
management 5.075724e+08 1.0 480.825394 2.901444e-24
experience 3.380979e+08 1.0 320.281524 5.546313e-21
Residual 4.328072e+07 41.0 NaN NaN
Jarque-Bera normality test p-value 0.506
Normality assumtion cannot be rejected. Assume it. Education, management and experience
are significantly associated with salary.
oneway is nested within twoway. Comparing two nested models tells us if the additional predic-
tors (i.e. education) of the full model significantly decrease the residuals. Such comparison can
be done using an 𝐹 -test on residuals:
Factor Coding
print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, :])
[[1. 0. 0. 1. 1.]
[1. 0. 1. 0. 1.]
[1. 0. 1. 1. 1.]
[1. 1. 0. 0. 1.]
[1. 0. 1. 0. 1.]
[1. 1. 0. 1. 2.]
[1. 1. 0. 0. 2.]
(continues on next page)
# Dataset
n_samples, n_features = 100, 1000
n_info = int(n_features/10) # number of features with information
n1, n2 = int(n_samples/2), n_samples - int(n_samples/2)
snr = .5
Y = np.random.randn(n_samples, n_features)
grp = np.array(["g1"] * n1 + ["g2"] * n2)
axis[2].hist([pvals[n_info:], pvals[:n_info]],
stacked=True, bins=100, label=["Negatives", "Positives"])
axis[2].set_xlabel("p-value histogram")
axis[2].set_ylabel("density")
axis[2].legend()
plt.tight_layout()
Note that under the null hypothesis the distribution of the p-values is uniform.
Statistical measures:
• True Positive (TP) equivalent to a hit. The test correctly concludes the presence of an
effect.
• True Negative (TN). The test correctly concludes the absence of an effect.
• False Positive (FP) equivalent to a false alarm, Type I error. The test improperly con-
cludes the presence of an effect. Thresholding at 𝑝-value < 0.05 leads to 47 FP.
• False Negative (FN) equivalent to a miss, Type II error. The test improperly concludes the
absence of an effect.
The Bonferroni correction is based on the idea that if an experimenter is testing 𝑃 hypothe-
ses, then one way of maintaining the Family-wise error rate FWER is to test each individual
hypothesis at a statistical significance level of 1/𝑃 times the desired maximum overall level.
So, if the desired significance level for the whole family of tests is 𝛼 (usually 0.05), then the
Bonferroni correction would test each individual hypothesis at a significance level of 𝛼/𝑃 . For
example, if a trial is testing 𝑃 = 8 hypotheses with a desired 𝛼 = 0.05, then the Bonferroni
correction would test each individual hypothesis at 𝛼 = 0.05/8 = 0.00625.
FDR-controlling procedures are designed to control the expected proportion of rejected null
hypotheses that were incorrect rejections (“false discoveries”). FDR-controlling procedures pro-
vide less stringent control of Type I errors compared to the familywise error rate (FWER) con-
trolling procedures (such as the Bonferroni correction), which control the probability of at least
one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased
rates of Type I errors.
The study provides the brain volumes of grey matter (gm), white matter (wm) and cerebrospinal
fluid) (csf) of 808 anatomical MRI scans.
import os
import tempfile
import urllib.request
import pandas as pd
# Stat
import statsmodels.formula.api as smfrmla
import statsmodels.api as sm
import scipy.stats
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
WD = os.path.join(tempfile.gettempdir(), "brainvol")
os.makedirs(WD, exist_ok=True)
#os.chdir(WD)
Fetch data
• Demographic data demo.csv (columns: participant_id, site, group, age, sex) and tissue
volume data: group is Control or Patient. site is the recruiting site.
• Gray matter volume gm.csv (columns: participant_id, session, gm_vol)
• White matter volume wm.csv (columns: participant_id, session, wm_vol)
• Cerebrospinal Fluid csf.csv (columns: participant_id, session, csf_vol)
base_url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/brain_
˓→volumes/%s'
data = dict()
for file in ["demo.csv", "gm.csv", "wm.csv", "csf.csv"]:
urllib.request.urlretrieve(base_url % file, os.path.join(WD, "data", file))
brain_vol = brain_vol.dropna()
assert brain_vol.shape == (766, 9)
brain_vol["tiv_vol"] = brain_vol["gm_vol"] + \
brain_vol["wm_vol"] + brain_vol["csf_vol"]
Descriptive statistics Most of participants have several MRI sessions (column session) Select
on rows from session one “ses-01”
desc_glob_num = brain_vol1.describe()
print(desc_glob_num)
gm_vol
count mean std min 25% 50% 75% max
group
Control 86.00 0.72 0.09 0.48 0.66 0.71 0.78 1.03
Patient 157.00 0.70 0.08 0.53 0.65 0.70 0.76 0.90
5.2.3 Statistics
Objectives:
1. Site effect of gray matter atrophy
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
3. Test for differences of atrophy between the patients and the controls
4. Test for interaction between age and clinical status, ie: is the brain atrophy process in
patient population faster than in the control population.
5. The effect of the medication in the patient population.
1 Site effect on Grey Matter atrophy
The model is Oneway Anova gm_f ~ site The ANOVA test has important assumptions that must
be satisfied in order for the associated p-value to be valid.
• The samples are independent.
• Each sample is from a normally distributed population.
• The population standard deviations of the groups are all equal. This property is known as
homoscedasticity.
Plot
brain_vol1.groupby('site')['age'].describe()
print(sm.stats.anova_lm(anova, typ=2))
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
Plot
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Before testing for differences of atrophy between the patients ans the controls Preliminary tests
for age x group effect (patients would be older or younger than Controls)
Plot
print(scipy.stats.ttest_ind(brain_vol1_ctl.age, brain_vol1_pat.age))
TtestResult(statistic=np.float64(-1.2155557697674162), pvalue=np.float64(0.
˓→225343592508479), df=np.float64(241.0))
----------------------------------------------------------------------------------
˓→--
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Preliminary tests for sex x group (more/less males in patients than in Controls)
3. Test for differences of atrophy between the patients and the controls
sum_sq df F PR(>F)
group 0.00 1.00 0.01 0.92
Residual 0.46 241.00 NaN NaN
No significant difference in atrophy between patients and controls
print(sm.stats.anova_lm(smfrmla.ols(
"gm_f ~ group + age + site", data=brain_vol1).fit(), typ=2))
print("No significant difference in GM between patients and controls")
sum_sq df F PR(>F)
group 0.00 1.00 1.82 0.18
site 0.11 5.00 19.79 0.00
age 0.09 1.00 86.86 0.00
Residual 0.25 235.00 NaN NaN
No significant difference in GM between patients and controls
print("%.3f%% of grey matter loss per year (almost %.1f%% per decade)" %
(ancova.params.age * 100, ancova.params.age * 100 * 10))
sum_sq df F PR(>F)
site 0.11 5.00 20.28 0.00
age 0.10 1.00 89.37 0.00
group:age 0.00 1.00 3.28 0.07
Residual 0.25 235.00 NaN NaN
= Parameters =
Intercept 0.52
site[T.S3] 0.01
site[T.S4] 0.03
site[T.S5] 0.01
site[T.S7] 0.06
site[T.S8] 0.02
age -0.00
group[T.Patient]:age -0.00
dtype: float64
-0.148% of grey matter loss per year (almost -1.5% per decade)
grey matter loss in patients is accelerated by -0.232% per decade
Acknowledgements: Firstly, it’s right to pay thanks to the blogs and sources I have used in
writing this tutorial. Many parts of the text are quoted from the brillant book from Brady T.
West, Kathleen B. Welch and Andrzej T. Galecki, see [Brady et al. 2014] in the references section
below.
5.3.1 Introduction
Quoted from [Brady et al. 2014]:A linear mixed model (LMM) is a parametric linear model for
clustered, longitudinal, or repeated-measures data that quantifies the relationships between
a continuous dependent variable and various predictor variables. An LMM may include both
fixed-effect parameters associated with one or more continuous or categorical covariates and
random effects associated with one or more random factors. The mix of fixed and random
effects gives the linear mixed model its name. Whereas fixed-effect parameters describe the re-
lationships of the covariates to the dependent variable for an entire population, random effects
are specific to clusters or subjects within a population. LMM is closely related with hierarchical
linear model (HLM).
Clustered/structured datasets
Quoted from [Bruin 2006]: Random effects, are used when there is non independence in the
data, such as arises from a hierarchical structure with clustered data. For example, students
could be sampled from within classrooms, or patients from within doctors. When there are
multiple levels, such as patients seen by the same doctor, the variability in the outcome can be
thought of as being either within group or between group. Patient level observations are not
independent, as within a given doctor patients are more similar. Units sampled at the highest
level (in our example, doctors) are independent.
The continuous outcome variables is structured or clustered into units within observations
are not independents. Types of clustered data:
1. studies with clustered data, such as students in classrooms, or experimental designs with
random blocks, such as batches of raw material for an industrial process
2. longitudinal or repeated-measures studies, in which subjects are measured repeatedly
over time or under different conditions.
Fixed effects may be associated with continuous covariates, such as weight, baseline test score,
or socioeconomic status, which take on values from a continuous (or sometimes a multivalued
ordinal) range, or with factors, such as gender or treatment group, which are categorical. Fixed
effects are unknown constant parameters associated with either continuous covariates or the
levels of categorical factors in an LMM. Estimation of these parameters in LMMs is generally of
intrinsic interest, because they indicate the relationships of the covariates with the continuous
outcome variable.
Example: Suppose we want to study the relationship between the height of individuals and their
gender. We will: sample individuals in a population (first source of randomness), measure their
height (second source of randomness), and consider their gender (fixed for a given individual).
Finally, these measures are modeled in the following linear model:
height𝑖 = 𝛽0 + 𝛽1 gender𝑖 + 𝜀𝑖
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
(continues on next page)
df = pd.read_csv('datasets/score_parentedu_byclass.csv')
print(df.head())
_ = sns.scatterplot(x="edu", y="score", hue="classroom", data=df)
Global effect regresses the the independant variable 𝑦 = score on the dependant variable 𝑥 =
edu without considering the any classroom effect. For each individual 𝑖 the model is:
where, 𝛽0 is the global intercept, 𝛽1 is the slope associated with edu and 𝜀𝑖𝑗 is the random error
at the individual level. Note that the classeroom, 𝑗 index is not taken into account by the model
and could be removed from the equation.
The general R formula is: y ~ x which in this case is score ~ edu. This model is:
• Not sensitive since it does not model the classroom effect (high standard error).
• Wrong because, residuals are not normals, and it considers samples from the same class-
room to be indenpendant.
#print(lm_glob.summary())
print(lm_glob.t_test('edu'))
print("MSE=%.3f" % lm_glob.mse_resid)
results.loc[len(results)] = ["LM-Global (biased)"] +\
list(rmse_coef_tstat_pval(mod=lm_glob, var='edu'))
Plot
Model diagnosis: plot the normality of the residuals and residuals vs prediction.
plot_lm_diagnosis(residual=lm_glob.resid,
prediction=lm_glob.predict(df), group=df.classroom)
Remember ANCOVA = ANOVA with covariates. Model the classroom 𝑧 = classroom (as a fixed
effect), ie a vertical shift for each classroom. The slope is the same for all classrooms. For each
individual 𝑖 and each classroom 𝑗 the model is:
where, 𝑢𝑗 is the coefficient (an intercept, or a shift) associated with classroom 𝑗 and 𝑧𝑖𝑗 = 1 if
subject 𝑖 belongs to classroom 𝑗 else 𝑧𝑖𝑗 = 0.
The general R formula is: y ~ x + z which in this case is score ~ edu + classroom.
This model is:
• Sensitive since it does not model the classroom effect (lower standard error). But,
• questionable because it considers the classroom to have a fixed constant effect without
any uncertainty. However, those classrooms have been sampled from a larger samples of
classrooms within the country.
print("MSE=%.3f" % ancova_inter.mse_resid)
results.loc[len(results)] = ["ANCOVA-Inter (biased)"] +\
list(rmse_coef_tstat_pval(mod=ancova_inter, var='edu'))
Plot
plot_ancova_oneslope_grpintercept(x="edu", y="score",
group="classroom", model=ancova_inter, df=df)
mod = ancova_inter
print("## Statistics:")
print(mod.tvalues, mod.pvalues)
plot_lm_diagnosis(residual=ancova_inter.resid,
prediction=ancova_inter.predict(df), group=df.classroom)
Fixed effect is the coeficient or parameter (𝛽1 in the model) that is associated with a continuous
covariates (age, education level, etc.) or (categorical) factor (sex, etc.) that is known without
uncertainty once a subject is sampled.
Random effect, in contrast, is the coeficient or parameter (𝑢𝑗 in the model below) that is as-
sociated with a continuous covariates or factor (classroom, individual, etc.) that is not known
without uncertainty once a subject is sampled. It generally conrespond to some random sam-
pling. Here the classroom effect depends on the teacher which has been sampled from a larger
samples of classrooms within the country. Measures are structured by units or a clustering
structure that is possibly hierarchical. Measures within units are not independant. Measures
between top level units are independant.
There are multiple ways to deal with structured data with random effect. One simple approach
is to aggregate.
Aggregation of measure at classroom level: average all values within classrooms to perform
statistical analysis between classroom. 1. Level 1 (within unit): Average by classrom:
𝑦 𝑗 = 𝛽 0 + 𝛽 1 𝑥 𝑗 + 𝜀𝑗
agregate = df.groupby('classroom').mean()
lm_agregate = smf.ols('score ~ edu', agregate).fit()
#print(lm_agregate.summary())
print(lm_agregate.t_test('edu'))
print("MSE=%.3f" % lm_agregate.mse_resid)
results.loc[len(results)] = ["Aggregation"] +\
list(rmse_coef_tstat_pval(mod=lm_agregate, var='edu'))
Plot
agregate = agregate.reset_index()
fig, axes = plt.subplots(1, 2, figsize=(9, 3), sharex=True, sharey=True)
sns.scatterplot(x='edu', y='score', hue='classroom',
data=df, ax=axes[0], s=20, legend=False)
sns.scatterplot(x='edu', y='score', hue='classroom',
data=agregate, ax=axes[0], s=150)
axes[0].set_title("Level 1: Average within classroom")
Hierarchical/multilevel modeling
Another approach to hierarchical data is analyzing data from one unit at a time. Thus, we
run three separate linear regressions - one for each classroom in the sample leading to three
estimated parameters of the score vs edu association. Then the paramteres are tested across
the classrooms:
1. Run three separate linear regressions - one for each classroom
The general R formula is: y ~ x which in this case is score ~ edu within classrooms.
2. Test across the classrooms if is the mean𝑗 (𝛽1𝑗 ) = 𝛽0 ̸= 0 :
𝛽1𝑗 = 𝛽0 + 𝜀𝑗
• sensitive since it allows to model differents slope for each classroom (see fixed interaction
or random slope below). But it is but not optimally designed since there are many
models, and each one does not take advantage of the information in data from other
classroom. This can also make the results “noisy” in that the estimates from each model
are not based on very much data
results.loc[len(results)] = ["Hierarchical"] + \
list(rmse_coef_tstat_pval(mod=lm_hm, var='Intercept'))
classroom beta
0 c0 0.129084
1 c1 0.177567
2 c2 0.055772
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.1208 0.035 3.412 0.076 -0.032 0.273
==============================================================================
MSE=0.004
Plot
Linear mixed models (also called multilevel models) can be thought of as a trade off between
these two alternatives. The individual regressions has many estimates and lots of data, but is
noisy. The aggregate is less noisy, but may lose important differences by averaging all samples
within each classroom. LMMs are somewhere in between.
Model the classroom 𝑧 = classroom (as a random effect). For each individual 𝑖 and each
classroom 𝑗 the model is:
results.loc[len(results)] = ["LMM-Inter"] + \
list(rmse_coef_tstat_pval(mod=lmm_inter, var='edu'))
Explore model
print("Fixed effect:")
print(lmm_inter.params)
print("Random effect:")
print(lmm_inter.random_effects)
intercept = lmm_inter.params['Intercept']
var = lmm_inter.params["Group Var"]
Fixed effect:
Intercept 9.865327
edu 0.131193
Group Var 10.844222
dtype: float64
Random effect:
{'c0': Group -2.889009
dtype: float64, 'c1': Group -0.323129
dtype: float64, 'c2': Group 3.212138
dtype: float64}
Plot
plot_lmm_oneslope_randintercept(x='edu', y='score',
group='classroom', df=df, model=lmm_inter)
/home/ed203246/git/pystatsml/statistics/lmm/stat_lmm_utils.py:122: FutureWarning:␣
˓→Series.__getitem__ treating keys as positions is deprecated. In a future␣
group_offset = model.random_effects[group_lab][0]
/home/ed203246/git/pystatsml/statistics/lmm/stat_lmm_utils.py:122: FutureWarning:␣
˓→Series.__getitem__ treating keys as positions is deprecated. In a future␣
group_offset = model.random_effects[group_lab][0]
Now suppose that the classroom random effect is not just a vertical shift (random intercept)
but that some teachers “compensate” or “amplify” educational disparity. The slope of the linear
relation between score and edu for teachers that amplify will be larger. In the contrary, it will
be smaller for teachers that compensate.
Model the classroom intercept and slope as a fixed effect: ANCOVA with interactions
1. Model the global association between edu and score: 𝑦𝑖𝑗 = 𝛽0 + 𝛽1 𝑥𝑖𝑗 , in R: score ~ edu.
2. Model the classroom 𝑧𝑗 = classroom (as a fixed effect) as a vertical shift (intercept, 𝑢1𝑗 )
for each classroom 𝑗 indicated by 𝑧𝑖𝑗 : 𝑦𝑖𝑗 = 𝑢1𝑗 𝑧𝑖𝑗 , in R: score ~ classroom.
3. Model the classroom (as a fixed effect) specitic slope (𝑢𝛼𝑗 ): 𝑦𝑖 = 𝑢𝛼𝑗 𝑥𝑖 𝑧𝑗 score ~
edu:classroom. The 𝑥𝑖 𝑧𝑗 forms 3 new columns with values of 𝑥𝑖 for each edu level, ie.:
for 𝑧𝑗 classroom 1, 2 and 3.
4. Put everything together:
# print(sm.stats.anova_lm(lm_fx, typ=3))
# print(lm_fx.summary())
print(ancova_full.t_test('edu'))
print("MSE=%.3f" % ancova_full.mse_resid)
results.loc[len(results)] = ["ANCOVA-Full (biased)"] + \
list(rmse_coef_tstat_pval(mod=ancova_full, var='edu'))
The graphical representation of the model would be the same than the one provided for “Model
a classroom intercept as a fixed effect: ANCOVA”. The same slope (associated to edu) with
different interpcept, depicted as dashed black lines. Moreover we added, as solid lines, the
model’s prediction that account different slopes.
print("Model parameters:")
print(ancova_full.params)
plot_ancova_fullmodel(x='edu', y='score',
group='classroom', df=df, model=ancova_full)
Model parameters:
Intercept 6.973753
classroom[T.c1] 2.316540
(continues on next page)
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood␣
warnings.warn(
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→statsmodels/regression/mixed_linear_model.py:1634: UserWarning: Random effects␣
˓→covariance is singular
warnings.warn(msg)
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE␣
warnings.warn(msg, ConvergenceWarning)
The warning results in a singular fit (correlation estimated at 1) caused by too little variance
among the random slopes. It indicates that we should considere to remove random slopes.
print(results)
Random intercepts
1. LM-Global is wrong (consider residuals to be independent) and has a large error (RMSE,
Root Mean Square Error) since it does not adjust for classroom effect.
2. ANCOVA-Inter is “wrong” (consider residuals to be independent) but it has a small error
since it adjusts for classroom effect.
3. Aggregation is ok (units average are independent) but it looses a lot of degrees of freedom
(df = 2 = 3 classroom - 1 intercept) and a lot of informations.
4. Hierarchical model is ok (unit average are independent) and it has a reasonable error
(look at the statistic, not the RMSE).
5. LMM-Inter (with random intercept) is ok (it models residuals non-independence) and it
has a small error.
6. ANCOVA-Inter, Hierarchical model and LMM provide similar coefficients for the fixed ef-
fect. So if statistical significance is not the key issue, the “biased” ANCOVA is a reasonable
choice.
7. Hierarchical and LMM with random intercept are the best options (unbiased and sensi-
tive), with an advantage to LMM.
Random slopes
Modeling individual slopes in both ANCOVA-Full and LMM-Full decreased the statistics, sug-
gesting that the supplementary regressors (one per classroom) do not significantly improve the
fit of the model (see errors).
If we consider only 6 samples (𝑖 ∈ {1, 6}, two sample for each classroom 𝑗 ∈ {c0, c1, c2}) and
the random intercept model. Stacking the 6 observations, the equation 𝑦𝑖𝑗 = 𝛽0 + 𝛽1 𝑥𝑖𝑗 + 𝑢𝑗 𝑧𝑗 +
𝜀𝑖𝑗 gives :
where u1 = 𝑢1 , 𝑢2 , 𝑢3 are the 3 parameters associated with the 3 level of the single random
factor classroom.
This can be re-written in a more general form as:
y = X𝛽 + Zu + 𝜀,
The 𝑘 𝜎𝑘 ZZ′ define the 𝑁 × 𝑁 variance structure, using 𝑘 variance components, modeling the
∑︀
non-independance between the observations. In our case with only one component we get:
⎡ ⎤ ⎡ ⎤
𝜎𝑘 𝜎𝑘 0 0 0 0 𝜎 0 0 0 0 0
⎢𝜎𝑘 𝜎𝑘 0 0 0 0 ⎥ ⎢ 0 𝜎 0 0 0 0⎥
⎢ ⎥ ⎢ ⎥
⎢ 0 0 𝜎𝑘 𝜎𝑘 0 0 ⎥ ⎢ 0 0 𝜎 0 0 0⎥
V=⎢
⎢ ⎥+⎢ ⎥
⎢ 0 0 𝜎𝑘 𝜎𝑘 0 0 ⎥ ⎢ 0 0 0 𝜎 0 0⎥
⎥ ⎢
⎥
⎣ 0 0 0 0 𝜎𝑘 𝜎𝑘 ⎦ ⎣ 0 0 0 0 𝜎 0⎦
0 0 0 0 𝜎𝑘 𝜎𝑘 0 0 0 0 0 𝜎
⎡ ⎤
𝜎𝑘 + 𝜎 𝜎𝑘 0 0 0 0
⎢ 𝜎𝑘 𝜎 𝑘 + 𝜎 0 0 0 0 ⎥
⎢ ⎥
⎢ 0 0 𝜎 𝑘 + 𝜎 𝜎 𝑘 0 0 ⎥
=⎢
⎢ 0
⎥
⎢ 0 𝜎 𝑘 𝜎 𝑘 + 𝜎 0 0 ⎥ ⎥
⎣ 0 0 0 0 𝜎𝑘 + 𝜎 𝜎𝑘 ⎦
0 0 0 0 𝜎𝑘 𝜎𝑘 + 𝜎
LMM introduces the variance-covariance matrix V to reweigtht the residuals according to the
non-independance between observations. If V is known, of. The optimal value of be can be
obtained analytically using generalized least squares (GLS, minimisation of mean squared error
associated with Mahalanobis metric):
𝛽ˆ = X′ V
^ −1 X−1 X′ V
^ −1 y
In the general case, V is unknown, therefore iterative solvers should be use to estimate the fixed
effect 𝛽 and the parameters (𝜎, 𝜎𝑘 , . . .) of variance-covariance matrix V. The ML Maximum
Likelihood estimates provide biased solution for V because they do not take into account the
loss of degrees of freedom that results from estimating the fixed-effect parameters in 𝛽. For this
reason, REML (restricted (or residual, or reduced) maximum likelihood) is often preferred to
ML estimation.
Tests for Fixed-Effect Parameters
Quoted from [Brady et al. 2014]: “The approximate methods that apply to both t-tests and
F-tests take into account the presence of random effects and correlated residuals in an LMM.
Several of these approximate methods (e.g., the Satterthwaite method, or the “between-within”
method) involve different choices for the degrees of freedom used in” the approximate t-tests
and F-tests”.
5.3.7 References
• Brady et al. 2014: Brady T. West, Kathleen B. Welch, Andrzej T. Galecki, Linear Mixed
Models: A Practical Guide Using Statistical Software (2nd Edition), 2014
• Bruin 2006: Introduction to Linear Mixed Models, UCLA, Statistical Consulting Group.
• Statsmodel: Linear Mixed Effects Models
• Comparing R lmer to statsmodels MixedLM
• Statsmoels: Variance Component Analysis with nested groups
Multivariate statistics includes all statistical techniques for analyzing samples made of two or
more variables. The data set (a 𝑁 × 𝑃 matrix X) is a collection of 𝑁 independent samples
column vectors [x1 , . . . , x𝑖 , . . . , x𝑁 ] of length 𝑃
⎡ 𝑇 ⎤ ⎡ ⎤ ⎡ ⎤
−x1 − 𝑥11 · · · 𝑥1𝑗 · · · 𝑥1𝑃 𝑥11 . . . 𝑥1𝑃
⎢ .. ⎥ ⎢ .. .. .. ⎥ ⎢ .. .. ⎥
⎢ . ⎥ ⎢ . . . ⎥⎥ ⎢ . . ⎥
⎢
⎢ 𝑇 ⎥ ⎢ ⎥
X=⎢ ⎢−x𝑖 −⎥ = ⎢ 𝑥𝑖1 · · · 𝑥𝑖𝑗 · · · 𝑥𝑖𝑃 ⎥ = ⎢ X .
⎥ ⎢ ⎥ ⎢ ⎥
⎢ .. ⎥ ⎢ .. .. .. ⎥ ⎢ .. .. ⎥
⎥
⎣ . ⎦ ⎣ . . . ⎦ ⎣ . . ⎦
−x𝑇𝑃 − 𝑥𝑁 1 · · · 𝑥𝑁 𝑗 · · · 𝑥𝑁 𝑃 𝑥𝑁 1 . . . 𝑥𝑁 𝑃 𝑁 ×𝑃
Source: Wikipedia
Algebraic definition
The dot product, denoted ’‘·” of two 𝑃 -dimensional vectors a = [𝑎1 , 𝑎2 , ..., 𝑎𝑃 ] and a =
[𝑏1 , 𝑏2 , ..., 𝑏𝑃 ] is defined as
⎡ ⎤
𝑏1
⎢ .. ⎥
]︀ ⎢ . ⎥
⎢ ⎥
∑︁
𝑇 𝑇
[︀
a·b=a b= 𝑎𝑖 𝑏𝑖 = 𝑎1 . . . a . . . 𝑎𝑃 ⎢ ⎢ b ⎥.
⎥
.⎥
𝑖
⎣ .. ⎦
⎢
𝑏𝑃
The Euclidean norm of a vector can be computed using the dot product, as
√
‖a‖2 = a · a.
a · b = 0.
At the other extreme, if they are codirectional, then the angle between them is 0° and
a · b = ‖a‖2 ‖b‖2
a · a = ‖a‖22 .
The scalar projection (or scalar component) of a Euclidean vector a in the direction of a Eu-
clidean vector b is given by
𝑎𝑏 = ‖a‖2 cos 𝜃,
Fig. 4: Projection
import numpy as np
import scipy
import pandas as pd
# Plot
import matplotlib.pyplot as plt
from matplotlib import cm # color map
import seaborn as sns
import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * 1.)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
#%matplotlib inline
import numpy as np
np.random.seed(42)
a = np.random.randn(10)
b = np.random.randn(10)
np.dot(a, b)
np.float64(-4.085788532659923)
• The covariance matrix ΣXX is a symmetric positive semi-definite matrix whose element
in the 𝑗, 𝑘 position is the covariance between the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ elements of a random vector
i.e. the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ columns of X.
• The covariance matrix generalizes the notion of covariance to multiple dimensions.
• The covariance matrix describe the shape of the sample distribution around the mean
assuming an elliptical distribution:
where
𝑁
1 1 ∑︁
𝑠𝑗𝑘 = 𝑠𝑘𝑗 = xj 𝑇 x k = 𝑥𝑖𝑗 𝑥𝑖𝑘
𝑁 −1 𝑁 −1
𝑖=1
np.random.seed(42)
colors = sns.color_palette()
# Generate dataset
for i in range(len(mean)):
X[i] = np.random.multivariate_normal(mean[i], Cov[i], n_samples)
# Plot
for i in range(len(mean)):
# Points
plt.scatter(X[i][:, 0], X[i][:, 1], color=colors[i], label="class %i" % i)
# Means
plt.scatter(mean[i][0], mean[i][1], marker="o", s=200, facecolors='w',
edgecolors=colors[i], linewidth=2)
# Ellipses representing the covariance matrices
pystatsml.plot_utils.plot_cov_ellipse(Cov[i], pos=mean[i], facecolor='none',
linewidth=2, edgecolor=colors[i])
plt.axis('equal')
_ = plt.legend(loc='upper left')
url = 'https://raw.githubusercontent.com/plotly/datasets/master/mtcars.csv'
df = pd.read_csv(url)
df = df.drop('manufacturer', axis=1)
f, ax = plt.subplots(figsize=(5.5, 4.5))
cmap = sns.color_palette("RdBu_r", 11)
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(corr, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
lab=0
print(clusters)
reordered = np.concatenate(clusters)
R = corr.loc[reordered, reordered]
f, ax = plt.subplots(figsize=(5.5, 4.5))
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(R, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
[['mpg', 'cyl', 'disp', 'hp', 'wt', 'qsec', 'vs', 'carb'], ['am', 'gear'], ['drat
˓→']]
In statistics, precision is the reciprocal of the variance, and the precision matrix is the matrix
inverse of the covariance matrix.
It is related to partial correlations that measures the degree of association between two vari-
ables, while controlling the effect of other variables.
import numpy as np
print(Pcor.round(2))
# Precision matrix:
[[ 6.79 -3.21 -3.21 0. 0. 0. ]
[-3.21 6.79 -3.21 0. 0. 0. ]
[-3.21 -3.21 6.79 0. 0. 0. ]
[ 0. 0. 0. 5.26 -4.74 0. ]
[ 0. 0. 0. -4.74 5.26 0. ]
[ 0. 0. 0. 0. 0. 1. ]]
# Partial correlations:
[[ nan 0.47 0.47 -0. -0. -0. ]
[ nan nan 0.47 -0. -0. -0. ]
[ nan nan nan -0. -0. -0. ]
[ nan nan nan nan 0.9 -0. ]
[ nan nan nan nan nan -0. ]
[ nan nan nan nan nan nan]]
• The Mahalanobis distance is a measure of the distance between two points x and 𝜇 where
the dispersion (i.e. the covariance structure) of the samples is taken into account.
• The dispersion is considered through covariance matrix.
This is formally expressed as
√︁
𝐷𝑀 (x, 𝜇) = (x − 𝜇)𝑇 Σ−1 (x − 𝜇).
Intuitions
• Distances along the principal directions of dispersion are contracted since they correspond
to likely dispersion of points.
• Distances othogonal to the principal directions of dispersion are dilated since they corre-
spond to unlikely dispersion of points.
For example
√
𝐷𝑀 (1) = 1𝑇 Σ−1 1.
ones = np.ones(Cov.shape[0])
d_euc = np.sqrt(np.dot(ones, ones))
d_mah = np.sqrt(np.dot(np.dot(ones, Prec), ones))
The first dot product that distances along the principal directions of dispersion are contracted:
print(np.dot(ones, Prec))
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
%matplotlib inline
np.random.seed(40)
colors = sns.color_palette()
Covi = scipy.linalg.inv(Cov)
dm_m_x1 = scipy.spatial.distance.mahalanobis(mean, x1, Covi)
dm_m_x2 = scipy.spatial.distance.mahalanobis(mean, x2, Covi)
# Plot distances
vm_x1 = (x1 - mean) / d2_m_x1
vm_x2 = (x2 - mean) / d2_m_x2
jitter = .1
plt.plot([mean[0] - jitter, d2_m_x1 * vm_x1[0] - jitter],
[mean[1], d2_m_x1 * vm_x1[1]], color='k')
plt.plot([mean[0] - jitter, d2_m_x2 * vm_x2[0] - jitter],
[mean[1], d2_m_x2 * vm_x2[1]], color='k')
plt.legend(loc='lower right')
plt.text(-6.1, 3,
(continues on next page)
plt.text(-6.1, 3.5,
'Mahalanobis: d(m, x1) = %.1f>d(m, x2) = %.1f' % (dm_m_x1, dm_m_x2),␣
˓→color='r')
plt.axis('equal')
print('Euclidian d(m, x1) = %.2f < d(m, x2) = %.2f' % (d2_m_x1, d2_m_x2))
print('Mahalanobis d(m, x1) = %.2f > d(m, x2) = %.2f' % (dm_m_x1, dm_m_x2))
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Eu-
clidean distance. If the covariance matrix is diagonal, then the resulting distance measure is
called a normalized Euclidean distance.
More generally, the Mahalanobis distance is a measure of the distance between a point x and a
distribution 𝒩 (x|𝜇, Σ). It is a multi-dimensional generalization of the idea of measuring how
many standard deviations away x is from the mean. This distance is zero if x is at the mean,
and grows as x moves away from the mean: along each principal component axis, it measures
the number of standard deviations from x to the mean of the distribution.
The distribution, or probability density function (PDF) (sometimes just density), of a continuous
random variable is a function that describes the relative likelihood for this random variable to
take on a given value.
The multivariate normal distribution, or multivariate Gaussian distribution, of a 𝑃 -dimensional
random vector x = [𝑥1 , 𝑥2 , . . . , 𝑥𝑃 ]𝑇 is
1 1
𝒩 (x|𝜇, Σ) = exp{− (x − 𝜇)𝑇 Σ−1 (x − 𝜇)}.
(2𝜋)𝑃/2 |Σ|1/2 2
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import multivariate_normal
#from mpl_toolkits.mplot3d import Axes3D
P = X.shape[1]
det = np.linalg.det(sigma)
norm_const = 1.0 / (((2*np.pi) ** (P/2)) * np.sqrt(det))
X_mu = X - mu
inv = np.linalg.inv(sigma)
d2 = np.sum(np.dot(X_mu, inv) * X_mu, axis=1)
return norm_const * np.exp(-0.5 * d2)
# x, y grid
x, y = np.mgrid[-3:3:.1, -3:3:.1]
X = np.stack((x.ravel(), y.ravel())).T
norm = multivariate_normal_pdf(X, mean, sigma).reshape(x.shape)
# Do it with scipy
norm_scpy = multivariate_normal(mu, sigma).pdf(np.stack((x, y), axis=2))
assert np.allclose(norm, norm_scpy)
# Plot
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
surf = ax.plot_surface(x, y, norm, rstride=3,
cstride=3, cmap=plt.cm.coolwarm,
linewidth=1, antialiased=False
)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
5.4.8 Exercises
Sources:
• Scipy Resampling and Monte Carlo Methods
# Manipulate data
import numpy as np
import pandas as pd
# Statistics
import scipy.stats
import statsmodels.api as sm
#import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import jarque_bera
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
Realizations of random walks obtained by Monte Carlo simulation Plot Few random walks (tra-
jectories), ie, 𝑆𝑛 for 𝑛 = 0 to 200
√︀
Distribution of 𝑆𝑛 vs 𝒩 (0, (𝑛))
Permutation test:
• The test involves two or more samples assuming that values can be randomly permuted
under the null hypothesis.
• The test is Resampling procedure to estimate the distribution of a parameter or any
statistic under the null hypothesis, i.e., calculated on the permuted data. This parameter
or any statistic is called the estimator.
• Statistical inference is conducted by computing the proportion of permuted values of
the estimator that are “more extreme” than the true one, providing an estimation of the
p-value.
• Permutation tests are a subset of non-parametric statistics, useful when the distribution of
the estimator (under H0) is unknown or requires complicated formulas.
Permutation test procedure
1. Estimate the observed parameter or statistic 𝜃ˆ = 𝑆(𝑋) on the initial dataset 𝑋 of size 𝑁 .
We call it the observed statistic.
2. Generate 𝑅 samples (called randomized samples) [𝑋1 , . . . 𝑋𝑟 , . . . 𝑋𝑅 ] from the initial
dataset by permutation of the values between the two samples.
3. Distribution of the estimator under HO: For each random sample 𝑟, compute the estimator
𝜃ˆ𝑟 = 𝑆(𝑋𝑟 ). The set of {𝜃ˆ𝑟 } provides an estimate the distribution of 𝑃 (𝜃|𝐻0) (under the
null hypothesis).
4. Compute statistics of the estimator under the null hypothesis using randomized estimates
𝜃ˆ𝑟 ’s:
• Mean (under 𝐻0):
𝑅
1 ∑︁ ˆ
𝜃¯𝑅 = 𝜃𝑟
𝑟
𝑟=1
ˆ 𝜃 (under 𝐻0):
• Standard Error 𝑆𝐸 𝑅
⎯
⎸ 𝐾
⎸ 1 ∑︁ ¯
ˆ 𝜃
𝑆𝐸 𝑅
=⎷ (𝜃𝑅 − 𝜃ˆ𝑟 )2
𝑅−1
𝑟=1
ˆ card(𝜃ˆ𝑟 ≥ 𝜃)
ˆ
𝑃 (𝜃 ≥ 𝜃|𝐻0) ≈
𝑅
– Two-sided p-value:
card(𝜃ˆ𝑟 ≤ 𝜃)
ˆ + card(𝜃ˆ𝑟 ≥ 𝜃)
ˆ
𝑃 (𝜃 ≤ 𝜃ˆ or 𝜃 ≥ 𝜃|𝐻0)
ˆ ≈
𝑅
Parameters
----------
x : array
the datasets
estimator : callable
the estimator function taking x as argument returning the estimator value
(scalar)
n_perms : int, optional
the number of permutation, by default 1000
Return
------
Observed estimate
Mean of randomized estimates
Standard Error of randomized estimates
Two-sided p-value
Randomized distribution estimates (density_values, bins)
"""
# 2. Permuted samples
# Randomly pick the sign with the function:
# np.random.choice([-1, 1], size=len(x), replace=True)
# 4. Randomized Statistics
Example, we load the monthly revenue figures of for 100 stores before and after a market-
ing campaigns. We compute the difference (𝑥𝑖 = 𝑥after 𝑖 − 𝑥before
𝑖 ) for each store 𝑖. Under the
after
null hypothesis, i.e., no effect of the campaigns, 𝑥𝑖 and 𝑥before could be permuted, which
𝑖
is equivalent to ∑︀
randomly switch the sign of 𝑥𝑖 . Here we will focus on the sample mean
𝜃ˆ = 𝑆(𝑋) = 1/𝑛 𝑖 𝑥𝑖 as statistic of interest.
x = df.after - df.before
plt.hist(x, fill=False)
plt.axvline(x=0, color="b", label=r'No effect: 0')
plt.axvline(x=x.mean(), color="r", ls='-', label=r'$\bar{x}=%.2f$' % x.mean())
plt.legend()
_ = plt.title(r'Distribution of the sales changes $x_i = x_i^\text{after} - \
x_i^\text{before}$')
Plot
pystatsml.plot_utils.plot_pvalue_under_h0(stat_vals=bins[1:], stat_probs=theta_
˓→density_R,
Similar procedure can be conducted with many statistic e.g., the t-statistic (same results):
def ttstat(x):
return (np.mean(x) - 0) / np.std(x, ddof=1) * np.sqrt(len(x))
Or the median:
5.5.3 Bootstrapping
Bootstrapping:
• Resampling procedure to estimate the distribution of a statistic or parameter of interest,
called the estimator.
• Derive estimates of variability for estimator (bias, standard error, confidence intervals,
etc.).
• Statistical inference is conducted by looking the confidence interval (CI) contains the
null hypothesis.
• Nonparametric approach statistical inference, useful when model assumptions is in doubt,
unknown or requires complicated formulas.
• Bootstrapping with replacement has favorable performances (Efron 1979, and 1986)
compared to prior methods like the jackknife that sample without replacement
• Regularize models by fitting several models on bootstrap samples and averaging their
predictions (see Baging and random-forest). See machine learning chapter.
Note that compared to random permutation, bootstrapping sample the distribution under the
alternative hypothesis, it doesn’t consider the distribution under the null hypothesis. A great
advantage of bootstrap is its simplicity of the procedure:
ˆ𝑏𝜃 = 𝜃¯𝐵 − 𝜃ˆ
𝐵
𝐶𝐼95% = [𝜃ˆ1 = 𝑄(2.5%), 𝜃ˆ2 = 𝑄(97.5%)] i.e., the 2.5%, 97.5% quantiles estimators out of the 𝜃ˆ𝑏′ 𝑠
Application using the monthly revenue of 100 stores before and after a marketing campaigns,
using the difference (𝑥𝑖 = 𝑥after −𝑥before
∑︀
𝑖 𝑖 ) for each store 𝑖. If the average difference 𝑥
¯ = 1/𝑛 𝑖 𝑥𝑖
is positive (resp. negative), then the marketing campaigns will be considered as efficient (resp.
detrimental). We will use bootstrapping to compute the confidence interval (CI) and see if 0 in
comprised in the CI.
x = df.after - df.before
S = np.mean
# 1. Model parameters
theta_hat = S(x)
# 2. Bootstrapped samples
# 4. Bootstrap Statistics
# Bootstrap estimate
theta_bar_B = np.mean(theta_hats_B)
# Standard Error
se_hat_B = np.sqrt(1 / (B - 1) * np.sum((theta_bar_B - theta_hats_B) ** 2))
print(
"Est.: {:.2f}, Boot Est.: {:.2f}, bias: {:e},\
Boot SE: {:.2f}, CI: [{:.5f}, {:.5f}]"\
.format(theta_hat, theta_bar_B, bias_hat_B, se_hat_B, ci95[0], ci95[1]))
Est.: 2.26, Boot Est.: 2.19, bias: -7.201946e-02, Boot SE: 0.95, CI: [0.45256,
˓→ 4.08465]
SIX
UNSUPERVISED LEARNING
6.1 Introduction
In machine learning and statistics, dimensionality reduction or dimension reduction is the pro-
cess of reducing the number of features under consideration, and can be divided into feature
selection (not addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and builds derived values (fea-
tures) intended to be informative and non-redundant, facilitating the subsequent learning and
generalization steps, and in some cases leading to better human interpretations. Feature extrac-
tion is related to dimensionality reduction.
239
Statistics and Machine Learning in Python, Release 0.8
Decompose the data matrix X𝑁 ×𝑃 into a product of a mixing matrix U𝑁 ×𝐾 and a dictionary
matrix V𝑃 ×𝐾 .
X = UV𝑇 ,
If we consider only a subset of components 𝐾 < 𝑟𝑎𝑛𝑘(X) < min(𝑃, 𝑁 − 1) , X is approximated
by a matrix X̂:
X ≈ X̂ = UV𝑇 ,
Each line of xi is a linear combination (mixing ui ) of dictionary items V.
𝑁 𝑃 -dimensional data points lie in a space whose dimension is less than 𝑁 − 1 (2 dots lie on a
line, 3 on a plane, etc.).
X = UDV𝑇 ,
where
⎡ ⎤ ⎡ ⎤
𝑥11 𝑥1𝑃 𝑢11⎡ 𝑢1𝐾
⎤⎡ ⎤
⎢ ⎥ ⎢ ⎥ 𝑑1 0 𝑣11 𝑣1𝑃
⎢ ⎥ ⎢ ⎥
⎢
⎢ X ⎥=⎢
⎥ ⎢ U ⎥⎣
⎥ D ⎦⎣ V𝑇 ⎦.
⎣ ⎦ ⎣ ⎦ 0 𝑑𝐾 𝑣𝐾1 𝑣𝐾𝑃
𝑥𝑁 1 𝑥𝑁 𝑃 𝑢𝑁 1 𝑢𝑁 𝐾
U: right-singular
V transforms correlated variables (X) into a set of uncorrelated ones (UD) that better expose
the various relationships among the original data items.
X = UDV𝑇 , (6.1)
𝑇
XV = UDV V, (6.2)
XV = UDI, (6.3)
XV = UD (6.4)
At the same time, SVD is a method for identifying and ordering the dimensions along which
data points exhibit the most variation.
import numpy as np
import scipy
from sklearn.decomposition import PCA
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
print(X.shape)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(U[:, 0], U[:, 1], s=50)
plt.axis('equal')
plt.title("U: Rotated and scaled data")
plt.subplot(132)
# Project data
PC = np.dot(X, Vh.T)
plt.scatter(PC[:, 0], PC[:, 1], s=50)
plt.axis('equal')
plt.title("XV: Rotated data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], s=50)
for i in range(Vh.shape[0]):
plt.arrow(x=0, y=0, dx=Vh[i, 0], dy=Vh[i, 1], head_width=0.2,
head_length=0.2, linewidth=2, fc='r', ec='r')
plt.text(Vh[i, 0], Vh[i, 1],'v%i' % (i+1), color="r", fontsize=15,
horizontalalignment='right', verticalalignment='top')
plt.axis('equal')
plt.ylim(-4, 4)
plt.tight_layout()
(100, 2)
Sources:
• PCA with scikit-learn
• C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
• Everything you did and didn’t know about PCA
• Principal Component Analysis in 3 Simple Steps
Principles
• Principal components analysis is the main method used for linear dimension reduction.
• The idea of principal component analysis is to find the 𝐾 principal components di-
rections (called the loadings) V𝐾×𝑃 that capture the variation in the data as much as
possible.
• It converts a set of 𝑁 𝑃 -dimensional observations N𝑁 ×𝑃 of possibly correlated variables
into a set of 𝑁 𝐾-dimensional samples C𝑁 ×𝐾 , where the 𝐾 < 𝑃 . The new variables are
linearly uncorrelated. The columns of C𝑁 ×𝐾 are called the principal components.
• The dimension reduction is obtained by using only 𝐾 < 𝑃 components that exploit corre-
lation (covariance) among the original variables.
• PCA is mathematically defined as an orthogonal linear transformation V𝐾×𝑃 that trans-
forms the data to a new coordinate system such that the greatest variance by some projec-
tion of the data comes to lie on the first coordinate (called the first principal component),
the second greatest variance on the second coordinate, and so on.
C𝑁 ×𝐾 = X𝑁 ×𝑃 V𝑃 ×𝐾
• PCA can be thought of as fitting a 𝑃 -dimensional ellipsoid to the data, where each axis of
the ellipsoid represents a principal component. If some axis of the ellipse is small, then the
variance along that axis is also small, and by omitting that axis and its corresponding prin-
cipal component from our representation of the dataset, we lose only a commensurately
small amount of information.
• Finding the 𝐾 largest axes of the ellipse will permit to project the data onto a space having
dimensionality 𝐾 < 𝑃 while maximizing the variance of the projected data.
Dataset preprocessing
Centering
Consider a data matrix, X , with column-wise zero empirical mean (the sample mean of each
column has been shifted to zero), ie. X is replaced by X − 1x̄𝑇 .
Standardizing
Optionally, standardize the columns, i.e., scale them by their standard-deviation. Without stan-
dardization, a variable with a high variance will capture most of the effect of the PCA. The
principal direction will be aligned with this variable. Standardization will, however, raise noise
variables to the save level as informative variables.
The covariance matrix of centered standardized data is the correlation matrix.
To begin with, consider the projection onto a one-dimensional space (𝐾 = 1). We can define
the direction of this space using a 𝑃 -dimensional vector v, which for convenience (and without
loss of generality) we shall choose to be a unit vector so that ‖v‖2 = 1 (note that we are only
interested in the direction defined by v, not in the magnitude of v itself). PCA consists of two
mains steps:
Projection in the directions that capture the greatest variance
Each 𝑃 -dimensional data point x𝑖 is then projected onto v, where the coordinate (in the co-
ordinate system of v) is a scalar value, namely x𝑇𝑖 v. I.e., we want to find the vector v that
maximizes these coordinates along v, which we will see corresponds to maximizing the vari-
ance of the projected data. This is equivalently expressed as
1 ∑︁ (︀ 𝑇 )︀2
v = arg max x𝑖 v .
‖v‖=1 𝑁
𝑖
where SXX is a biased estiamte of the covariance matrix of the data, i.e.
1 𝑇
SXX = X X.
𝑁
We now maximize the projected variance v𝑇 SXX v with respect to v. Clearly, this has to be a
constrained maximization to prevent ‖v2 ‖ → ∞. The appropriate constraint comes from the
normalization condition ‖v‖2 ≡ ‖v‖22 = v𝑇 v = 1. To enforce this constraint, we introduce a
Lagrange multiplier that we shall denote by 𝜆, and then make an unconstrained maximization
of
By setting the gradient with respect to v equal to zero, we see that this quantity has a stationary
point when
SXX v = 𝜆v.
v𝑇 SXX v = 𝜆,
and so the variance will be at a maximum when v is equal to the eigenvector corresponding to
the largest eigenvalue, 𝜆. This eigenvector is known as the first principal component.
We can define additional principal components in an incremental fashion by choosing each new
direction to be that which maximizes the projected variance amongst all possible directions that
are orthogonal to those already considered. If we consider the general case of a 𝐾-dimensional
projection space, the optimal linear projection for which the variance of the projected data is
maximized is now defined by the 𝐾 eigenvectors, v1 , . . . , vK , of the data covariance matrix
SXX that corresponds to the 𝐾 largest eigenvalues, 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 𝜆𝐾 .
Back to SVD
X𝑇 X = (UDV𝑇 )𝑇 (UDV𝑇 )
= VD𝑇 U𝑇 UDV𝑇
= VD2 V𝑇
V𝑇 X𝑇 XV = D2
1 1
V𝑇 X𝑇 XV = D2
𝑁 −1 𝑁 −1
1
V𝑇 SXX V = D2
𝑁 −1
.
Considering only the 𝑘 𝑡ℎ right-singular vectors v𝑘 associated to the singular value 𝑑𝑘
1
vk 𝑇 SXX vk = 𝑑2 ,
𝑁 −1 𝑘
It turns out that if you have done the singular value decomposition then you already have
the Eigenvalue decomposition for X𝑇 X. Where - The eigenvectors of SXX are equivalent to
the right singular vectors, V, of X. - The eigenvalues, 𝜆𝑘 , of SXX , i.e. the variances of the
components, are equal to 𝑁 1−1 times the squared singular values, 𝑑𝑘 .
Moreover computing PCA with SVD do not require to form the matrix X𝑇 X, so computing the
SVD is now the standard way to calculate a principal components analysis from a data matrix,
unless only a handful of components are required.
PCA outputs
The SVD or the eigendecomposition of the data covariance matrix provides three main quanti-
ties:
1. Principal component directions or loadings are the eigenvectors of X𝑇 X. The V𝐾×𝑃
or the right-singular vectors of an SVD of X are called principal component directions of
X. They are generally computed using the SVD of X.
C𝑁 ×𝐾 = UDV𝑇𝑁 ×𝑃 V𝑃 ×𝐾 (6.5)
C𝑁 ×𝐾 = UD𝑇𝑁 ×𝐾 I𝐾×𝐾 (6.6)
C𝑁 ×𝐾 = UD𝑇𝑁 ×𝐾 (6.7)
(6.8)
Thus c𝑗 = Xv𝑗 = u𝑗 𝑑𝑗 , for 𝑗 = 1, . . . 𝐾. Hence u𝑗 is simply the projection of the row vectors of
X, i.e., the input predictor vectors, on the direction v𝑗 , scaled by 𝑑𝑗 .
⎡ ⎤
𝑥1,1 𝑣1,1 + . . . + 𝑥1,𝑃 𝑣1,𝑃
⎢ 𝑥2,1 𝑣1,1 + . . . + 𝑥2,𝑃 𝑣1,𝑃 ⎥
c1 = ⎢ ..
⎢ ⎥
⎥
⎣ . ⎦
𝑥𝑁,1 𝑣1,1 + . . . + 𝑥𝑁,𝑃 𝑣1,𝑃
1
𝑣𝑎𝑟(c𝑘 ) = (Xv𝑘 )2 (6.9)
𝑁 −1
1
= (u𝑘 𝑑𝑘 )2 (6.10)
𝑁 −1
1
= 𝑑2 (6.11)
𝑁 −1 𝑘
We must choose 𝐾 * ∈ [1, . . . , 𝐾], the number of required components. This can be done by
calculating the explained variance ratio of the 𝐾 * first components and by choosing 𝐾 * such
that the cumulative explained variance ratio is greater than some given threshold (e.g., ≈
90%). This is expressed as
∑︀𝐾 *
𝑗 𝑣𝑎𝑟(c𝑘 )
cumulative explained variance(c𝑘 ) = ∑︀𝐾 .
𝑗 𝑣𝑎𝑟(c𝑘 )
PCs
Plot the samples projeted on first the principal components as e.g. PC1 against PC2.
PC directions
Exploring the loadings associated with a component provides the contribution of each original
variable in the component.
Remark: The loadings (PC directions) are the coefficients of multiple regression of PC on origi-
nal variables:
c = Xv (6.12)
𝑇 𝑇
X c = X Xv (6.13)
(X𝑇 X)−1 X𝑇 c = v (6.14)
Another way to evaluate the contribution of the original variables in each PC can be obtained
by computing the correlation between the PCs and the original variables, i.e. columns of X,
denoted x𝑗 , for 𝑗 = 1, . . . , 𝑃 . For the 𝑘 𝑡ℎ PC, compute and plot the correlations with all original
variables
𝑐𝑜𝑟(c𝑘 , x𝑗 ), 𝑗 = 1 . . . 𝐾, 𝑗 = 1 . . . 𝐾.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
PC = pca.transform(X)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1"); plt.ylabel("x2")
plt.subplot(122)
plt.scatter(PC[:, 0], PC[:, 1])
plt.xlabel("PC1 (var=%.2f)" % pca.explained_variance_ratio_[0])
plt.ylabel("PC2 (var=%.2f)" % pca.explained_variance_ratio_[1])
plt.axis('equal')
plt.tight_layout()
[0.93646607 0.06353393]
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
n_row, n_col = 2, 3
n_components = n_row * n_col
image_shape = (64, 64)
# Utils function
def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray):
(continues on next page)
Preprocessing
# global centering
faces_centered = faces - faces.mean(axis=0)
# local centering
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)
pca = decomposition.PCA(n_components=n_components)
pca.fit(faces_centered)
(continues on next page)
6.2.4 Exercises
• Compute the cumulative explained variance ratio. Determine the number of components
𝐾 by your computed values.
• Print the 𝐾 principal components directions and correlations of the 𝐾 principal compo-
nents with the original variables. Interpret the contribution of the original variables into
the PC.
• Plot the samples projected into the 𝐾 first PCs.
• Color samples by their species.
Sources:
• Scikit-learn documentation
• Wikipedia
Nonlinear dimensionality reduction or manifold learning cover unsupervised methods that
attempt to identify low-dimensional manifolds within the original 𝑃 -dimensional space that
represent high data density. Then those methods provide a mapping from the high-dimensional
space to the low-dimensional embedding.
Resources:
• wikipedia
• Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer, Second Edition.
The purpose of MDS is to find a low-dimensional projection of the data in which the pairwise
distances between data points is preserved, as closely as possible (in a least-squares sense).
• Let D be the (𝑁 × 𝑁 ) pairwise distance matrix where 𝑑𝑖𝑗 is a distance between points 𝑖
and 𝑗.
• The MDS concept can be extended to a wide variety of data types specified in terms of a
similarity matrix.
Given the dissimilarity (distance) matrix D𝑁 ×𝑁 = [𝑑𝑖𝑗 ], MDS attempts to find 𝐾-dimensional
projections of the 𝑁 points x1 , . . . , x𝑁 ∈ R𝐾 , concatenated in an X𝑁 ×𝐾 matrix, so that 𝑑𝑖𝑗 ≈
‖x𝑖 − x𝑗 ‖ are as close as possible. This can be obtained by the minimization of a loss function
called the stress function
∑︁
stress(X) = (𝑑𝑖𝑗 − ‖x𝑖 − x𝑗 ‖)2 .
𝑖̸=𝑗
The Sammon mapping performs better at preserving small distances compared to the least-
squares scaling.
Example
The eurodist datset provides the road distances (in kilometers) between 21 cities in Europe.
Given this matrix of pairwise (non-Euclidean) distances D = [𝑑𝑖𝑗 ], MDS can be used to recover
the coordinates of the cities in some Euclidean referential whose orientation is arbitrary.
import pandas as pd
import numpy as np
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
np.random.seed(42)
df = pd.read_csv(url)
print(df.iloc[:5, :5])
city = df["city"]
D = np.array(df.iloc[:, 1:]) # Distance matrix
X = mds.fit_transform(D)
(np.float64(-1894.091917806915),
np.float64(2914.3554370871243),
np.float64(-1712.973369719749),
np.float64(2145.4370687880146))
We must choose 𝐾 * ∈ {1, . . . , 𝐾} the number of required components. Plotting the values of
the stress function, obtained using 𝑘 ≤ 𝑁 − 1 components. In general, start with 1, . . . 𝐾 ≤ 4.
Choose 𝐾 * where you can clearly distinguish an elbow in the stress curve.
Thus, in the plot below, we choose to retain information accounted for by the first two compo-
nents, since this is where the elbow is in the stress curve.
print(stress)
plt.plot(k_range, stress)
plt.xlabel("k")
plt.ylabel("stress")
Exercises
6.3.2 Isomap
6.3.3 t-SNE
Sources:
• Wikipedia
• scikit-learn
Principles
1. Construct a (Gaussian) probability distribution between pairs of object in input (high-
dimensional) space.
ax = fig.add_subplot(131, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)
plt.title('2D "S shape" manifold in 3D')
ax = fig.add_subplot(132)
plt.scatter(X_isomap[:, 0], X_isomap[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("Isomap")
plt.xlabel("First component")
plt.ylabel("Second component")
ax = fig.add_subplot(133)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("t-SNE")
plt.xlabel("First component")
plt.ylabel("Second component")
plt.axis('tight')
(np.float64(-68.37603721618652),
np.float64(64.30499229431152),
np.float64(-14.287820672988891),
np.float64(17.26294598579407))
6.3.4 Exercises
Run Manifold learning on handwritten digits: Locally Linear Embedding, Isomap with scikit-
learn
6.4 Clustering
Wikipedia: Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense or another)
to each other than to those in other groups (clusters). Clustering is one of the main task of
exploratory data mining, and a common technique for statistical data analysis, used in many
fields, including machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.
Source: Clustering with Scikit-learn.
This is known as the 1-of-𝐾 coding scheme. We can then define an objective function, denoted
inertia, as
𝑁 ∑︁
∑︁ 𝐾
𝐽(𝑟, 𝜇) = 𝑟𝑖𝑘 ‖𝑥𝑖 − 𝜇𝑘 ‖22
𝑖 𝑘
which represents the sum of the squares of the Euclidean distances of each data point to its
assigned vector 𝜇𝑘 . Our goal is to find values for the {𝑟𝑖𝑘 } and the {𝜇𝑘 } so as to minimize the
function 𝐽. We can do this through an iterative procedure in which each iteration involves two
successive steps corresponding to successive optimizations with respect to the 𝑟𝑖𝑘 and the 𝜇𝑘
. First we choose some initial values for the 𝜇𝑘 . Then in the first phase we minimize 𝐽 with
respect to the 𝑟𝑖𝑘 , keeping the 𝜇𝑘 fixed. In the second phase we minimize 𝐽 with respect to
the 𝜇𝑘 , keeping 𝑟𝑖𝑘 fixed. This two-stage optimization process is then repeated until conver-
gence. We shall see that these two stages of updating 𝑟𝑖𝑘 and 𝜇𝑘 correspond respectively to the
expectation (E) and maximization (M) steps of the expectation-maximisation (EM) algorithm,
and to emphasize this we shall use the terms E step and M step in the context of the 𝐾-means
algorithm.
Consider first the determination of the 𝑟𝑖𝑘 . Because 𝐽 in is a linear function of 𝑟𝑖𝑘 , this opti-
mization can be performed easily to give a closed form solution. The terms involving different
𝑖 are independent and so we can optimize for each 𝑖 separately by choosing 𝑟𝑖𝑘 to be 1 for
whichever value of 𝑘 gives the minimum value of ||𝑥𝑖 − 𝜇𝑘 ||2 . In other words, we simply assign
the 𝑖th data point to the closest cluster centre. More formally, this can be expressed as
{︃
1, if 𝑘 = arg min𝑗 ||𝑥𝑖 − 𝜇𝑗 ||2 .
𝑟𝑖𝑘 = (6.15)
0, otherwise.
Now consider the optimization of the 𝜇𝑘 with the 𝑟𝑖𝑘 held fixed. The objective function 𝐽 is a
quadratic function of 𝜇𝑘 , and it can be minimized by setting its derivative with respect to 𝜇𝑘 to
zero giving
∑︁
2 𝑟𝑖𝑘 (𝑥𝑖 − 𝜇𝑘 ) = 0
𝑖
The denominator in this expression is equal to the number of points assigned to cluster 𝑘, and so
this result has a simple interpretation, namely set 𝜇𝑘 equal to the mean of all of the data points
𝑥𝑖 assigned to cluster 𝑘. For this reason, the procedure is known as the 𝐾-means algorithm.
The two phases of re-assigning data points to clusters and re-computing the cluster means are
repeated in turn until there is no further change in the assignments (or until some maximum
number of iterations is exceeded). Because each phase reduces the value of the objective func-
tion 𝐽, convergence of the algorithm is assured. However, it may converge to a local rather than
global minimum of 𝐽.
# Plot
import matplotlib.pyplot as plt
(continues on next page)
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
colors = sns.color_palette()
iris = datasets.load_iris()
X = iris.data[:, :2] # use only 'sepal length and sepal width'
y_iris = iris.target
km2 = cluster.KMeans(n_clusters=2).fit(X)
km3 = cluster.KMeans(n_clusters=3).fit(X)
km4 = cluster.KMeans(n_clusters=4).fit(X)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1],
c=[colors[lab] for lab in km3.predict(X)])
plt.title("K=3, J=%.2f" % km3.inertia_)
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1],
c=[colors[lab] for lab in km4.predict(X)])
plt.title("K=4, J=%.2f" % km4.inertia_)
Exercises
1. Analyse clusters
• Analyse the plot above visually. What would a good value of 𝐾 be?
• If you instead consider the inertia, the value of 𝐽, what would a good value of 𝐾 be?
• Explain why there is such difference.
• For 𝐾 = 2 why did 𝐾-means clustering not find the two “natural” clusters? See the
assumptions of 𝐾-means: See sklearn doc.
Write a function kmeans(X, K) that return an integer vector of the samples’ labels.
The Gaussian mixture model (GMM) is a simple linear superposition of Gaussian components
over the data, aimed at providing a rich class of density models. We turn to a formulation of
Gaussian mixtures in terms of discrete latent variables: the 𝐾 hidden classes to be discovered.
Differences compared to 𝐾-means:
• Whereas the 𝐾-means algorithm performs a hard assignment of data points to clusters, in
which each data point is associated uniquely with one cluster, the GMM algorithm makes
a soft assignment based on posterior probabilities.
• Whereas the classic 𝐾-means is only based on Euclidean distances, classic GMM use a
Mahalanobis distances that can deal with non-spherical distributions. It should be noted
that Mahalanobis could be plugged within an improved version of 𝐾-Means clustering.
The Mahalanobis distance is unitless and scale-invariant, and takes into account the cor-
relations of the data set.
The Gaussian mixture distribution can be written as a linear superposition of 𝐾 Gaussians in
the form:
𝐾
∑︁
𝑝(𝑥) = 𝒩 (𝑥 | 𝜇𝑘 , Σ𝑘 )𝑝(𝑘),
𝑘=1
where:
• The 𝑝(𝑘) are ∑︀
the mixing coefficients also know as the class probability of class 𝑘, and they
sum to one: 𝐾 𝑘=1 𝑝(𝑘) = 1.
To compute the classes parameters: 𝑝(𝑘), 𝜇𝑘 , Σ𝑘 we sum over all samples, by weighting each
sample 𝑖 by its responsibility or contribution
∑︀ to class 𝑘: 𝑝(𝑘 | 𝑥𝑖 ) such that for each point its
contribution to all classes sum to one 𝑘 𝑝(𝑘 | 𝑥𝑖 ) = 1. This contribution is the conditional
probability of class 𝑘 given 𝑥: 𝑝(𝑘 | 𝑥) (sometimes called the posterior). It can be computed
using Bayes’ rule:
𝑝(𝑥 | 𝑘)𝑝(𝑘)
𝑝(𝑘 | 𝑥) = (6.16)
𝑝(𝑥)
𝒩 (𝑥 | 𝜇𝑘 , Σ𝑘 )𝑝(𝑘)
= ∑︀𝐾 (6.17)
𝑘=1 𝒩 (𝑥 | 𝜇𝑘 , Σ𝑘 )𝑝(𝑘)
Since the class parameters, 𝑝(𝑘), 𝜇𝑘 and Σ𝑘 , depend on the responsibilities 𝑝(𝑘 | 𝑥) and the
responsibilities depend on class parameters, we need a two-step iterative algorithm: the
expectation-maximization (EM) algorithm. We discuss this algorithm next.
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect
to the parameters (comprised of the means and covariances of the components and the mixing
coefficients).
Initialize the means 𝜇𝑘 , covariances Σ𝑘 and mixing coefficients 𝑝(𝑘)
1. E step. For each sample 𝑖, evaluate the responsibilities for each class 𝑘 using the current
parameter values
𝒩 (𝑥𝑖 | 𝜇𝑘 , Σ𝑘 )𝑝(𝑘)
𝑝(𝑘 | 𝑥𝑖 ) = ∑︀𝐾
𝑘=1 𝒩 (𝑥𝑖 | 𝜇𝑘 , Σ𝑘 )𝑝(𝑘)
2. M step. For each class, re-estimate the parameters using the current responsibilities
𝑁
1 ∑︁
𝜇new
𝑘 = 𝑝(𝑘 | 𝑥𝑖 )𝑥𝑖 (6.18)
𝑁𝑘
𝑖=1
𝑁
1 ∑︁
Σnew
𝑘 = 𝑝(𝑘 | 𝑥𝑖 )(𝑥𝑖 − 𝜇new new 𝑇
𝑘 )(𝑥𝑖 − 𝜇𝑘 ) (6.19)
𝑁𝑘
𝑖=1
𝑁𝑘
𝑝new (𝑘) = (6.20)
𝑁
3. Evaluate the log-likelihood
𝑁
{︃ 𝐾 }︃
∑︁ ∑︁
ln 𝒩 (𝑥|𝜇𝑘 , Σ𝑘 )𝑝(𝑘) ,
𝑖=1 𝑘=1
and check for convergence of either the parameters or the log-likelihood. If the convergence
criterion is not satisfied return to step 1.
import numpy as np
from sklearn import datasets
import sklearn
(continues on next page)
import pystatsml.plot_utils
colors = sns.color_palette()
iris = datasets.load_iris()
X = iris.data[:, :2] # 'sepal length (cm)''sepal width (cm)'
y_iris = iris.target
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm2.predict(X)])
for i in range(gmm2.covariances_.shape[0]):
pystatsml.plot_utils.plot_cov_ellipse(cov=gmm2.covariances_[i, :],
pos=gmm2.means_[i, :],
facecolor='none', linewidth=2, edgecolor=colors[i])
plt.scatter(gmm2.means_[i, 0], gmm2.means_[i, 1], edgecolor=colors[i],
marker="o", s=100, facecolor="w", linewidth=2)
plt.title("K=2")
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm3.predict(X)])
for i in range(gmm3.covariances_.shape[0]):
pystatsml.plot_utils.plot_cov_ellipse(cov=gmm3.covariances_[i, :],
pos=gmm3.means_[i, :],
facecolor='none', linewidth=2, edgecolor=colors[i])
plt.scatter(gmm3.means_[i, 0], gmm3.means_[i, 1], edgecolor=colors[i],
marker="o", s=100, facecolor="w", linewidth=2)
plt.title("K=3")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm4.predict(X)])
for i in range(gmm4.covariances_.shape[0]):
pystatsml.plot_utils.plot_cov_ellipse(cov=gmm4.covariances_[i, :],
pos=gmm4.means_[i, :],
facecolor='none', linewidth=2, edgecolor=colors[i])
plt.scatter(gmm4.means_[i, 0], gmm4.means_[i, 1], edgecolor=colors[i],
marker="o", s=100, facecolor="w", linewidth=2)
_ = plt.title("K=4")
Models of covariances: parmeter covariance_type see Sklearn doc. K-means is almost a GMM
with spherical covariance.
In statistics, the Bayesian information criterion (BIC) is a criterion for model selection among
a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the
likelihood function and it is closely related to the Akaike information criterion (AIC).
X = iris.data
y_iris = iris.target
bic = list()
ks = np.arange(1, 10)
for k in ks:
gmm = GaussianMixture(n_components=k, covariance_type='full')
gmm.fit(X)
bic.append(gmm.bic(X))
k_chosen = ks[np.argmin(bic)]
plt.plot(ks, bic)
plt.xlabel("k")
plt.ylabel("BIC")
Choose k= 2
Sources:
• Hierarchical clustering with Scikit-learn
• Comparing different hierarchical linkage
Hierarchical clustering is an approach to clustering that build hierarchies of clusters in two main
approaches:
• Agglomerative: A bottom-up strategy, where each observation starts in their own cluster,
and pairs of clusters are merged upwards in the hierarchy.
• Divisive: A top-down strategy, where all observations start out in the same cluster, and
then the clusters are split recursively downwards in the hierarchy.
In order to decide which clusters to merge or to split, a measure of dissimilarity between clusters
is introduced. More specific, this comprise a distance measure and a linkage criterion. The
distance measure is just what it sounds like, and the linkage criterion is essentially a function of
the distances between points, for instance the minimum distance between points in two clusters,
the maximum distance between points in two clusters, the average distance between points in
two clusters, etc. One particular linkage criterion, the Ward criterion, will be discussed next.
The Agglomerative clustering use four main linkage strategies:
• Single Linkage: The distance between two clusters is defined as the shortest distance
between any two points in the clusters. This can create elongated, chain-like clusters
organized on manifolds that cannot be summarized by distribution around a center. How-
ever, it is sensitive to noise.
• Complete Linkage: The distance between two clusters is the maximum distance between
any two points in the clusters. This tends to produce compact and well-separated clusters.
• Average Linkage: The distance between two clusters is the average distance between all
pairs of points in the two clusters. This provides a balance between single and complete
linkage.
• Ward’s Linkage: Merges clusters by minimizing the increase in within-cluster variance.
This often results in more evenly sized clusters and is commonly used for hierarchical
clustering.
Application to “non-linear” manifolds: Ward vs. single linkage.
# Plot results
fig, axes = plt.subplots(1, 2)
plt.tight_layout()
plt.show()
iris = datasets.load_iris()
X = iris.data[:, :2] # 'sepal length (cm)''sepal width (cm)'
y_iris = iris.target
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], edgecolors='k',
c=[colors[lab] for lab in ward2.fit_predict(X)])
plt.title("K=2")
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], edgecolors='k',
c=[colors[lab] for lab in ward3.fit_predict(X)])
plt.title("K=3")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], edgecolors='k',
c=[colors[lab] for lab in ward4.fit_predict(X)]) # .astype(np.float))
plt.title("K=4")
6.4.4 Exercises
Perform clustering of the iris dataset based on all variables using Gaussian mixture models. Use
PCA to visualize clusters.
SEVEN
SUPERVISED LEARNING
In statistics and machine learning, overfitting occurs when a statistical model describes random
errors or noise instead of the underlying relationships. Overfitting generally occurs when a
model is excessively complex, such as having too many parameters relative to the number
of observations. A model that has been overfit will generally have poor predictive performance,
as it can exaggerate minor fluctuations in the data.
A learning algorithm is trained using some set of training samples. If the learning algorithm has
the capacity to overfit the training samples the performance on the training sample set will
improve while the performance on unseen test sample set will decline.
The overfitting phenomenon has three main explanations: - excessively complex models, - mul-
ticollinearity, and - high dimensionality.
Multicollinearity
Predictors are highly correlated, meaning that one can be linearly predicted from the others.
In this situation the coefficient estimates of the multiple regression may change erratically in
response to small changes in the model or the data. Multicollinearity does not reduce the
predictive power or reliability of the model as a whole, at least not within the sample data
set; it only affects computations regarding individual predictors. That is, a multiple regression
model with correlated predictors can indicate how well the entire bundle of predictors predicts
the outcome variable, but it may not give valid results about any individual predictor, or about
which predictors are redundant with respect to others. In case of perfect multicollinearity the
predictor matrix is singular and therefore cannot be inverted. Under these circumstances, for a
general linear model y = Xw + 𝜀, the ordinary least-squares estimator, w𝑂𝐿𝑆 = (X𝑇 X)−1 X𝑇 y,
does not exist.
import numpy as np
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
(continues on next page)
269
Statistics and Machine Learning in Python, Release 0.8
An example where correlated predictor may produce an unstable model follows: We want to
predict the business potential (pb) of some companies given their business volume (bv) and the
taxes (tx) they are paying. Here pb ~ 10% of bv. However, taxes = 20% of bv (tax and bv
are highly collinear), therefore there is an infinite number of linear combinations of tax and bv
that lead to the same prediction. Solutions with very large coefficients will produce excessively
large predictions.
Multicollinearity between the predictors: business volumes and tax produces unstable models
X = np.column_stack([bv, tax])
beta_star = np.array([.1, 0]) # true solution
'''
Since tax and bv are correlated, there is an infinite number of linear
(continues on next page)
Model complexity
High-dimensional data refers to datasets with a large number of input features (𝑃 ). In linear
models, each feature corresponds to a parameter, so when the number of features 𝑃 is large
compared to the number of training samples 𝑁 (the “large P, small N” problem), the model
tends to overfit the training data. This phenomenon is part of the curse of dimensionality,
which describes the difficulties that arise when working in high-dimensional spaces.
One of the most critical factors in selecting a machine learning algorithm is the relationship
between 𝑃 and 𝑁 , as it significantly impacts model performance. Below are three key problems
associated with high-dimensionality:
Infinite Solutions and Ill-Conditioned Matrices
In linear models, the covariance matrix X𝑇 X is of size 𝑃 × 𝑃 and has rank min(𝑁, 𝑃 ). When
𝑃 > 𝑁 , the system of equations becomes overparameterized, meaning there are infinitely
many possible solutions that fit the training data. This leads to poor generalization, as the
learned solutions may be highly specific to the dataset. In such cases, the covariance matrix
is singular or ill-conditioned, making it unstable for inversion in methods like ordinary least
squares regression.
Exponential Growth of Sample Requirements
The density of data points in a high-dimensional space decreases exponentially with increasing
𝑃 . Specifically, the effective sampling density of 𝑁 points in a 𝑃 -dimensional space is pro-
portional to 𝑁 1/𝑃 . As a result, the data becomes increasingly sparse as 𝑃 grows, making it
difficult to estimate distributions or learn meaningful patterns. To maintain a constant density,
the number of required samples grows exponentially. For example:
For example, with 𝑁 = 500 and 𝑃 = 10, this distance is approximately 0.52, meaning that
most data points are more than halfway to the boundary. This has severe consequences for
prediction:
• In lower dimensions, models interpolate between data points.
• In high dimensions, models must extrapolate, which is significantly harder and less reli-
able.
This explains why many machine learning algorithms perform poorly in high-dimensional set-
tings and why dimensionality reduction techniques (e.g., PCA, feature selection) are essential.
Conclusion
The curse of dimensionality creates fundamental challenges for machine learning, including
overparameterization, data sparsity, and unreliable predictions. Addressing these issues re-
quires strategies such as dimensionality reduction, regularization, and feature selection to
ensure that models generalize well and remain computationally efficient.
(Source: T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction.* Second Edition, 2009.)*
These techniques add constraints to the model’s parameters to prevent excessive complexity.
• L2 Regularization (Ridge Regression / Weight Decay / Shrinkage)
– Adds a squared penalty: 𝜆 𝑤𝑖2 .
∑︀
– Shrinks weights but does not eliminate them, reducing model sensitivity to noise.
– Common in linear regression, logistic regression, and deep learning.
• L1 Regularization (Lasso Regression)
∑︀
– Adds an absolute penalty: 𝜆 |𝑤𝑖 |.
– Promote sparsity by setting some weights to zero, effectively selecting features.
– Used in high-dimensional datasets to perform feature selection.
• Elastic Net Regularization
𝑤𝑖2
∑︀ ∑︀
– Combines L1 and L2 penalties: 𝜆1 |𝑤𝑖 | + 𝜆2
– Used when dealing with correlated features.
• Feature Selection
– Reduces model complexity by removing redundant or irrelevant features.
– Methods include univariate filter (SelecKBest) or recursive feature elimination (RFE)
and mutual information filtering.
• Unsupervised Dimension Reduction as preprocessing step
– Reduces model complexity by reducing the dimension of the imput data.
– Methods include Linear Dimension reduction or Manifold Learning.
– Unsupervised approaches are generally not efficient, as they tend to overfill the data
before the supervised stage.
• Bayesian Regularization
– Introduces priors over model parameters, effectively acting as L2 regularization.
– Used in Bayesian Neural Networks, Gaussian Processes, and Bayesian Ridge Regres-
sion.
• Dropout
– Randomly disables a fraction of neurons during training to reduce reliance on specific
features.
– Helps improve generalization in fully connected and convolutional networks.
• Batch Normalization
– Normalizes activations across mini-batches, reducing internal covariate shift.
– Acts as an implicit regularizer by smoothing the optimization landscape.
• Early Stopping
Data-Centric Regularization
• Data Augmentation
– Artificially increases the dataset size using transformations (e.g., rotations, scaling,
flipping).
– Particularly useful in image and text processing tasks.
• Adding Noise to Inputs or Weights
– Introduces small random noise to training data or network weights to improve ro-
bustness.
– Common in deep learning and reinforcement learning.
Summary
Linear regression models the output, or target variable 𝑦 ∈ R as a linear combination of the 𝑃 -
dimensional input x ∈ R𝑃 . Let X be the 𝑁 × 𝑃 matrix with each row an input vector (with a 1
in the first position), and similarly let y be the 𝑁 -dimensional vector of outputs in the training
set, the linear model will predict the y given x using the parameter vector, or weight vector
w ∈ R𝑃 according to
𝑦𝑖 = x𝑇𝑖 w + 𝜀𝑖 ,
for all observation 𝑖. For the whole dataset this is equivalent to:
y = Xw + 𝜀,
where 𝜀 ∈ R𝑁 are the residuals (𝜀𝑖 ), or the errors of the prediction. The w is found by
minimizing an objective function, which is the loss function, 𝐿(w), i.e. the error measured on
the data. This error is the sum of squared errors (SSE) loss.
Minimizing the SSE is the Ordinary Least Square OLS regression as objective function. which is
a simple ordinary least squares (OLS) minimization whose analytic solution is:
The solution could also be found using gradient descent using the gradient of the loss:
𝐿(w, X, y) ∑︁
𝜕 =2 x𝑖 (x𝑖 · w − 𝑦𝑖 )
𝜕w
𝑖
Linear regression of Advertising.csv dataset with TV and Radio advertising as input features
and Sales as target. The linear model that minimizes the MSE is a plan (2 input features)
defined as: Sales = 0.05 TV + .19 Radio + 3:
Scikit-learn offers many models for supervised learning, and they all follow the same Applica-
tion Programming Interface (API), namely:
est = Estimator()
est.fit(X, y)
predictions = est.predict(X)
import numpy as np
import pandas as pd
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
(continues on next page)
lr = lm.LinearRegression().fit(X, y)
Regarding linear models, overfitting generally leads to excessively complex solutions (coeffi-
cient vectors), accounting for noise or spurious correlations within predictors. Regularization
aims to alleviate this phenomenon by constraining (biasing or reducing) the capacity of the
learning algorithm in order to promote simple solutions. Regularization penalizes “large” solu-
tions forcing the coefficients to be small, i.e. to shrink them toward zeros.
The objective function 𝐽(w) to minimize with respect to w is composed of a loss function 𝐿(w)
for goodness-of-fit and a penalty term Ω(w) (regularization to avoid overfitting). This is a
trade-off where the respective contribution of the loss and the penalty terms is controlled by
the regularization parameter 𝜆.
Therefore the loss function 𝐿(w) is combined with a penalty function Ω(w) leading to the
general form:
𝐽(w) = 𝐿(w) + 𝜆Ω(w).
The respective contribution of the loss and the penalty is controlled by the regularization
parameter 𝜆.
Popular penalties are:
• Ridge (also called ℓ2 ) penalty: ‖w‖22 . It shrinks coefficients toward 0.
• Lasso (also called ℓ1 ) penalty: ‖w‖1 . It performs feature selection by setting some coeffi-
cients to 0.
• ElasticNet (also called ℓ1 ℓ2 ) penalty: 𝛼 𝜌 ‖w‖1 + (1 − 𝜌) ‖w‖22 . It performs selection of
(︀ )︀
0 1 2 3 4 5 6 7 8 9
True 28.49 0.00 13.17 0.00 48.97 70.44 39.70 0.00 0.00 0.00
lr 28.49 0.00 13.17 0.00 48.97 70.44 39.70 0.00 0.00 -0.00
l2 1.03 0.21 0.93 -0.32 1.82 1.57 2.10 -1.14 -0.84 -1.02
l1 0.00 -0.00 0.00 -0.00 24.40 25.16 25.36 -0.00 -0.00 -0.00
l1l2 0.78 0.00 0.51 -0.00 7.20 5.71 8.95 -1.38 -0.00 -0.40
Ridge regression impose a ℓ2 penalty on the coefficients, i.e. it penalizes with the Euclidean
norm of the coefficients while minimizing SSE. The objective function becomes:
𝑁
∑︁
Ridge(w) = (𝑦𝑖 − x𝑇𝑖 w)2 + 𝜆‖w‖22 (7.5)
𝑖
= ‖y − xw‖22 + 𝜆‖w‖22 . (7.6)
The w that minimises 𝐹𝑅𝑖𝑑𝑔𝑒 (w) can be found by the following derivation:
∇w Ridge(w) = 0 (7.7)
∇w (y − Xw)𝑇 (y − Xw) + 𝜆w𝑇 w = 0
(︀ )︀
(7.8)
∇w (y𝑇 y − 2w𝑇 X𝑇 y + w𝑇 X𝑇 Xw + 𝜆w𝑇 w) = 0
(︀ )︀
(7.9)
−2X𝑇 y + 2X𝑇 Xw + 2𝜆w = 0 (7.10)
𝑇 𝑇
−X y + (X X + 𝜆I)w = 0 (7.11)
(X𝑇 X + 𝜆I)w = X𝑇 y (7.12)
𝑇 −1 𝑇
w = (X X + 𝜆I) X y (7.13)
• The solution adds a positive constant to the diagonal of X𝑇 X before inversion. This makes
the problem nonsingular, even if X𝑇 X is not of full rank, and was the main motivation
behind ridge regression.
• Increasing 𝜆 shrinks the w coefficients toward 0.
• This approach penalizes the objective function by the Euclidian (:math:`ell_2`) norm
of the coefficients such that solutions with large coefficients become unattractive.
The gradient of the loss:
𝐿(w, X, y) ∑︁
𝜕 = 2( x𝑖 (x𝑖 · w − 𝑦𝑖 ) + 𝜆w)
𝜕w
𝑖
Lasso regression penalizes the coefficients by the ℓ1 norm. This constraint will reduce (bias)
the capacity of the learning algorithm. To add such a penalty forces the coefficients to be small,
i.e. it shrinks them toward zero. The objective function to minimize becomes:
𝑁
∑︁
Lasso(w) = (𝑦𝑖 − x𝑇𝑖 w)2 + 𝜆‖w‖1 . (7.14)
𝑖
This penalty forces some coefficients to be exactly zero, providing a feature selection property.
Occam’s razor
Occam’s razor (also written as Ockham’s razor, and lex parsimoniae in Latin, which means
law of parsimony) is a problem solving principle attributed to William of Ockham (1287-1347),
who was an English Franciscan friar and scholastic philosopher and theologian. The principle
can be interpreted as stating that among competing hypotheses, the one with the fewest
assumptions should be selected.
Principle of parsimony
The penalty based on the ℓ1 norm promotes sparsity (scattered, or not dense): it forces many
coefficients to be exactly zero. This also makes the coefficient vector scattered.
The figure bellow illustrates the OLS loss under a constraint acting on the ℓ1 norm of the coef-
ficient vector. I.e., it illustrates the following optimization problem:
minimize ‖y − Xw‖22
w
subject to ‖w‖1 ≤ 1.
Optimization issues
Section to be completed
• No more closed-form solution.
• Convex but not differentiable.
• Requires specific optimization algorithms, such as the fast iterative shrinkage-thresholding
algorithm (FISTA): Amir Beck and Marc Teboulle, A Fast Iterative Shrinkage-Thresholding
Algorithm for Linear Inverse Problems SIAM J. Imaging Sci., 2009.
The ridge penalty shrinks the coefficients toward zero. The figure illustrates: the OLS solution
on the left. The ℓ1 and ℓ2 penalties in the middle pane. The penalized OLS in the right pane.
The right pane shows how the penalties shrink the coefficients toward zero. The black points
are the minimum found in each case, and the white points represents the true solution used to
generate the data.
The Elastic-net estimator combines the ℓ1 and ℓ2 penalties, and results in the problem to
𝑁
∑︁
(𝑦𝑖 − x𝑇𝑖 w)2 + 𝛼 𝜌 ‖w‖1 + (1 − 𝜌) ‖w‖22 ,
(︀ )︀
Enet(w) = (7.15)
𝑖
Rational
• If there are groups of highly correlated variables, Lasso tends to arbitrarily select only
one from each group. These models are difficult to interpret because covariates that are
strongly associated with the outcome are not included in the predictive model. Conversely,
the elastic net encourages a grouping effect, where strongly correlated predictors tend to
be in or out of the model together.
• Studies on real world data and simulation studies show that the elastic net often outper-
forms the lasso, while enjoying a similar sparsity of representation.
R-squared
The goodness of fit of a statistical model describes how well it fits a set of observations. Mea-
sures of goodness of fit typically summarize the discrepancy between observed values and the
values expected under the model in question. We will consider the explained variance also
known as the coefficient of determination, denoted 𝑅2 pronounced R-squared.
The total sum of squares, 𝑆𝑆tot is the sum of the sum of squares explained by the regression,
𝑆𝑆reg , plus the sum of squares of residuals unexplained by the regression, 𝑆𝑆res , also called the
SSE, i.e. such that
Fig. 7: title
The mean of 𝑦 is
1 ∑︁
𝑦¯ = 𝑦𝑖 .
𝑛
𝑖
The total sum of squares is the total squared sum of deviations from the mean of 𝑦, i.e.
∑︁
𝑆𝑆tot = (𝑦𝑖 − 𝑦¯)2
𝑖
The regression sum of squares, also called the explained sum of squares:
∑︁
𝑆𝑆reg = 𝑦𝑖 − 𝑦¯)2 ,
(ˆ
𝑖
where 𝑦ˆ𝑖 = 𝛽𝑥𝑖 + 𝛽0 is the estimated value of salary 𝑦ˆ𝑖 given a value of experience 𝑥𝑖 .
The sum of squares of the residuals (SSE, Sum Squared Error), also called the residual sum of
squares (RSS) is:
∑︁
𝑆𝑆res = (𝑦𝑖 − 𝑦ˆ𝑖 )2 .
𝑖
𝑅2 is the explained sum of squares of errors. It is the variance explain by the regression divided
by the total variance, i.e.
explained SS 𝑆𝑆reg 𝑆𝑆𝑟𝑒𝑠
𝑅2 = = =1− .
total SS 𝑆𝑆𝑡𝑜𝑡 𝑆𝑆𝑡𝑜𝑡
Test
ˆ 2 = 𝑆𝑆res /(𝑛 − 2) be an estimator of the variance of 𝜖. The 2 in the denominator stems
Let 𝜎
from the 2 estimated parameters: intercept and coefficient.
𝑆𝑆res
• Unexplained variance: ^2
𝜎
∼ 𝜒2𝑛−2
𝑆𝑆
• Explained variance: 𝜎^ 2reg ∼ 𝜒21 . The single degree of freedom comes from the difference
between 𝑆𝑆
^2
𝜎
tot
(∼ 𝜒2𝑛−1 ) and 𝑆𝑆
^2
𝜎
res
(∼ 𝜒2𝑛−2 ), i.e. (𝑛 − 1) − (𝑛 − 2) degree of freedom.
The Fisher statistics of the ratio of two variances:
Explained variance 𝑆𝑆reg /1
𝐹 = = ∼ 𝐹 (1, 𝑛 − 2)
Unexplained variance 𝑆𝑆res /(𝑛 − 2)
Using the 𝐹 -distribution, compute the probability of observing a value greater than 𝐹 under
𝐻0 , i.e.: 𝑃 (𝑥 > 𝐹 |𝐻0 ), i.e. the survival function (1 − Cumulative Distribution Function) at 𝑥 of
the given 𝐹 -distribution.
X, y = datasets.make_regression(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=1)
lr = lm.LinearRegression()
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)
r2 = metrics.r2_score(y_test, yhat)
mse = metrics.mean_squared_error(y_test, yhat)
mae = metrics.mean_absolute_error(y_test, yhat)
In pure numpy:
y_mu = np.mean(y_test)
(continues on next page)
r2 = (1 - ss_res / ss_tot)
mse = np.mean(res ** 2)
mae = np.mean(np.abs(res))
[width=15cm]
Given a training set of 𝑁 samples, 𝐷 = {(𝑥1 , 𝑦1 ), . . . , (𝑥𝑁 , 𝑦𝑁 )} , where 𝑥𝑖 is a multidimen-
sional input vector with dimension 𝑃 and class label (target or response).
Multiclass Classification problems can be seen as several binary classification problems 𝑦𝑖 ∈
{0, 1} where the classifier aims to discriminate the sample of the current class (label 1) versus
the samples of other classes (label 0).
Therfore, for each class the classifier seek for a vector of parameters 𝑤 that performs a linear
combination of the input variables, 𝑥𝑇 𝑤. This step performs a projection or a rotation of input
sample into a good discriminative one-dimensional sub-space, that best discriminate sample of
current class vs sample of other classes.
This score (a.k.a decision function) is tranformed, using the nonlinear activation funtion 𝑓 (.), to
a “posterior probabilities” of class 1: 𝑝(𝑦 = 1|𝑥) = 𝑓 (𝑥𝑇 𝑤), where, 𝑝(𝑦 = 1|𝑥) = 1 − 𝑝(𝑦 = 0|𝑥).
The decision surfaces (orthogonal hyperplan to 𝑤) correspond to 𝑓 (𝑥) = constant, so that
𝑥𝑇 𝑤 = constant and hence the decision surfaces are linear functions of 𝑥, even if the function
𝑓 (.) is nonlinear.
A thresholding of the activation (shifted by the bias or intercept) provides the predicted class
label.
The vector of parameters, that defines the discriminative axis, minimizes an objective function
𝐽(𝑤) that is a sum of of loss function 𝐿(𝑤) and some penalties on the weights vector Ω(𝑤).
∑︁
min 𝐽 = 𝐿(𝑦𝑖 , 𝑓 (𝑥𝑖 𝑇 𝑤)) + Ω(𝑤),
𝑤
𝑖
This geometric method does not make any probabilistic assumptions, instead it relies on dis-
tances. It looks for the linear projection of the data points onto a vector, 𝑤, that maximizes
the between/within variance ratio, denoted 𝐹 (𝑤). Under a few assumptions, it will provide the
same results as linear discriminant analysis (LDA), explained below.
Suppose two classes of observations, 𝐶0 and 𝐶1 , have means 𝜇0 and 𝜇1 and the same total
within-class scatter (“covariance”) matrix,
∑︁ ∑︁
𝑆𝑊 = (𝑥𝑖 − 𝜇0 )(𝑥𝑖 − 𝜇0 )𝑇 + (𝑥𝑗 − 𝜇1 )(𝑥𝑗 − 𝜇1 )𝑇 (7.16)
𝑖∈𝐶0 𝑗∈𝐶1
= 𝑋𝑐 𝑇 𝑋𝑐 , (7.17)
where 𝑋0 and 𝑋1 are the (𝑁0 × 𝑃 ) and (𝑁1 × 𝑃 ) matrices of samples of classes 𝐶0 and 𝐶1 .
Let 𝑆𝐵 being the scatter “between-class” matrix, given by
𝑆𝐵 = (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 .
2
𝜎between
𝐹Fisher (𝑤) = 2 (7.18)
𝜎within
(𝑤𝑇 𝜇1 − 𝑤𝑇 𝜇0 )2
= (7.19)
𝑤𝑇 𝑋𝑐𝑇 𝑋𝑐 𝑤
(𝑤𝑇 (𝜇1 − 𝜇0 ))2
= (7.20)
𝑤𝑇 𝑋𝑐𝑇 𝑋𝑐 𝑤
𝑤𝑇 (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 𝑤
= (7.21)
𝑤𝑇 𝑋𝑐𝑇 𝑋𝑐 𝑤
𝑤 𝑇 𝑆𝐵 𝑤
= 𝑇 . (7.22)
𝑤 𝑆𝑊 𝑤
In the two-class case, the maximum separation occurs by a projection on the (𝜇1 − 𝜇0 ) using
the Mahalanobis metric 𝑆𝑊 −1 , so that
𝑤 ∝ 𝑆𝑊 −1 (𝜇1 − 𝜇0 ).
Demonstration
∇𝑤 𝐹Fisher (𝑤) = 0
(︂ 𝑇 )︂
𝑤 𝑆𝐵 𝑤
∇𝑤 =0
𝑤 𝑇 𝑆𝑊 𝑤
(𝑤𝑇 𝑆𝑊 𝑤)(2𝑆𝐵 𝑤) − (𝑤𝑇 𝑆𝐵 𝑤)(2𝑆𝑊 𝑤) = 0
(𝑤𝑇 𝑆𝑊 𝑤)(𝑆𝐵 𝑤) = (𝑤𝑇 𝑆𝐵 𝑤)(𝑆𝑊 𝑤)
𝑤 𝑇 𝑆𝐵 𝑤
𝑆𝐵 𝑤 = (𝑆𝑊 𝑤)
𝑤 𝑇 𝑆𝑊 𝑤
𝑆𝐵 𝑤 = 𝜆(𝑆𝑊 𝑤)
𝑆𝑊 −1 𝑆𝐵 𝑤 = 𝜆𝑤.
Since we do not care about the magnitude of 𝑤, only its direction, we replaced the scalar factor
(𝑤𝑇 𝑆𝐵 𝑤)/(𝑤𝑇 𝑆𝑊 𝑤) by 𝜆.
In the multiple-class case, the solutions 𝑤 are determined by the eigenvectors of 𝑆𝑊 −1 𝑆𝐵 that
correspond to the 𝐾 − 1 largest eigenvalues.
However, in the two-class case (in which 𝑆𝐵 = (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 ) it is easy to show that
𝑤 = 𝑆𝑊 −1 (𝜇1 − 𝜇0 ) is the unique eigenvector of 𝑆𝑊 −1 𝑆𝐵 :
𝑆𝑊 −1 (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 𝑤 = 𝜆𝑤
𝑆𝑊 −1 (𝜇1 − 𝜇0 )(𝜇1 − 𝜇0 )𝑇 𝑆𝑊 −1 (𝜇1 − 𝜇0 ) = 𝜆𝑆𝑊 −1 (𝜇1 − 𝜇0 ),
𝑤 ∝ 𝑆𝑊 −1 (𝜇1 − 𝜇0 ).
import numpy as np
import pandas as pd
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
errors = y_pred_lda != y
print("Nb errors=%i, error rate=%.2f" %
(errors.sum(), errors.sum() / len(y_pred_lda)))
Logistic regression is called a generalized linear models. ie.: it is a linear model with a link
function that maps the output of linear multiple regression to the posterior probability of class
1 𝑝(1|𝑥) using the logistic sigmoid function:
1
𝑝(1|𝑤, 𝑥𝑖 ) =
1 + exp(−𝑤 · 𝑥𝑖 )
x = np.linspace(-6, 6, 100)
plt.plot(x, logistic(x))
plt.grid(True)
plt.title('Logistic (sigmoid)')
Logistic regression is a discriminative model since it focuses only on the posterior probability
of each class 𝑝(𝐶𝑘 |𝑥). It only requires to estimate the 𝑃 weights of the 𝑤 vector. Thus it should
be favoured over LDA with many input features. In small dimension and balanced situations it
would provide similar predictions than LDA.
However imbalanced group sizes cannot be explicitly controlled. It can be managed using a
reweighting of the input samples.
import sklearn.linear_model as lm
logreg = lm.LogisticRegression(penalty=None).fit(X, y)
# This class implements regularized logistic regression.
# C is the Inverse of regularization strength.
# Large value => no regularization.
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)
errors = y_pred_logreg != y
print("Nb errors=%i, error rate=%.2f" %
(errors.sum(), errors.sum() / len(y_pred_logreg)))
print(logreg.coef_.round(2))
Exercise
Explore the Logistic Regression parameters and proposes a solution in cases of highly im-
balanced training dataset 𝑁1 ≫ 𝑁0 when we know that in reality both classes have the same
probability 𝑝(𝐶1 ) = 𝑝(𝐶0 ).
7.3.4 Losses
The Loss function for sample 𝑖 is the negative log of the probability:
{︃
− log(𝑝(1|𝑤, 𝑥𝑖 )) if 𝑦𝑖 = 1
𝐿(𝑤, 𝑥𝑖 , 𝑦𝑖 ) =
− log(1 − 𝑝(1|𝑤, 𝑥𝑖 ) if 𝑦𝑖 = 0
For the whole dataset 𝑋, 𝑦 = {𝑥𝑖 , 𝑦𝑖 } the loss function to minimize 𝐿(𝑤, 𝑋, 𝑦) is the negative
negative log likelihood (nll) that can be simplied using a 0/1 coding of the label in the case of
binary classification:
This is known as the cross-entropy between the true label 𝑦 and the predicted probability 𝑝.
For the logistic regression case, we have:
∑︁
𝐿(𝑤, 𝑋, 𝑦) = {𝑦𝑖 𝑤 · 𝑥𝑖 − log(1 + exp(𝑤 · 𝑥𝑖 ))}
𝑖
𝐿(𝑤, 𝑋, 𝑦) ∑︁
𝜕 = 𝑥𝑖 (𝑦𝑖 − 𝑝(1|𝑤, 𝑥𝑖 ))
𝜕𝑤
𝑖
TODO
The penalties use in regression are also used in classification. The only difference is the loss
function generally the negative log likelihood (cross-entropy) or the hinge loss. We will explore:
Summary
0 1 2 3 4 5 6 7 8 9
lr 0.04 1.14 -0.28 0.57 0.55 -0.03 0.17 0.37 -0.42 0.39
l2 -0.05 0.52 -0.21 0.34 0.26 -0.05 0.14 0.27 -0.25 0.21
l1 0.00 0.31 0.00 0.10 0.00 0.00 0.00 0.26 0.00 0.00
l1l2 -0.01 0.41 -0.15 0.29 0.12 0.00 0.00 0.20 -0.10 0.06
When the matrix 𝑆𝑊 is not full rank or 𝑃 ≫ 𝑁 , the The Fisher most discriminant projection
estimate of the is not unique. This can be solved using a biased version of 𝑆𝑊 :
𝑆𝑊 𝑅𝑖𝑑𝑔𝑒 = 𝑆𝑊 + 𝜆𝐼
where 𝐼 is the 𝑃 × 𝑃 identity matrix. This leads to the regularized (ridge) estimator of the
Fisher’s linear discriminant analysis:
Increasing 𝜆 will:
• Shrinks the coefficients toward zero.
• The covariance will converge toward the diagonal matrix, reducing the contribution of
the pairwise covariances.
The objective function to be minimized is now the combination of the logistic loss (negative
log likelyhood) − log ℒ(𝑤) with a penalty of the L2 norm of the weights vector. In the two-class
case, using the 0/1 coding we obtain:
import sklearn.linear_model as lm
lrl2 = lm.LogisticRegression(penalty='l2', C=.1)
# This class implements regularized logistic regression.
# C is the Inverse of regularization strength. Large value => no regularization.
lrl2.fit(X, y)
y_pred_l2 = lrl2.predict(X)
prob_pred_l2 = lrl2.predict_proba(X)
print("Coef vector:")
print(lrl2.coef_.round(2))
errors = y_pred_l2 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y)))
The objective function to be minimized is now the combination of the logistic loss − log ℒ(𝑤)
with a penalty of the L1 norm of the weights vector. In the two-class case, using the 0/1 coding
we obtain:
import sklearn.linear_model as lm
lrl1 = lm.LogisticRegression(penalty='l1', C=.1, solver='saga') # lambda = 1 / C!
lrl1.fit(X, y)
y_pred_lrl1 = lrl1.predict(X)
errors = y_pred_lrl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_
˓→lrl1)))
print("Coef vector:")
print(lrl1.coef_.round(2))
7.3.8 Linear Support Vector Machine (ℓ2 -regularization with Hinge loss)
Support Vector Machine seek for separating hyperplane with maximum margin to enforce ro-
bustness against noise. Like logistic regression it is a discriminative method that only focuses
of predictions.
Here we present the non separable case of Maximum Margin Classifiers with ±1 coding (ie.:
𝑦𝑖 {−1, +1}). In the next figure the legend aply to samples of “dot” class.
Here we introduced the slack variables: 𝜉𝑖 , with 𝜉𝑖 = 0 for points that are on or inside the
correct margin boundary and 𝜉𝑖 = |𝑦𝑖 − (𝑤 𝑐𝑑𝑜𝑡 · 𝑥𝑖 )| for other points. Thus:
1. If 𝑦𝑖 (𝑤 · 𝑥𝑖 ) ≥ 1 then the point lies outside the margin but on the correct side of the
decision boundary. In this case 𝜉𝑖 = 0. The constraint is thus not active for this point. It
does not contribute to the prediction.
2. If 1 > 𝑦𝑖 (𝑤 · 𝑥𝑖 ) ≥ 0 then the point lies inside the margin and on the correct side of the
decision boundary. In this case 0 < 𝜉𝑖 ≤ 1. The constraint is active for this point. It does
contribute to the prediction as a support vector.
3. If 0 < 𝑦𝑖 (𝑤 · 𝑥𝑖 )) then the point is on the wrong side of the decision boundary (missclassi-
fication). In this case 0 < 𝜉𝑖 > 1. The constraint is active for this point. It does contribute
to the prediction as a support vector.
This loss is called the hinge loss, defined as:
max(0, 1 − 𝑦𝑖 (𝑤 · 𝑥𝑖 ))
So linear SVM is closed to Ridge logistic regression, using the hinge loss instead of the logistic
loss. Both will provide very similar predictions.
svmlin = svm.LinearSVC(C=.1)
# Remark: by default LinearSVC uses squared_hinge as loss
svmlin.fit(X, y)
y_pred_svmlin = svmlin.predict(X)
errors = y_pred_svmlin != y
print("Nb errors=%i, error rate=%.2f" %
(errors.sum(), errors.sum() / len(y_pred_svmlin)))
print("Coef vector:")
print(svmlin.coef_.round(2))
Linear SVM for classification (also called SVM-C or SVC) with l1-regularization
∑︀𝑁
min 𝐹Lasso linear SVM (𝑤) = ||𝑤||1 + 𝐶 𝑖 𝜉𝑖
with ∀𝑖 𝑦𝑖 (𝑤 · 𝑥𝑖 ) ≥ 1 − 𝜉𝑖
svmlinl1.fit(X, y)
y_pred_svmlinl1 = svmlinl1.predict(X)
errors = y_pred_svmlinl1 != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_
˓→svmlinl1)))
print("Coef vector:")
print(svmlinl1.coef_.round(2))
7.3.10 Exercise
Compare predictions of Logistic regression (LR) and their SVM counterparts, ie.: L2 LR vs L2
SVM and L1 LR vs L1 SVM
• Compute the correlation between pairs of weights vectors.
• Compare the predictions of two classifiers using their decision function:
– Give the equation of the decision function for a linear classifier, assuming that their
is no intercept.
– Compute the correlation decision function.
– Plot the pairwise decision function of the classifiers.
• Conclude on the differences between Linear SVM and logistic regression.
The objective function to be minimized is now the combination of the logistic loss log 𝐿(𝑤) or
the hinge loss with combination of L1 and L2 penalties. In the two-class case, using the 0/1
coding we obtain:
# Or saga solver:
# enetloglike = lm.LogisticRegression(penalty='elasticnet',
# C=.1, l1_ratio=0.5, solver='saga')
print("Hinge loss and logistic loss provide almost the same predictions.")
print("Confusion matrix")
metrics.confusion_matrix(enetlog.predict(X), enethinge.predict(X))
Hinge loss and logistic loss provide almost the same predictions.
Confusion matrix
Decision_function log x hinge losses:
metrics.accuracy_score(y_true, y_pred)
# Balanced accuracy
b_acc = recalls.mean()
Some classifier may have found a good discriminative projection 𝑤. However if the threshold
to decide the final predicted class is poorly adjusted, the performances will highlight an high
specificity and a low sensitivity or the contrary.
In this case it is recommended to use the AUC of a ROC analysis which basically provide a mea-
sure of overlap of the two classes when points are projected on the discriminative axis. For more
detail on ROC and AUC see:https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
Learning with discriminative (logistic regression, SVM) methods is generally based on minimiz-
ing the misclassification of training samples, which may be unsuitable for imbalanced datasets
where the recognition might be biased in favor of the most numerous class. This problem
can be addressed with a generative approach, which typically requires more parameters to be
determined leading to reduced performances in high dimension.
Dealing with imbalanced class may be addressed by three main ways (see Japkowicz and
Stephen (2002) for a review), resampling, reweighting and one class learning.
In sampling strategies, either the minority class is oversampled or majority class is undersam-
pled or some combination of the two is deployed. Undersampling (Zhang and Mani, 2003) the
majority class would lead to a poor usage of the left-out samples. Sometime one cannot afford
such strategy since we are also facing a small sample size problem even for the majority class.
Informed oversampling, which goes beyond a trivial duplication of minority class samples, re-
quires the estimation of class conditional distributions in order to generate synthetic samples.
Here generative models are required. An alternative, proposed in (Chawla et al., 2002) generate
samples along the line segments joining any/all of the k minority class nearest neighbors. Such
procedure blindly generalizes the minority area without regard to the majority class, which may
be particularly problematic with high-dimensional and potentially skewed class distribution.
Reweighting, also called cost-sensitive learning, works at an algorithmic level by adjusting
the costs of the various classes to counter the class imbalance. Such reweighting can be im-
plemented within SVM (Chang and Lin, 2001) or logistic regression (Friedman et al., 2010)
classifiers. Most classifiers of Scikit learn offer such reweighting possibilities.
The class_weight parameter can be positioned into the "balanced" mode which uses the values
of 𝑦 to automatically adjust weights inversely proportional to class frequencies in the input data
as 𝑁/(2𝑁𝑘 ).
# dataset
X, y = make_classification(n_samples=500,
n_features=5,
n_informative=2,
n_redundant=0,
n_repeated=0,
n_classes=2,
random_state=1,
shuffle=False)
import numpy as np
from sklearn import metrics
N_test = 50
X, y = make_classification(n_samples=200, random_state=42,
shuffle=False, class_sep=0.80)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=N_test, random_state=42)
lr = lm.LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
(continues on next page)
# acc, N = 0.65, 70
k = int(acc * N_test)
acc_test = stats.binomtest(k=k, n=N_test, p=0.5, alternative='greater')
auc_pval = stats.mannwhitneyu(
proba_pred[y_test == 0], proba_pred[y_test == 1]).pvalue
7.3.16 Exercise
Write a class FisherLinearDiscriminant that implements the Fisher’s linear discriminant anal-
ysis. This class must be compliant with the scikit-learn API by providing two methods: - fit(X,
y) which fits the model and returns the object itself; - predict(X) which returns a vector of the
predicted values. Apply the object on the dataset presented for the LDA.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
Kernel Machine are based kernel methods require only a user-specified kernel function
𝐾(𝑥𝑖 , 𝑥𝑗 ), i.e., a similarity function over pairs of data points (𝑥𝑖 , 𝑥𝑗 ) into kernel (dual) space
on which learning algorithms operate linearly, i.e. every operation on points is a linear combi-
nation of 𝐾(𝑥𝑖 , 𝑥𝑗 ). Outline of the SVM algorithm:
1. Map points 𝑥 into kernel space using a kernel function: 𝑥 → 𝐾(𝑥, .). Learning algo-
rithms operates linearly by dot product into high-kernel space: 𝐾(., 𝑥𝑖 ) · 𝐾(., 𝑥𝑗 ).
• Using the kernel trick (Mercer’s Theorem) replaces dot product in high dimensional
space by a simpler operation such that 𝐾(., 𝑥𝑖 ) · 𝐾(., 𝑥𝑗 ) = 𝐾(𝑥𝑖 , 𝑥𝑗 ).
• Thus we only need to compute a similarity measure 𝐾(𝑥𝑖 , 𝑥𝑗 ) for each pairs of point
and store in a 𝑁 × 𝑁 Gram matrix of.
7.4.2 SVM
2. The learning process consist of estimating the 𝛼𝑖 of the decision function that maximizes
the hinge loss (of 𝑓 (𝑥)) plus some penalty when applied on all training points.
3. Prediction of a new point 𝑥 using the decision function.
(︃ 𝑁 )︃
∑︁
𝑓 (𝑥) = sign 𝛼𝑖 𝑦𝑖 𝐾(𝑥𝑖 , 𝑥) .
𝑖
7.4. Non-Linear Kernel Methods and Support Vector Machines (SVM) 307
Statistics and Machine Learning in Python, Release 0.8
One of the most commonly used kernel is the Radial Basis Function (RBF) Kernel. For a pair
of points 𝑥𝑖 , 𝑥𝑗 the RBF kernel is defined as:
‖𝑥𝑖 − 𝑥𝑗 ‖2
(︂ )︂
𝐾(𝑥𝑖 , 𝑥𝑗 ) = exp − (7.28)
2𝜎 2
= exp −𝛾 ‖𝑥𝑖 − 𝑥𝑗 ‖2
(︀ )︀
(7.29)
Where 𝜎 (or 𝛾) defines the kernel width parameter. Basically, we consider a Gaussian function
centered on each training sample 𝑥𝑖 . it has a ready interpretation as a similarity measure as it
decreases with squared Euclidean distance between the two feature vectors.
Non linear SVM also exists for regression problems.
Dataset
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.5, stratify=y, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Scikit-learn SVC (Support Vector Classification) with probalility function applying a logistic of
the decision_function
np.True_
7.4. Non-Linear Kernel Methods and Support Vector Machines (SVM) 309
Statistics and Machine Learning in Python, Release 0.8
Sources:
• Scikit-learn API
• Scikit-learn doc
Ensemble learning is a powerful machine learning technique that combines multiple models to
achieve better performance than any individual model. By aggregating predictions from diverse
learners, ensemble methods enhance accuracy, reduce variance, and improve generalization.
The main advantages of ensemble learning include:
• Reduced overfitting: By averaging multiple models, ensemble methods mitigate overfit-
ting risks.
• Increased robustness: The diversity of models enhances stability, making the approach
more resistant to noise and biases.
There are three main types of ensemble learning techniques: Bagging, Boosting, and Stack-
ing. Each method follows a unique strategy to combine multiple models and improve overall
performance.
Conclusion
Ensemble learning is a fundamental approach in machine learning that significantly enhances
predictive performance. Bagging helps reduce variance, boosting improves bias, and stacking
leverages multiple models to optimize performance. By carefully selecting and tuning ensemble
techniques, practitioners can build powerful and robust machine learning models suitable for
various real-world applications.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
print(breast_cancer.feature_names)
A tree can be “learned” by splitting the training dataset into subsets based on an features value
test. Each internal node represents a “test” on an feature resulting on the split of the current
sample. At each step the algorithm selects the feature and a cutoff value that maximises a given
metric. Different metrics exist for regression tree (target is continuous) or classification tree
(the target is qualitative). This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is completed when the subset at a node has all the
same value of the target variable, or when splitting no longer adds value to the predictions.
This general principle is implemented by many recursive partitioning tree algorithms.
Decision trees are simple to understand and interpret however they tend to overfit the data.
However decision trees tend to overfit the training set. Leo Breiman propose random forest to
deal with this issue.
A single decision tree is usually overfits the data it is learning from because it learn from only
one pathway of decisions. Predictions from a single decision tree usually don’t make accurate
predictions on new data.
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
y_prob = tree.predict_proba(X_test)[:, 1]
print("bAcc: %.2f, AUC: %.2f " % (
metrics.balanced_accuracy_score(y_true=y_test, y_pred=y_pred),
metrics.roc_auc_score(y_true=y_test, y_score=y_prob)))
Bagging is an ensemble method that aims to reduce variance by training multiple models on
different subsets of the training data. It follows these steps:
1. Generate multiple bootstrap samples (randomly drawn with replacement) from the origi-
nal dataset.
2. Train an independent model (typically a weak learner like a decision tree) on each boot-
strap sample.
3. Aggregate predictions using majority voting (for classification) or averaging (for regres-
sion).
Example: The Random Forest algorithm is a widely used bagging method that constructs
multiple decision trees and combines their predictions.
Key Benefits:
• Reduces variance and improves stability.
• Works well with high-dimensional data.
• Effective for handling noisy datasets.
bagging_tree = BaggingClassifier(DecisionTreeClassifier())
bagging_tree.fit(X_train, y_train)
y_pred = bagging_tree.predict(X_test)
y_prob = bagging_tree.predict_proba(X_test)[:, 1]
print("bAcc: %.2f, AUC: %.2f " % (
metrics.balanced_accuracy_score(y_true=y_test, y_pred=y_pred),
metrics.roc_auc_score(y_true=y_test, y_score=y_prob)))
Random Forest
A random forest is a meta estimator that fits a number of decision tree learners on various
sub-samples of the dataset and use averaging to improve the predictive accuracy and control
over-fitting. Random forest models reduce the risk of overfitting by introducing randomness by:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
y_prob = forest.predict_proba(X_test)[:, 1]
print("bAcc: %.2f, AUC: %.2f " % (
metrics.balanced_accuracy_score(y_true=y_test, y_pred=y_pred),
metrics.roc_auc_score(y_true=y_test, y_score=y_prob)))
Boosting is an ensemble method that focuses on reducing bias by training models sequentially,
where each new model corrects the errors of its predecessors. The process includes:
1. Train an initial weak model on the training data.
2. Assign higher weights to misclassified instances to emphasize difficult cases.
3. Train a new model on the updated dataset, repeating the process iteratively.
4. Combine the predictions of all models using a weighted sum. %% Gradient boosting
~~~~~~~~~~~~~~~~~
Popular boosting algorithms include AdaBoost, Gradient Boosting Machines (GBM), XG-
Boost, and LightGBM.
Key Benefits:
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
subsample=0.5, random_state=0)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
y_prob = gb.predict_proba(X_test)[:, 1]
7.5.5 Stacking
Stacking (or stacked generalization) is a more complex ensemble technique that combines pre-
dictions from multiple base models using a meta-model. The process follows:
1. Train several base models (e.g., decision trees, SVMs, neural networks) on the same
dataset.
2. Collect predictions from all base models and use them as new features.
3. Train a meta-model (often a simple regression or classification model) to learn how to
best combine the base predictions.
Example: Stacking can combine weak and strong learners, such as decision trees, logistic re-
gression, and deep learning models, to create a robust final model.
Key Benefits:
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('svr', make_pipeline(StandardScaler(),
LinearSVC(random_state=42)))]
stacked_trees = StackingClassifier(estimators)
stacked_trees.fit(X_train, y_train)
y_pred = stacked_trees.predict(X_test)
y_prob = stacked_trees.predict_proba(X_test)[:, 1]
print("bAcc: %.2f, AUC: %.2f " % (
metrics.balanced_accuracy_score(y_true=y_test, y_pred=y_pred),
metrics.roc_auc_score(y_true=y_test, y_score=y_prob)))
This lab is inspired by a scikit-learn lab: Faces recognition example using eigenfaces and SVMs
It uses scikit-learan and pytorch models using skorch (slides).
• skorch provides scikit-learn compatible neural network library that wraps PyTorch.
• skorch abstracts away the training loop, making a lot of boilerplate code obsolete. A
simple net.fit(X, y) is enough.
Note that more sophisticated models can be used, see for a overview.
Models:
• Eigenfaces unsupervised exploratory analysis.
• LogisticRegression with L2 regularization (includes model selection with 5CV`_
• SVM-RBF (includes model selection with 5CV.
• MLP using sklearn using sklearn (includes model selection with 5CV)
• MLP using skorch classifier
• Basic Convnet (ResNet18) using skorch.
• Pretrained ResNet18 using skorch.
Pipelines:
import numpy as np
from time import time
import pandas as pd
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
# ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Preprocesing
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# Dataset
from sklearn.datasets import fetch_lfw_people
# Models
from sklearn.decomposition import PCA
import sklearn.manifold as manifold
import sklearn.linear_model as lm
import sklearn.svm as svm
from sklearn.neural_network import MLPClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import GradientBoostingClassifier
# Pytorch Models
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier
import skorch
7.6.1 Utils
# for machine learning we use the 2 data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
single_faces[::5, :, :] = mean_faces
titles = [n for name in target_names for n in [name] * 5]
plot_gallery(single_faces, titles, h, w, n_row=n_classes, n_col=5)
7.6.4 Eigenfaces
Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled dataset): unsupervised
feature extraction / dimensionality reduction
n_components = 150
T-SNE
Plot eigenfaces:
Our goal is to obtain a good balanced accuracy, ie, the macro average (macro avg) of classes’
recalls. In this perspective, the good practices are:
• Scale input features using either StandardScaler() or MinMaxScaler() “It doesn’t harm”.
• Re-balance classes’ contributions class_weight=’balanced’
• Do not include an intercept (fit_intercept=False) in the model. This should reduce the
global accuracy weighted avg. But remember that we decided to maximize the balanced
accuracy.
lrl2_cv = make_pipeline(
preprocessing.StandardScaler(),
# preprocessing.MinMaxScaler(), # Would have done the job either
GridSearchCV(lm.LogisticRegression(max_iter=1000, class_weight='balanced',
(continues on next page)
t0 = time()
lrl2_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
print(lrl2_cv.steps[-1][1].best_params_)
y_pred = lrl2_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
done in 2.080s
Best params found by grid search:
{'C': np.float64(1.0)}
precision recall f1-score support
[[ 17 0 1 0 0 1 0]
[ 2 49 2 4 0 0 2]
[ 4 0 24 1 0 0 1]
[ 4 5 5 109 3 4 3]
[ 0 0 1 0 21 1 4]
[ 0 2 0 3 2 10 1]
[ 0 0 1 2 4 0 29]]
Coefficients
coefs = lrl2_cv.steps[-1][1].best_estimator_.coef_
coefs = coefs.reshape(-1, h, w)
plot_gallery(coefs, target_names, h, w)
Remarks: - RBF generally requires “large” C (>1) - Poly generally requires “small” C (<1)
svm_cv = make_pipeline(
# preprocessing.StandardScaler(),
preprocessing.MinMaxScaler(),
GridSearchCV(svm.SVC(class_weight='balanced'),
{'kernel': ['poly', 'rbf'], 'C': 10. ** np.arange(-2, 3)},
# {'kernel': ['rbf'], 'C': 10. ** np.arange(-1, 4)},
cv=5, n_jobs=5))
t0 = time()
svm_cv.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
(continues on next page)
y_pred = svm_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
done in 7.743s
Best params found by grid search:
{'C': np.float64(0.1), 'kernel': 'poly'}
precision recall f1-score support
mlp_param_grid = {"hidden_layer_sizes":
# Configurations with 1 hidden layer:
[(100, ), (50, ), (25, ), (10, ), (5, ),
# Configurations with 2 hidden layers:
(100, 50, ), (50, 25, ), (25, 10, ), (10, 5, ),
# Configurations with 3 hidden layers:
(100, 50, 25, ), (50, 25, 10, ), (25, 10, 5, )],
"activation": ["relu"], "solver": ["adam"], 'alpha': [0.0001]}
mlp_cv = make_pipeline(
# preprocessing.StandardScaler(),
preprocessing.MinMaxScaler(),
GridSearchCV(estimator=MLPClassifier(random_state=1, max_iter=400),
(continues on next page)
t0 = time()
mlp_cv.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best params found by grid search:")
print(mlp_cv.steps[-1][1].best_params_)
y_pred = mlp_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→sklearn/neural_network/_multilayer_perceptron.py:690: ConvergenceWarning:␣
˓→Stochastic Optimizer: Maximum iterations (400) reached and the optimization hasn
warnings.warn(
done in 134.940s
Best params found by grid search:
{'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'solver':
˓→'adam'}
class SimpleMLPClassifierPytorch(nn.Module):
"""Simple (one hidden layer) MLP Classifier with Pytorch."""
def __init__(self):
super(SimpleMLPClassifierPytorch, self).__init__()
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
t0 = time()
mlp.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = mlp.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
done in 1.502s
precision recall f1-score support
anova_l2lr = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('anova', SelectKBest(f_classif)),
('l2lr', lm.LogisticRegression(max_iter=1000, class_weight='balanced',
fit_intercept=False))
])
t0 = time()
anova_l2lr_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = anova_l2lr_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→numpy/ma/core.py:2881: RuntimeWarning: invalid value encountered in cast
pca_lrl2_cv = make_pipeline(
PCA(n_components=150, svd_solver='randomized', whiten=True),
GridSearchCV(lm.LogisticRegression(max_iter=1000, class_weight='balanced',
fit_intercept=False),
{'C': 10. ** np.arange(-3, 3)},
cv=5, n_jobs=5))
t0 = time()
pca_lrl2_cv.fit(X=X_train, y=y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = pca_lrl2_cv.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
done in 0.326s
Best params found by grid search:
{'C': np.float64(1.0)}
precision recall f1-score support
[[17 0 1 0 0 1 0]
[ 3 47 3 5 0 0 1]
[ 2 1 21 2 0 2 2]
[ 7 6 8 96 7 5 4]
[ 0 0 1 1 19 0 6]
[ 0 2 0 3 2 9 2]
[ 0 2 0 2 4 0 28]]
Note that to simplify, do not use pipeline (scaler + CNN) here. But it would have been simple
to do so, since pytorch is wrapped in skorch object that is compatible with sklearn.
Sources:
• ConvNet on MNIST
• NeuralNetClassifier
class Cnn(nn.Module):
"""Basic ConvNet Conv(1, 32, 64) -> FC(100, 7) -> softmax."""
x = torch.relu(self.fc1_drop(self.fc1(x)))
x = torch.softmax(self.fc2(x), dim=-1)
return x
torch.manual_seed(0)
cnn = NeuralNetClassifier(
Cnn,
max_epochs=100,
lr=0.001,
optimizer=torch.optim.Adam,
device=device,
train_split=skorch.dataset.ValidSplit(cv=5, stratified=True),
verbose=0)
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train).reshape(-1, 1, h, w)
(continues on next page)
t0 = time()
cnn.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = cnn.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
done in 86.976s
precision recall f1-score support
class Resnet18(nn.Module):
"""ResNet 18, pretrained, with one input chanel and 7 outputs."""
# self.model = torchvision.models.resnet18()
self.model = torchvision.models.resnet18(pretrained=True)
# Last layer
num_ftrs = self.model.fc.in_features
self.model.fc = nn.Linear(num_ftrs, n_outputs)
torch.manual_seed(0)
resnet = NeuralNetClassifier(
Resnet18,
# `CrossEntropyLoss` combines `LogSoftmax and `NLLLoss`
criterion=nn.CrossEntropyLoss,
max_epochs=50,
batch_size=128, # default value
optimizer=torch.optim.Adam,
# optimizer=torch.optim.SGD,
optimizer__lr=0.001,
optimizer__betas=(0.9, 0.999),
optimizer__eps=1e-4,
optimizer__weight_decay=0.0001, # L2 regularization
# Shuffle training data on each epoch
# iterator_train__shuffle=True,
train_split=skorch.dataset.ValidSplit(cv=5, stratified=True),
device=device,
verbose=0)
scaler = preprocessing.MinMaxScaler()
X_train_s = scaler.fit_transform(X_train).reshape(-1, 1, h, w)
X_test_s = scaler.transform(X_test).reshape(-1, 1, h, w)
t0 = time()
resnet.fit(X_train_s, y_train)
print("done in %0.3fs" % (time() - t0))
y_pred = resnet.predict(X_test_s)
print(classification_report(y_test, y_pred, target_names=target_names))
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is␣
˓→deprecated since 0.13 and may be removed in the future, please use 'weights'␣
˓→instead.
warnings.warn(
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight␣
˓→enum or `None` for 'weights' are deprecated since 0.13 and may be removed in␣
warnings.warn(msg)
done in 396.518s
precision recall f1-score support
EIGHT
RESAMPLING METHODS
Machine learning algorithms tend to overfit training data. Predictive performances MUST be
evaluated on independant hold-out dataset. A split of into a training test and an independent
test set mandatory. However to set the hyperparameters the dataset is generally splitted into
three sets:
1. Training Set (Fitting the Model and Learning Parameters)
• The training set is used to fit the model by learning its parameters (e.g., weights in a
neural network, coefficients in a regression model).
• The algorithm adjusts its parameters to minimize a chosen loss function (e.g., MSE for
regression, cross-entropy for classification).
335
Statistics and Machine Learning in Python, Release 0.8
• The model learns patterns from this data, but using only the training set risks overfit-
ting—where the model memorizes data instead of generalizing.
• Role: Learn the parameters of the model.
2. Validation Set (Hyperparameter Tuning and Model Selection)
• The validation set is used to fine-tune the model’s hyperparameters (e.g., learning rate,
number of layers, number of clusters).
• Hyperparameters are not directly learned from data but are instead set before training.
• The validation set helps to assess different model configurations, preventing overfitting
by ensuring that the model generalizes beyond the training set.
• If we see high performance on the training set but poor performance on the validation set,
we are likely overfitting.
• The process of choosing the best hyperparameters based on the validation set is called
model selection.
• Role: Tune hyperparameters and select the best model configuration.
• Data Leakage Risk: If we tune hyperparameters too much on the validation set, it essen-
tially becomes part of training, leading to potential overfitting on it.
3. Test Set (Final Independent Evaluation)
• The test set is an independent dataset used to evaluate the final model after training and
hyperparameter tuning.
• This provides an unbiased estimate of how the model will perform on completely new
data.
• The model should never be trained or tuned using the test set to ensure a fair evaluation.
• Performance metrics (e.g., accuracy, F1-score, ROC-AUC) on the test set indicate how well
the model is expected to perform in real-world scenarios.
• Role: Evaluate the final model’s performance on unseen data.
Summary:
• Training set
– Fits model parameters.
– High risk of overfitting if the model is too complex.
• Validation set
– Tunes hyperparameters and selects the best model.
– Risk of of overfitting if tuning too much.
• Test set
– Provides a final evaluation on unseen data.
Split dataset in train/test sets to train and assess the the final model after training and hyper-
parameter tuning.
mod = lm.Ridge(alpha=10)
mod.fit(X_train, y_train)
y_pred_test = mod.predict(X_test)
print("Test R2: %.2f" % metrics.r2_score(y_test, y_pred_test))
CV for regression
Usually the error function ℒ() is the r-squared score. However other function (MAE, MSE) can
be used.
CV with explicit loop:
estimator = lm.Ridge(alpha=10)
Train r2:0.99
Test r2:0.67
Test r2:0.73
Test r2:0.67
With classification problems it is essential to sample folds where each set contains approxi-
mately the same percentage of samples of each target class as the complete set. This is called
stratification. In this case, we will use StratifiedKFold with is a variation of k-fold which returns
stratified folds. As error function we recommend:
• The balanced accuracy
• The ROC-AUC
CV with explicit loop:
cv = StratifiedKFold(n_splits=5)
Test ACC:0.79
Test bACC:0.79
Combine CV and grid search: GridSearchCV perform hyperparameter tuning (model selection)
by systematically searching the best combination of hyperparameters evaluating all possible
combinations (over a grid of possible values) using cross-validation:
1. Define the model: Choose a machine learning model (e.g., SVM, Random Forest).
2. Specify hyperparameters: Create a dictionary of hyperparameters and their possible val-
ues.
3. Perform exhaustive search: GridSearchCV trains the model with every possible combina-
tion of hyperparameters.
4. Cross-validation: For each combination, it uses k-fold cross-validation (default cv=5).
5. Select the best model: The combination with the highest validation performance is chosen.
By default, refit an estimator using the best found parameters on the whole dataset.
# Predict
y_pred_test = lm_cv.predict(X_test)
print("Test bACC: %.2f" % metrics.balanced_accuracy_score(y_test, y_pred_test))
Cross-validation for both model (outer) evaluation and model (inner) selection
),
scores['fit_time'].sum()))
Regression
A permutation test is a type of non-parametric randomization test in which the null distribution
of a test statistic is estimated by randomly permuting the observations.
Permutation tests are highly attractive because they make no assumptions other than that the
observations are independent and identically distributed under the null hypothesis.
1. Compute a observed statistic 𝑡𝑜𝑏𝑠 on the data.
2. Use randomization to compute the distribution of 𝑡 under the null hypothesis: Perform 𝑁
random permutation of the data. For each sample of permuted data, 𝑖 the data compute
the statistic 𝑡𝑖 . This procedure provides the distribution of t under the null hypothesis 𝐻0 :
𝑃 (𝑡|𝐻0 )
3. Compute the p-value = 𝑃 (𝑡 > 𝑡𝑜𝑏𝑠 |𝐻0 ) |{𝑡𝑖 > 𝑡𝑜𝑏𝑠 }|, where 𝑡′𝑖 𝑠𝑖𝑛𝑐𝑙𝑢𝑑𝑒 : 𝑚𝑎𝑡ℎ : ‘𝑡𝑜𝑏𝑠 .
Example Ridge regression
Sample the distributions of r-squared and coefficients of ridge regression under the null hypoth-
esis. Simulated dataset:
orig_all = np.arange(X.shape[0])
for perm_i in range(1, nperm + 1):
model.fit(X, np.random.permutation(y))
y_pred = model.predict(X).ravel()
scores_perm[perm_i, :] = metrics.r2_score(y, y_pred)
coefs_perm[perm_i, :] = model.coef_
Compute p-values corrected for multiple comparisons using FWER max-T (Westfall and Young,
1993) procedure.
Plot distribution of third coefficient under null-hypothesis Coeffitients 0 and 1 are significantly
different from 0.
Paramters
---------
perms: 1d array, statistics under the null hypothesis.
perms[0] is the true statistic .
"""
# Re-weight to obtain distribution
pval = np.sum(perms >= perms[0]) / perms.shape[0]
weights = np.ones(perms.shape[0]) / perms.shape[0]
ax.hist([perms[perms >= perms[0]], perms], histtype='stepfilled',
bins=100, label="p-val<%.3f" % pval,
weights=[weights[perms >= perms[0]], weights])
# , label="observed statistic")
ax.axvline(x=perms[0], color="k", linewidth=2)
ax.set_ylabel(name)
ax.legend()
return ax
n_coef = coefs_perm.shape[1]
fig, axes = plt.subplots(n_coef, 1, figsize=(12, 9))
for i in range(n_coef):
hist_pvalue(coefs_perm[:, i], axes[i], str(i))
Exercise
Given the logistic regression presented above and its validation given a 5 folds CV.
1. Compute the p-value associated with the prediction accuracy measured with 5CV using a
permutation test.
2. Compute the p-value associated with the prediction accuracy using a parametric test.
8.3 Bootstrapping
2. For each sample 𝑖 fit the model and compute the scores.
3. Assess standard errors and confidence intervals of scores using the scores obtained on the
𝐵 resampled dataset. Or, average models predictions.
References:
[Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other
measures of statistical accuracy. Stat Sci 1986;1:54–75](https://projecteuclid.org/download/
pdf_1/euclid.ss/1177013815)
# Bootstrap loop
nboot = 100 # !! Should be at least 1000
scores_names = ["r2"]
scores_boot = np.zeros((nboot, len(scores_names)))
coefs_boot = np.zeros((nboot, X.shape[1]))
orig_all = np.arange(X.shape[0])
for boot_i in range(nboot):
boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True)
boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False)
Xtr, ytr = X[boot_tr, :], y[boot_tr]
Xte, yte = X[boot_te, :], y[boot_te]
model.fit(Xtr, ytr)
y_pred = model.predict(Xte).ravel()
scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred)
coefs_boot[boot_i, :] = model.coef_
coefs_boot = pd.DataFrame(coefs_boot)
coefs_stat = coefs_boot.describe(percentiles=[.975, .5, .025])
print("Coefficients distribution")
print(coefs_stat)
df = pd.DataFrame(coefs_boot)
staked = pd.melt(df, var_name="Variable", value_name="Coef. distribution")
sns.set_theme(style="whitegrid")
ax = sns.violinplot(x="Variable", y="Coef. distribution", data=staked)
_ = ax.axhline(0, ls='--', lw=2, color="black")
Dataset
X, y = datasets.make_classification(
n_samples=20, n_features=5, n_informative=2, random_state=42)
cv = StratifiedKFold(n_splits=5)
test_accs = [metrics.accuracy_score(
y[test], y_test_pred_seq[test]) for train, test in cv.split(X, y)]
# Accuracy
print(np.mean(test_accs), test_accs)
# Coef
coefs_cv = np.array(coefs_seq)
print("Mean of the coef")
print(coefs_cv.mean(axis=0).round(2))
print("Std Err of the coef")
print((coefs_cv.std(axis=0) / np.sqrt(coefs_cv.shape[0])).round(2))
parallel = Parallel(n_jobs=5)
cv_ret = parallel(
delayed(_split_fit_predict)(
clone(estimator), X, y, train, test)
for train, test in cv.split(X, y))
y_test_pred = np.zeros(len(y))
for i, (train, test) in enumerate(cv.split(X, y)):
y_test_pred[test] = y_test_pred_cv[i]
test_accs = [metrics.accuracy_score(
y[test], y_test_pred[test]) for train, test in cv.split(X, y)]
print(np.mean(test_accs), test_accs)
NINE
9.1 Backpropagation
Y = max(XW(1) , 0)W(2)
A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from
x using Euclidean error.
Forward pass with local partial derivatives of output given inputs:
𝑥 → 𝑧 (1) = 𝑥𝑇 𝑤(1) → ℎ(1) = max(𝑧 (1) , 0) → 𝑧 (2) = ℎ(1)𝑇 𝑤(2) → 𝐿(𝑧 (2) , 𝑦) = (𝑧 (2) − 𝑦)2
𝑤(1) ↗ 𝑤(2) ↗
𝜕𝑧 (1) 𝜕ℎ(1) 1 if 𝑧 (1) >0 𝜕𝑧
(2) 𝜕𝐿
=𝑥 = {else 0 = ℎ(1) = 2(𝑧 (2) − 𝑦)
𝜕𝑤(1) 𝜕𝑧 (1) 𝜕𝑤(2) 𝜕𝑧 (2)
𝜕𝑧 (1) 𝜕𝑧 (2)
= 𝑤(1) = 𝑤(2)
𝜕𝑥 𝜕ℎ(1)
Backward pass: compute gradient of the loss given each parameters vectors applying the chain
rule from the loss downstream to the parameters:
For 𝑤(2) :
351
Statistics and Machine Learning in Python, Release 0.8
𝜕𝐿 𝜕𝐿 𝜕𝑧 (2)
= (9.1)
𝜕𝑤(2) 𝜕𝑧 (2) 𝜕𝑤(2)
=2(𝑧 (2) − 𝑦)ℎ(1) (9.2)
For 𝑤(1) :
Given a function 𝑧 = 𝑥‘𝑤 with 𝑧 the output, 𝑥 the input and 𝑤 the coefficients.
Scalar to Scalar: 𝑥 ∈ R, 𝑧 ∈ R, 𝑤 ∈ R. Regular derivative:
𝜕𝑧
=𝑥∈R
𝜕𝑤
If 𝑤 changes by a small amount, how much will 𝑧 change?
Vector to Scalar: 𝑥 ∈ R𝑁 , 𝑧 ∈ R, 𝑤 ∈ R𝑁 . The derivative is the Gradient of partial derivative:
𝜕𝑧 𝑁
𝜕𝑤 ∈ R
⎡ 𝜕𝑧 ⎤
𝜕𝑤1
⎢ .. ⎥
⎢ . ⎥
𝜕𝑧 ⎢ 𝜕𝑧 ⎥
= ∇𝑤 𝑧 = ⎢
⎢ 𝜕𝑤𝑖
⎥ (9.5)
𝜕𝑤 ⎢ ..
⎥
⎥
⎣ . ⎦
𝜕𝑧
𝜕𝑤𝑁
For each element 𝑤𝑖 of 𝑤, if it changes by a small amount then how much will y change?
Vector to Vector: 𝑤 ∈ R𝑁 , 𝑧 ∈ R𝑀 . The derivative is Jacobian of partial derivative:
TO BE COMPLETED
𝜕𝑧
𝜕𝑤 ∈ R𝑁 ×𝑀
Backpropagation summary
import numpy as np
import sklearn.model_selection
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
iris = sns.load_dataset("iris")
#g = sns.pairplot(iris, hue="species")
df = iris[iris.species != "setosa"]
g = sns.pairplot(df, hue="species")
df['species_n'] = iris.species.map({'versicolor':1, 'virginica':2})
# Scale
from sklearn.preprocessing import StandardScaler
scalerx, scalery = StandardScaler(), StandardScaler()
X_iris = scalerx.fit_transform(X_iris)
(continues on next page)
/tmp/ipykernel_311643/2720649438.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
This implementation uses Numpy to manually compute the forward pass, loss, and backward
pass.
W1 = np.random.randn(D_in, H)
W2 = np.random.randn(H, D_out)
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y
z1 = X.dot(W1)
h1 = np.maximum(z1, 0)
Y_pred = h1.dot(W2)
# Update weights
W1 -= learning_rate * grad_w1
W2 -= learning_rate * grad_w2
losses_tr.append(loss)
losses_val.append(loss_val)
if t % 10 == 0:
print(t, loss, loss_val)
(continues on next page)
0 9203.227217413109 1771.0008046349092
10 105.08691753447015 142.69582265532787
20 57.85991994823311 88.78785521501051
30 49.29240133315093 78.82567808021732
40 46.38420044020043 74.75055450694353
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y
z1 = X.mm(W1)
h1 = z1.clamp(min=0)
y_pred = h1.mm(W2)
losses_tr.append(loss)
losses_val.append(loss_val)
if t % 10 == 0:
print(t, loss, loss_val)
0 18356.78125 5551.8330078125
10 123.62841796875 128.57347106933594
20 75.83642578125 80.15101623535156
30 62.85978317260742 69.07522583007812
40 57.26483917236328 65.22080993652344
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y using operations on Tensors; these
# are exactly the same operations we used to compute the forward pass
# using Tensors, but we do not need to keep references to intermediate
# values since we are not implementing the backward pass by hand.
y_pred = X.mm(W1).clamp(min=0).mm(W2)
# Use autograd to compute the backward pass. This call will compute the
# gradient of loss with respect to all Tensors with requires_grad=True.
# After this call w1.grad and w2.grad will be Tensors holding the
# gradient of the loss with respect to w1 and w2 respectively.
loss.backward()
y_pred = X_val.mm(W1).clamp(min=0).mm(W2)
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
0 5587.25634765625 1416.6103515625
10 172.29962158203125 314.83251953125
20 66.61656951904297 208.75909423828125
30 50.082000732421875 187.0765838623047
40 45.71429443359375 180.25369262695312
import torch
X = torch.from_numpy(X)
Y = torch.from_numpy(Y)
X_val = torch.from_numpy(X_val)
Y_val = torch.from_numpy(Y_val)
learning_rate = lr
for t in range(nite):
# Forward pass: compute predicted y by passing x to the model. Module
# objects override the __call__ operator so you can call them like
# functions. When doing so you pass a Tensor of input data to the Module
# and it produces a Tensor of output data.
y_pred = model(X)
# Compute and print loss. We pass Tensors containing the predicted and
# true values of y, and the loss function returns a Tensor containing the
(continues on next page)
# Backward pass: compute gradient of the loss with respect to all the
# learnable parameters of the model. Internally, the parameters of each
# Module are stored in Tensors with requires_grad=True, so this call
# will compute gradients for all learnable parameters in the model.
loss.backward()
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
0 123.43494415283203 138.0412139892578
10 66.410400390625 75.52440643310547
20 50.09678649902344 57.58549118041992
30 44.88142395019531 52.25304412841797
40 42.927696228027344 50.76806640625
This implementation uses the nn package from PyTorch to build the network. Rather than man-
ually updating the weights of the model as we have been doing, we use the optim package to
define an Optimizer that will update the weights for us. The optim package defines many op-
timization algorithms that are commonly used for deep learning, including SGD+momentum,
RMSProp, Adam, etc.
import torch
X = torch.from_numpy(X)
Y = torch.from_numpy(Y)
X_val = torch.from_numpy(X_val)
Y_val = torch.from_numpy(Y_val)
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many
# other optimization algorithm. The first argument to the Adam constructor
# tells the optimizer which Tensors it should update.
learning_rate = lr
(continues on next page)
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because by default, gradients are
# accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Checkout docs of torch.autograd.backward for more details.
optimizer.zero_grad()
with torch.no_grad():
y_pred = model(X_val)
loss_val = loss_fn(y_pred, Y_val)
if t % 10 == 0:
print(t, loss.item(), loss_val.item())
losses_tr.append(loss.item())
losses_val.append(loss_val.item())
0 83.01195526123047 89.04083251953125
10 60.55186462402344 63.37711715698242
20 48.237648010253906 50.89048385620117
30 42.51863479614258 47.64924621582031
40 40.23579788208008 48.51783752441406
Sources:
Sources:
• 3Blue1Brown video: But what is a neural network? | Deep learning chapter 1
• Stanford cs231n: Deep learning
• Pytorch: WWW tutorials
• Pytorch: github tutorials
• Pytorch: github examples
• Pytorch examples
• MNIST/pytorch nextjournal.com/gkoehler/pytorch-mnist
• Pytorch: github/pytorch/examples
• kaggle: MNIST/pytorch
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
# import torchvision
# from torchvision import transforms
# from torchvision import datasets
# from torchvision import models
#
from pathlib import Path
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# device = 'cpu' # Force CPU
print(device)
cpu
𝑓 (𝑥) = 𝜎(𝑥𝑇 𝑤 + 𝑏)
Where
• Input: 𝑥: a vector of dimension (𝑝) (layer 0).
• Parameters: 𝑤: a vector of dimension (𝑝) (layer 1). 𝑏 is the scalar bias.
• Output: 𝑓 (𝑥) a vector of dimension 1.
With multinomial logistic regression we have 𝑘 possible labels to predict. If we consider the
MNIST Handwritten Digit Recognition, the inputs is a 28 × 28 = 784 image and the output is a
vector of 𝑘 = 10 labels or probabilities.
𝑓 (𝑥) = softmax(𝑥𝑇 𝑊 + 𝑏)
where: - 𝑧𝑖 is the 𝑖-th element of the input vector z. - 𝑒 is the base of the natural logarithm. -
The sum in the denominator is over all elements of the input vector.
Softmax Properties
1. Probability Distribution: The output of the softmax function is a probability distribution,
meaning that all the outputs are non-negative and sum to 1.
2. Exponential Function: The use of the exponential function ensures that the outputs are
positive and that larger input values correspond to larger probabilities.
3. Normalization: The softmax function normalizes the input values by dividing by the sum
of the exponentials of all input values, ensuring that the outputs sum to 1
MNIST classfification using multinomial logistic
source: Logistic regression MNIST
Here we fit a multinomial logistic regression with L2 penalty on a subset of the MNIST digits
classification task.
source: scikit-learn.org
Hyperparameters
MNIST Loader
dataloaders, WD = load_mnist_pytorch(
batch_size_train=64, batch_size_test=10000)
os.makedirs(os.path.join(WD, "models"), exist_ok=True)
/home/ed203246/data/pystatml/dl_mnist_pytorch
Datasets shapes: {'train': torch.Size([60000, 28, 28]), 'test': torch.Size([10000,
˓→ 28, 28])}
So one test data batch is a tensor of shape: . This means we have 1000 examples of 28x28 pixels
in grayscale (i.e. no rgb channels, hence the one). We can plot some of them using matplotlib.
show_data_label_prediction(
data=example_data, y_true=example_targets, y_pred=None, shape=(2, 3))
X_train = dataloaders["train"].dataset.data.numpy()
X_train = X_train.reshape((X_train.shape[0], -1))
y_train = dataloaders["train"].dataset.targets.numpy()
X_test = dataloaders["test"].dataset.data.numpy()
X_test = X_test.reshape((X_test.shape[0], -1))
y_test = dataloaders["test"].dataset.targets.numpy()
print(X_train.shape, y_train.shape)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
(continues on next page)
plt.show()
mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
plt.show()
/home/ed203246/git/pystatsml/.pixi/envs/default/lib/python3.12/site-packages/
˓→sklearn/neural_network/_multilayer_perceptron.py:690: ConvergenceWarning:␣
˓→Stochastic Optimizer: Maximum iterations (10) reached and the optimization hasn
warnings.warn(
class TwoLayerMLP(nn.Module):
PyTorch doc: Save and reload PyTorch model:Note “If you only plan to keep the best per-
forming model (according to the acquired validation loss), don’t forget that best_model_state
= model.state_dict() returns a reference to the state and not its copy! You must serialize
best_model_state or use best_model_state = deepcopy(model.state_dict()) otherwise your best
best_model_state will keep getting updated by the subsequent training iterations. As a result, the
final model state will be the state of the overfitted model.”
Save/Load state_dict (Recommended) Save:
import torch
from copy import deepcopy
torch.save(deepcopy(model.state_dict()), PATH)
Load:
print(next(model.parameters()).is_cuda)
torch.save(deepcopy(model.state_dict()),
os.path.join(WD, 'models/mod-%s.pth' % model.__class__.__name__))
False
torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
Total number of parameters = 39760
(continues on next page)
Training complete in 0m 5s
Best val Acc: 91.05%
False
Reload model
TwoLayerMLP(
(linear1): Linear(in_features=784, out_features=50, bias=True)
(linear2): Linear(in_features=50, out_features=10, bias=True)
)
Use the model to make new predictions. Consider the device, ie, load data on device
example_data.to(device) from prediction, then move back to cpu example_data.cpu().
with torch.no_grad():
output = model(example_data).cpu()
example_data = example_data.cpu()
# Softmax predictions
preds = output.argmax(dim=1)
show_data_label_prediction(
data=example_data, y_true=example_targets, y_pred=preds, shape=(3, 4))
Continue training from checkpoints: reload the model and run 10 more epochs
Epoch 0/9
----------
train Loss: 0.3097 Acc: 91.11%
test Loss: 0.2904 Acc: 91.91%
Epoch 2/9
----------
train Loss: 0.2844 Acc: 91.94%
test Loss: 0.2809 Acc: 92.10%
Epoch 4/9
----------
train Loss: 0.2752 Acc: 92.21%
test Loss: 0.2747 Acc: 92.23%
Epoch 6/9
----------
train Loss: 0.2688 Acc: 92.45%
test Loss: 0.2747 Acc: 92.17%
Epoch 8/9
----------
train Loss: 0.2650 Acc: 92.64%
test Loss: 0.2747 Acc: 92.32%
• Define a MultiLayerMLP([D_in, 512, 256, 128, 64, D_out]) class that take the size of
the layers as parameters of the constructor.
• Add some non-linearity with relu acivation function
class MLP(nn.Module):
Epoch 0/9
----------
train Loss: 1.1449 Acc: 64.01%
test Loss: 0.3366 Acc: 89.96%
Epoch 2/9
----------
train Loss: 0.1693 Acc: 95.00%
test Loss: 0.1361 Acc: 96.11%
Epoch 4/9
----------
train Loss: 0.0946 Acc: 97.21%
test Loss: 0.0992 Acc: 96.94%
Epoch 8/9
----------
train Loss: 0.0395 Acc: 98.87%
test Loss: 0.0833 Acc: 97.42%
Reduce the size of the training dataset by considering only 10 minibatche for size16.
train_size = 10 * 16
# Stratified sub-sampling
targets = dataloaders["train"].dataset.targets.numpy()
nclasses = len(set(targets))
train_loader = \
torch.utils.data.DataLoader(dataloaders["train"].dataset,
batch_size=16,
sampler=torch.utils.data.SubsetRandomSampler(indices))
Train size= 160 Train label count= {np.int64(0): np.int64(16), np.int64(1): np.
˓→int64(16), np.int64(2): np.int64(16), np.int64(3): np.int64(16), np.int64(4):␣
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'test': torch.Size([10000,␣
˓→28, 28])}
Epoch 0/99
----------
train Loss: 2.3042 Acc: 10.00%
test Loss: 2.3013 Acc: 9.80%
Epoch 20/99
----------
train Loss: 2.0282 Acc: 30.63%
test Loss: 2.0468 Acc: 24.58%
Epoch 40/99
----------
train Loss: 0.4945 Acc: 88.75%
test Loss: 0.9630 Acc: 66.42%
Epoch 80/99
----------
train Loss: 0.0145 Acc: 100.00%
test Loss: 1.0840 Acc: 74.40%
Training complete in 1m 7s
Best val Acc: 75.32%
Epoch 0/99
----------
train Loss: 2.2763 Acc: 15.00%
test Loss: 2.1595 Acc: 45.78%
Epoch 20/99
----------
train Loss: 0.0010 Acc: 100.00%
test Loss: 1.1346 Acc: 77.06%
Epoch 40/99
----------
(continues on next page)
Epoch 60/99
----------
train Loss: 0.0001 Acc: 100.00%
test Loss: 1.3149 Acc: 77.00%
Epoch 80/99
----------
train Loss: 0.0001 Acc: 100.00%
test Loss: 1.3633 Acc: 77.07%
Training complete in 1m 6s
Best val Acc: 77.54%
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images
per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images.
The test batch contains exactly 1000 randomly-selected images from each class. The training
batches contain the remaining images in random order, but some training batches may contain
more images from one class than another. Between them, the training batches contain exactly
5000 images from each class. The ten classes are: airplane, automobile, bird, cat, deer, dog,
frog, horse, ship, truck
Load CIFAR-10 dataset CIFAR-10 Loader
dataloaders, _ = load_cifar10_pytorch(
batch_size_train=100, batch_size_test=100)
Run MLP Classifier with hidden layers of sizes: 512, 256, 128, and 64:
Epoch 0/19
----------
train Loss: 1.6872 Acc: 39.50%
test Loss: 1.5318 Acc: 45.31%
Epoch 10/19
----------
train Loss: 0.7136 Acc: 74.50%
test Loss: 1.5536 Acc: 54.77%
Training complete in 4m 3s
Best val Acc: 54.84%
TEN
Sources:
• 3Blue1Brown video: Convolutions in Image Processing
• far1din video: Convolutional Neural Networks from Scratch
• What is a Convolutional Neural Network?.
• CNN Stanford cs231n
• Deep learning Stanford cs231n
• Pytorch
– WWW tutorials
– github tutorials
– github examples
• MNIST and pytorch:
383
Statistics and Machine Learning in Python, Release 0.8
– MNIST nextjournal.com/gkoehler/pytorch-mnist
– MNIST github/pytorch/examples
– MNIST kaggle
CNNs are deep learning architectures designed for processing grid-like data such as images.
Inspired by the biological visual cortex, they learn hierarchical feature representations, making
them effective for tasks like image classification, object detection, and segmentation.
Key Principles of CNNs:
• Convolutional Layers are the core building block of a CNN, which applies a convolution
operation to the input, passing the result to the next layer: it perform feature extraction
using learnable filters (kernels), allowing CNNs to detect local patterns such as edges and
textures.
• Activation Functions introduce non-linearity into the model, enabling the network to
learn complex patterns. ReLU (Rectified Linear Unit) is the most commonly used acti-
vation function, improving training speed and mitigating vanishing gradients. Possible
function are Tanh or Sigmoid and most commonly used the ReLu(Rectified Linear Unit
function. ReLu accelerate the training because the derivative of sigmoid becomes very
small in the saturating region and therefore the updates to the weights almost vanish.
This is called vanishing gradient problem..
• Pooling Layers reduces the spatial dimensions (height and width) of the input feature
maps by downsampling the input feature maps summarizing the presence of features in
patches of the feature map. Max pooling and average pooling are the most common
functions.
• Fully Connected Layers flatten extracted features and connects to a classifier, typically a
softmax layer for classification tasks.
• Dropout: reduces the over-fitting by using a Dropout layer after every FC layer. Dropout
layer has a probability,(p), associated with it and is applied at every neuron of the response
map separately. It randomly switches off the activation with the probability p.
• Batch Normalization normalizes the inputs of each layer to have a mean of zero and a
variance of one, which improve network stability. This normalization is performed for
each mini-batch during training.
LeNet-5 (1998)
Fig. 2: LeNet
AlexNet (2012)
Revolutionized deep learning by winning the ImageNet competition. Introduced ReLU acti-
vation, dropout, and GPU acceleration. Featured Convolutional Layers stacked on top of each
other (previously it was common to only have a single CONV layer always immediately followed
by a POOL layer).
VGG (2014)
ResNet (2015)
• Recent architectures: EfficientNet, Vision Transformers (ViTs), MobileNet for edge de-
vices.
• Advanced topics: Transfer learning, object detection (YOLO, Faster R-CNN), segmenta-
tion (U-Net).
• Hands-on implementation: Implement CNNs using TensorFlow/PyTorch for real-world
applications.
%matplotlib inline
import os
import numpy as np
from pathlib import Path
# ML
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
import torchvision.transforms as transforms
from torchvision import models
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = 'cpu' # Force CPU
# print(device)
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
plt.rcParams['figure.figsize'] = (fig_w, fig_h * .5)
%matplotlib inline
LeNet-5
import torch.nn as nn
import torch.nn.functional as F
class LeNet5(nn.Module):
"""
layers: (nb channels in input layer,
nb channels in 1rst conv,
nb channels in 2nd conv,
nb neurons for 1rst FC: TO BE TUNED,
nb neurons for 2nd FC,
nb neurons for 3rd FC,
nb neurons output FC TO BE TUNED)
"""
def __init__(self, layers = (1, 6, 16, 1024, 120, 84, 10), debug=False):
super(LeNet5, self).__init__()
self.layers = layers
self.debug = debug
self.conv1 = nn.Conv2d(layers[0], layers[1], 5, padding=2)
self.conv2 = nn.Conv2d(layers[1], layers[2], 5)
self.fc1 = nn.Linear(layers[3], layers[4])
self.fc2 = nn.Linear(layers[4], layers[5])
self.fc3 = nn.Linear(layers[5], layers[6])
class MiniVGGNet(torch.nn.Module):
def __init__(self, layers=(1, 16, 32, 1024, 120, 84, 10), debug=False):
super(MiniVGGNet, self).__init__()
self.layers = layers
self.debug = debug
# Conv block 1
self.conv11 = nn.Conv2d(in_channels=layers[0],out_channels=layers[1],
kernel_size=3, stride=1, padding=0, bias=True)
self.conv12 = nn.Conv2d(in_channels=layers[1], out_channels=layers[1],
kernel_size=3, stride=1, padding=0, bias=True)
# Conv block 2
self.conv21 = nn.Conv2d(in_channels=layers[1], out_channels=layers[2],
kernel_size=3, stride=1, padding=0, bias=True)
self.conv22 = nn.Conv2d(in_channels=layers[2], out_channels=layers[2],
kernel_size=3, stride=1, padding=1, bias=True)
x = F.relu(self.conv21(x))
x = F.relu(self.conv22(x))
x = F.max_pool2d(x, 2)
if self.debug:
print("### DEBUG: Shape of last convnet=", x.shape[1:],
". FC size=", np.prod(x.shape[1:]))
x = x.view(-1, self.layers[3])
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
ResNet-like Model
# ---------------------------------------------------------------------------- #
# An implementation of https://arxiv.org/pdf/1512.03385.pdf #
# See section 4.2 for the model architecture on CIFAR-10 #
# Some part of the code was referenced from below #
# https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py #
# ---------------------------------------------------------------------------- #
import torch.nn as nn
# 3x3 convolution
def conv3x3(in_channels, out_channels, stride=1):
return nn.Conv2d(in_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
# Residual block
class ResidualBlock(nn.Module):
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(ResidualBlock, self).__init__()
self.conv1 = conv3x3(in_channels, out_channels, stride)
self.bn1 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(out_channels, out_channels)
self.bn2 = nn.BatchNorm2d(out_channels)
self.downsample = downsample
# ResNet
class ResNet(nn.Module):
def __init__(self, block, layers, num_classes=10):
super(ResNet, self).__init__()
self.in_channels = 16
self.conv = conv3x3(3, 16)
self.bn = nn.BatchNorm2d(16)
self.relu = nn.ReLU(inplace=True)
self.layer1 = self.make_layer(block, 16, layers[0])
self.layer2 = self.make_layer(block, 32, layers[1], 2)
(continues on next page)
ResNet9
Sources:
• DAWNBench on cifar10
• ResNet9: train to 94% CIFAR10 accuracy in 100 seconds
MNIST Loader
dataloaders, WD = load_mnist_pytorch(
batch_size_train=64, batch_size_test=10000)
os.makedirs(os.path.join(WD, "models"), exist_ok=True)
/home/ed203246/data/pystatml/dl_mnist_pytorch
---------------------------------------------------------------------------
LeNet
Dry run in debug mode to get the shape of the last convnet layer.
LeNet5(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 5, 5]) . FC size= 400
torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters = 61706
Epoch 0/4
----------
train Loss: 0.8882 Acc: 72.55%
val Loss: 0.1889 Acc: 94.00%
Epoch 2/4
----------
train Loss: 0.0865 Acc: 97.30%
val Loss: 0.0592 Acc: 98.07%
Epoch 4/4
----------
train Loss: 0.0578 Acc: 98.22%
val Loss: 0.0496 Acc: 98.45%
MiniVGGNet
print(model)
_ = model(data_example)
MiniVGGNet(
(conv11): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 5, 5]) . FC size= 800
torch.Size([16, 1, 3, 3])
torch.Size([16])
torch.Size([16, 16, 3, 3])
torch.Size([16])
torch.Size([32, 16, 3, 3])
torch.Size([32])
torch.Size([32, 32, 3, 3])
torch.Size([32])
torch.Size([120, 800])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters = 123502
Epoch 0/4
----------
train Loss: 1.2111 Acc: 58.85%
val Loss: 0.1599 Acc: 94.67%
Epoch 2/4
----------
train Loss: 0.0781 Acc: 97.58%
val Loss: 0.0696 Acc: 97.75%
Epoch 4/4
----------
train Loss: 0.0493 Acc: 98.48%
val Loss: 0.0420 Acc: 98.62%
Training complete in 2m 9s
Best val Acc: 98.62%
Reduce the size of the training dataset by considering only 10 minibatche for size16.
train_size = 10 * 16
# Stratified sub-sampling
targets = train_loader.dataset.targets.numpy()
nclasses = len(set(targets))
Train size= 160 Train label count= {np.int64(0): np.int64(16), np.int64(1): np.
˓→int64(16), np.int64(2): np.int64(16), np.int64(3): np.int64(16), np.int64(4):␣
Batch sizes= [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Datasets shape {'train': torch.Size([60000, 28, 28]), 'val': torch.Size([10000,␣
˓→28, 28])}
LeNet5
Epoch 0/99
----------
train Loss: 2.3072 Acc: 7.50%
val Loss: 2.3001 Acc: 8.89%
Epoch 20/99
----------
train Loss: 0.4810 Acc: 83.75%
val Loss: 0.7552 Acc: 72.66%
Epoch 40/99
----------
train Loss: 0.1285 Acc: 95.62%
val Loss: 0.6663 Acc: 81.72%
Epoch 60/99
----------
train Loss: 0.0065 Acc: 100.00%
val Loss: 0.6982 Acc: 84.26%
Epoch 80/99
----------
train Loss: 0.0032 Acc: 100.00%
val Loss: 0.7571 Acc: 84.26%
MiniVGGNet
Epoch 0/99
----------
train Loss: 2.3048 Acc: 10.00%
val Loss: 2.3026 Acc: 10.28%
Epoch 20/99
----------
train Loss: 2.2865 Acc: 26.25%
val Loss: 2.2861 Acc: 23.22%
Epoch 40/99
----------
train Loss: 0.3847 Acc: 85.00%
val Loss: 0.8042 Acc: 75.76%
Epoch 60/99
----------
train Loss: 0.0047 Acc: 100.00%
val Loss: 0.8659 Acc: 83.57%
Epoch 80/99
----------
train Loss: 0.0013 Acc: 100.00%
val Loss: 1.0183 Acc: 83.39%
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images
per class.
Source Yunjey Choi Github pytorch tutorial
Load CIFAR-10 dataset CIFAR-10 Loader
dataloaders, _ = load_cifar10_pytorch(
batch_size_train=100, batch_size_test=100)
LeNet
LeNet5(
(conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([16, 6, 6]) . FC size= 576
torch.Size([6, 3, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 576])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
Total number of parameters = 83126
Epoch 0/24
----------
train Loss: 2.3037 Acc: 10.06%
val Loss: 2.3032 Acc: 10.05%
Epoch 5/24
----------
train Loss: 2.3005 Acc: 10.72%
val Loss: 2.2998 Acc: 10.61%
Epoch 10/24
----------
train Loss: 2.2931 Acc: 11.90%
val Loss: 2.2903 Acc: 11.27%
Epoch 15/24
----------
train Loss: 2.2355 Acc: 16.46%
(continues on next page)
Epoch 20/24
----------
train Loss: 2.1804 Acc: 19.07%
val Loss: 2.1579 Acc: 20.26%
Epoch 0/24
----------
train Loss: 2.1798 Acc: 17.53%
val Loss: 1.9141 Acc: 31.27%
Epoch 5/24
----------
train Loss: 1.3804 Acc: 49.93%
val Loss: 1.3098 Acc: 53.23%
Epoch 10/24
----------
train Loss: 1.2019 Acc: 56.79%
val Loss: 1.0886 Acc: 60.91%
Epoch 20/24
----------
train Loss: 1.0569 Acc: 62.31%
val Loss: 0.9942 Acc: 65.55%
Epoch 0/24
----------
train Loss: 1.8866 Acc: 29.71%
val Loss: 1.6111 Acc: 40.21%
Epoch 5/24
----------
train Loss: 1.3877 Acc: 49.62%
val Loss: 1.3016 Acc: 53.23%
Epoch 10/24
----------
(continues on next page)
Epoch 15/24
----------
train Loss: 1.1399 Acc: 59.28%
val Loss: 1.0712 Acc: 61.84%
Epoch 20/24
----------
train Loss: 1.0806 Acc: 61.62%
val Loss: 1.0334 Acc: 62.69%
MiniVGGNet
MiniVGGNet(
(conv11): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))
(conv12): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv21): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv22): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=1, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
### DEBUG: Shape of last convnet= torch.Size([32, 6, 6]) . FC size= 1152
Epoch 0/24
----------
train Loss: 2.2581 Acc: 13.96%
val Loss: 2.0322 Acc: 25.49%
Epoch 5/24
----------
train Loss: 1.4107 Acc: 48.84%
val Loss: 1.3065 Acc: 52.92%
Epoch 10/24
----------
train Loss: 1.0621 Acc: 62.12%
val Loss: 1.0013 Acc: 64.64%
Epoch 15/24
----------
train Loss: 0.8828 Acc: 68.70%
val Loss: 0.8078 Acc: 72.08%
Epoch 20/24
----------
train Loss: 0.7830 Acc: 72.52%
val Loss: 0.7273 Acc: 74.83%
Adam
Epoch 0/24
----------
train Loss: 1.8556 Acc: 30.40%
val Loss: 1.5847 Acc: 40.66%
Epoch 5/24
----------
train Loss: 1.2417 Acc: 55.39%
val Loss: 1.0908 Acc: 61.45%
Epoch 10/24
----------
train Loss: 1.0203 Acc: 63.66%
val Loss: 0.9503 Acc: 66.19%
Epoch 15/24
----------
train Loss: 0.9051 Acc: 67.98%
val Loss: 0.8536 Acc: 70.10%
Epoch 20/24
----------
train Loss: 0.8273 Acc: 70.74%
val Loss: 0.7942 Acc: 72.55%
ResNet
Epoch 0/24
----------
train Loss: 1.4107 Acc: 48.21%
val Loss: 1.2645 Acc: 54.80%
Epoch 5/24
----------
train Loss: 0.6440 Acc: 77.60%
val Loss: 0.8178 Acc: 72.40%
Epoch 10/24
----------
train Loss: 0.4914 Acc: 82.89%
val Loss: 0.6432 Acc: 78.16%
Epoch 15/24
----------
train Loss: 0.4024 Acc: 86.27%
val Loss: 0.5026 Acc: 83.43%
Epoch 20/24
----------
train Loss: 0.3496 Acc: 87.86%
(continues on next page)
Below is an example of how to implement image segmentation using the U-Net architecture
with PyTorch on a real dataset. We will use the Oxford-IIIT Pet Dataset for this example.
Step1: Load the Dataset
We will use the Oxford-IIIT Pet Dataset, which can be downloaded from here. For simplicity, we
will assume the dataset is already downloaded and structured as follows:
Step 2: Define the U-Net Model
Here is the implementation of the U-Net model in PyTorch:
import torch
import torch.nn as nn
class UNet(nn.Module):
def __init__(self, in_channels, out_channels):
super(UNet, self).__init__()
bottleneck = self.bottleneck(self.pool(enc4))
dec4 = self.upconv4(bottleneck)
(continues on next page)
return self.conv_final(dec1)
# Directory
DIR = os.path.join(Path.home(), "data", "pystatml", "dl_Oxford-IIITPet")
# <Directory>/images: input images
# <Directory>/annotations: corresponding masks
class PetDataset(Dataset):
def __init__(self, image_dir, mask_dir, transform=None):
self.image_dir = image_dir
self.mask_dir = mask_dir
self.transform = transform
self.image_filenames = os.listdir(image_dir)
def __len__(self):
return len(self.image_filenames)
if self.transform:
image = self.transform(image)
(continues on next page)
transform = transforms.Compose([
transforms.Resize((128, 128)),
transforms.ToTensor()
])
torch.save(model.state_dict(), model_filename)
model_ = UNet(in_channels=3, out_channels=1)
model_.load_state_dict(torch.load(model_filename, weights_only=True))
_ = model_.eval()
images = images.cpu().numpy()
masks = masks.cpu().numpy()
plt.tight_layout()
plt.show()
visualize_results(model, dataloader)
• The model
• Train with predefined dataset and dataloader
Sources Transfer Learning cs231n @ Stanford: In practice, very few people train an entire Con-
volutional Network from scratch (with random initialization), because it is relatively rare to have
a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset
(e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the Con-
vNet either as an initialization or a fixed feature extractor for the task of interest.
These two major transfer learning scenarios look as follows:
1. CNN as fixed feature extractor:
• Take a CNN pretrained on ImageNet
• Remove the last fully-connected layer (this layer’s outputs are the 1000 class scores
for a different task like ImageNet).
• Treat the rest of the CNN as a fixed feature extractor for the new dataset.
• This last fully connected layer is replaced with a new one with random weights and
only this layer is trained:
• Freeze the weights for all of the network except that of the final fully connected layer.
2. Fine-tuning all the layers of the CNN:
• Same procedure, but do not freeze the weights of the CNN, by continuing the back-
propagation on the new task.
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Plot parameters
plt.style.use('seaborn-v0_8-whitegrid')
fig_w, fig_h = plt.rcParams.get('figure.figsize')
(continues on next page)
# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# device = 'cpu' # Force CPU
print(device)
cpu
dataloaders, _ = load_cifar10_pytorch(
batch_size_train=100, batch_size_test=100)
model_ft = resnet18(weights=ResNet18_Weights.DEFAULT)
num_ftrs = model_ft.fc.in_features
(continues on next page)
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['test'], '--r')
Epoch 0/4
----------
train Loss: 1.1057 Acc: 61.23%
test Loss: 0.7816 Acc: 72.62%
Adam optimizer
model_ft = resnet18(weights=ResNet18_Weights.DEFAULT)
# model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
# Here the size of each output sample is set to 10.
model_ft.fc = nn.Linear(num_ftrs, D_out)
criterion = nn.CrossEntropyLoss()
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['test'], '--r')
Epoch 0/4
----------
train Loss: 0.9112 Acc: 69.17%
test Loss: 0.7230 Acc: 75.18%
Freeze all the network except the final layer: requires_grad == False to freeze the parameters
so that the gradients are not computed in backward().
model_conv = resnet18(weights=ResNet18_Weights.DEFAULT)
# model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
param.requires_grad = False
model_conv = model_conv.to(device)
criterion = nn.CrossEntropyLoss()
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['test'], '--r')
Epoch 0/4
----------
train Loss: 1.8177 Acc: 36.64%
test Loss: 1.6591 Acc: 42.88%
Training complete in 8m 6s
Best val Acc: 46.44%
Adam optimizer
model_conv = resnet18(weights=ResNet18_Weights.DEFAULT)
# model_conv = torchvision.models.resnet18(pretrained=True)
for param in model_conv.parameters():
param.requires_grad = False
model_conv = model_conv.to(device)
criterion = nn.CrossEntropyLoss()
epochs = np.arange(len(losses['train']))
_ = plt.plot(epochs, losses['train'], '-b', epochs, losses['test'], '--r')
Epoch 0/4
----------
train Loss: 1.7337 Acc: 39.62%
test Loss: 1.6193 Acc: 44.09%
ELEVEN
Bag-of-Words Model from Wikipedia: The bag-of-words model is a model of text which uses a
representation of text that is based on an unordered collection (or “bag”) of words. [. . . ] It
disregards word order [. . . ] but captures multiplicity.
11.1.1 Introduction
419
Statistics and Machine Learning in Python, Release 0.8
# Example usage
# import regex
import re
# Remove numbers
no_number_string = re.sub(r'\d+','', lower_string)
# Tokenization
print(no_wspace_string.split())
import nltk
import re
import string
import unicodedata
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
def strip_accents(text):
# Normalize the text to NFKD form and strip accents
text = unicodedata.normalize('NFKD', text)
text = ''.join([c for c in text if not unicodedata.combining(c)])
return text
# Remove URLs
text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
# Remove numbers
text = re.sub(r'\d+', '', text)
# Remove punctuation
# string.punctuation provides a string of all punctuation characters.
# str.maketrans() creates a translation table that maps each punctuation
# character to None.
# text.translate(translator) uses this translation table to remove all
# punctuation characters from the input string.
text = text.translate(str.maketrans('', '', string.punctuation))
# Strip accents
text = strip_accents(text)
return words
standardize_tokenize(text)
Stemming and lemmatization are techniques used to reduce words to their base or root form,
which helps in standardizing text and improving the performance of various NLP tasks.
Stemming is the process of reducing a word to its base or root form, often by removing suffixes
or prefixes. The resulting stem may not be a valid word but is intended to capture the word’s
core meaning. Stemming algorithms, such as the Porter Stemmer or Snowball Stemmer, use
heuristic rules to chop off common morphological endings from words.
Example: The words “running,” “runner,” and “ran” might all be reduced to “run.”
# standardize_tokenize(text, stemming=True)
standardize_tokenize_stemming(text)
Lemmatization is the process of reducing a word to its lemma, which is its canonical or dic-
tionary form. Unlike stemming, lemmatization considers the word’s part of speech and uses a
more comprehensive approach to ensure that the transformed word is a valid word in the lan-
guage. Lemmatization typically requires more linguistic knowledge and is implemented using
libraries like WordNet.
Example: The words “running” and “ran” would both be reduced to “run,” while “better” would
be reduced to “good.”
# standardize_tokenize(text, lemmatization=True)
standardize_tokenize_lemmatization(text)
While both stemming and lemmatization aim to reduce words to a common form, lemmatiza-
tion is generally more accurate and produces words that are meaningful in the context of the
language. However, stemming is faster and simpler to implement. The choice between the two
depends on the specific requirements and constraints of the NLP task at hand.
analyzer(text)
CountVectorizer:” Convert a collection of text documents to a matrix of token counts. Note that
``CountVectorizer`` preforms the standardization and the tokenization.”
It creates one feature (column) for each tokens (words) in the corpus, and returns one line per
sentence, counting the occurence of each tokens.
corpus = [
'This is the first document. This DOCUMENT is in english.',
'in French, some letters have accents, like é.',
'Is this document in French?',
]
Word n-grams are contiguous sequences of ‘n’ words from a given text. They are used to
capture the context and structure of language by considering the relationships between words
within these sequences. The value of ‘n’ determines the length of the word sequence:
• Unigram (1-gram): A single word (e.g., “natural”).
• Bigram (2-gram): A sequence of two words (e.g., “natural language”).
• Trigram (3-gram): A sequence of three words (e.g., “natural language processing”).
𝐼𝐷𝐹 (𝑡) ≈ 1 if 𝑡 appears in all documents, while 𝐼𝐷𝐹 (𝑡) ≈ 𝑁 if 𝑡 is a rare meaningfull word
that appears in only one document.
Finally:
TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF (Term Frequency-Inverse Document
Frequency)
import numpy as np
import pandas as pd
# Plot
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud
# ML
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
data = pd.read_csv('../datasets/FinancialSentimentAnalysis.csv')
Target variable
y = data['Sentiment']
y.value_counts(), y.value_counts(normalize=True).round(2)
text = 'Tesla to recall 2,700 Model X SUVs over seat issue https://t.co/
˓→OdPraN59Xq $TSLA https://t.co/xvn4blIwpy https://t.co/ThfvWTnRPs'
tokenizer_sklearn = vectorizer.build_analyzer()
print(" ".join(tokenizer_sklearn(text)))
print("Shape: ", CountVectorizer(tokenizer=tokenizer_sklearn).fit_transform(data[
˓→'Sentence']).shape)
print(" ".join(standardize_tokenize(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize).fit_
˓→transform(data['Sentence']).shape)
print(" ".join(standardize_tokenize_stemming(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_stemming).fit_
˓→transform(data['Sentence']).shape)
print(" ".join(standardize_tokenize_lemmatization(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_lemmatization).
˓→fit_transform(data['Sentence']).shape)
print(" ".join(standardize_tokenize_stemming_lemmatization(text)))
print("Shape: ", CountVectorizer(tokenizer=standardize_tokenize_stemming_
˓→lemmatization).fit_transform(data['Sentence']).shape)
X = vectorizer.fit_transform(data['Sentence'])
# print("Tokens:", vectorizer.get_feature_names_out())
print("Nb of tokens:", len(vectorizer.get_feature_names_out()))
print("Dimension of input data", X.shape)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(metrics.balanced_accuracy_score(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred, normalize='true')
cm_ = metrics.ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cm_.plot()
plt.show()
df.to_excel("/tmp/test.xlsx")
Positive sentences
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(sentence_positive))
plt.imshow(wc)
Negative sentences
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(sentence_negative))
plt.imshow(wc)
# utilities
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
data = df[['text','target']]
data['target'] = data['target'].replace(4,1)
print(data['target'].unique())
data_pos = data[data['target'] == 1]
data_neg = data[data['target'] == 0]
data_pos = data_pos.iloc[:20000]
data_neg = data_neg.iloc[:20000]
dataset = pd.concat([data_pos, data_neg])
def standardize_stemming_lemmatization(text):
out = " ".join(standardize_tokenize_stemming_lemmatization(text))
return out
rm = dataset['text_stdz'].isnull() | (dataset['text_stdz'].str.len() == 0)
X, y = dataset.text_stdz, dataset.target
# Separating the 95% data for training data and 5% for testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_
˓→state=26105111)
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
def model_Evaluate(model):
# Predict values for Test dataset
y_pred = model.predict(X_test)
# Print the evaluation metrics for the dataset.
print(classification_report(y_test, y_pred))
# Compute and plot the Confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
categories = ['Negative','Positive']
group_names = ['True Neg','False Pos', 'False Neg','True Pos']
group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten()␣
˓→/ np.sum(cf_matrix)]
BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()
TWELVE
• genindex
• modindex
• search
433