
Machine Learning Slides

draft - under construction

Jorge S. Marques

July 22, 2022

Jorge S. Marques, IST, 2017 1/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 2/279


What is machine learning?

Many engineering problems can be solved by using models that depend on a small number of variables.

Examples:
I motion of a rocket → Newton's law: mẍ(t) = F(t)
I electromagnetic waves → Maxwell equations

... but other problems are more complex and cannot be tackled with
closed form expressions.

Jorge S. Marques, IST, 2017 3/279


Hospital problem
Suppose a patient enters a hospital and we wish to predict whether he/she is going to live or die.

There is no general principle that can be used to solve this problem.

Maybe we have a data set of previous examples, including information about the patient status (medical tests / symptoms) and the outcome (live/die).

T1 T2 . . . Tp y

Ti : ith medical test / symptom
y : outcome

How can we use this information to predict the outcome for a new
patient?
Jorge S. Marques, IST, 2017 4/279
What is Machine Learning?

”the field of study that gives computers the ability to learn without being
explicitly programmed.” (Arthur Samuel, 1959)

Arthur Samuel was a pioneer in the area of Machine Learning.

”A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience.” (Tom Mitchell, 1998)

Tom Mitchell is a professor of Computer Science at CMU.

Jorge S. Marques, IST, 2017 5/279


Data growth on the internet

The Economist
Jorge S. Marques, IST, 2017 6/279
Applications

I prediction
I time series analysis
I speech recognition - conversion of the speech signal into text
I machine translation
I detection of failures
I image denoising
I human activity recognition
I medical image analysis - e.g., cancer detection in images
I robot navigation
I self driving car

Some of these are amongst the most difficult problems in engineering.

Jorge S. Marques, IST, 2017 7/279


Amazing progress in image recognition

Xuedong Huang - Deep Learning and Intelligent Applications

Current machine learning methods perform better than humans in this task.

Jorge S. Marques, IST, 2017 8/279


Visual recognition - AlexNet (2012)

Alexnet 2012 - Krizhevsky, Sutskever, Hinton

Jorge S. Marques, IST, 2017 9/279
Visual recognition - AlexNet (2012)

Alexnet 2012 - Krizhevsky, Sutskever, Hinton

Jorge S. Marques, IST, 2017 10/279


Image description (2015)

Karpathy, Fei-Fei - CVPR 2015

Jorge S. Marques, IST, 2017 11/279


Course overview

Structure: lectures (4h/week) + lab (1.5h/week) + problem sessions (1.5h/week).

Lab: students, organized in groups of 2, should carry out a project and write a 10-page report. The project includes 2 parts: a regression problem and a classification problem. Lab enrolment is done through fenix in the first week.

Programming: Python. An introduction is provided in the 1st week (problem sessions).

Grading: exam (50%) + Lab (50%).

Jorge S. Marques, IST, 2017 12/279


Learning problems
There are several learning problems. The major categories are:

I Supervised learning - the computer receives a set of inputs and desired outputs and aims to find the map between them.

I Unsupervised learning - the computer receives a set of inputs but no desired outputs. The goal is to find the structure of data (probability distribution, groups).

I Reinforcement learning - aims to learn the behavior of software agents or robots based on feedback from the environment.

This course is focused on the first learning category.

Jorge S. Marques, IST, 2017 13/279


Example 1 (supervised learning)
Suppose we want to predict the price of a flat in Lisbon (in K euros), knowing its area (in m²). Fortunately, we know some examples.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: scatter plot of price vs. area)

The area is known as a feature and the price is the outcome.

Question: how can we predict the price of a flat?

Jorge S. Marques, IST, 2017 14/279


Example 2 (supervised learning)
A fishing boat has a sonar system that measures the length and volume of each fish (features).

Given the table with the length and volume of tuna and swordfish (class),
we wish to design a system that predicts the class.

length volume class
0.36  0.67  tuna
0.82  0.56  tuna
0.46  0.67  sword
0.40  0.30  sword
0.60  0.80  tuna
0.61  0.47  tuna
0.21  0.41  sword

(figure: scatter plot of the two features, colored by class)

Question: how can we predict the type of fish?


Jorge S. Marques, IST, 2017 15/279
Key concepts

Identify the key concepts in previous examples:

I features

I outcome

I predictor

Jorge S. Marques, IST, 2017 16/279


Problem formulation: supervised learning

Given an input variable x ∈ R^p (vector of features), we wish to predict an output variable y (outcome), i.e., we wish to find a map between the input space and the output space, assuming we know a set of input-output pairs.

This operation is known as model learning.

Problem: Given a set of examples (training set)

T = {(x^(i), y^(i)), i = 1, . . . , n},   x^(i) ∈ R^p,

we wish to estimate a function (predictor)

ŷ = f(x)

such that ŷ is, in some sense, close to y.

Jorge S. Marques, IST, 2017 17/279


Regression vs classification
If the output y is a scalar or a vector (y ∈ R or R^p), the problem is known as a regression problem.

(figure: scatter plot of price vs. area)

If the output y is a label (categorical variable)

y ∈ Ω,   Ω = {ω_0, . . . , ω_{K−1}},

the problem is known as a classification problem.

(figure: 2D scatter plot; each color represents a different class)

Jorge S. Marques, IST, 2017 18/279


System architecture

The design of a machine learning system comprises a training phase, to learn a model, and a testing phase, to predict new data and eventually assess the model performance.

This diagram does not consider the choice of features and their extraction, e.g., if we are dealing with an image or speech analysis problem, what features do we extract from the signal? This issue is application dependent and will not be considered.
Jorge S. Marques, IST, 2017 19/279
Main questions

The block diagram suggests three main questions:

I what class of functions should we consider?

I how do we fit the function f to the training data, i.e., how do we select the function?

I how do we evaluate the predictor?

These questions have multiple answers, which we will discuss throughout this course.

Jorge S. Marques, IST, 2017 20/279


Data sets

Data sets are important tools to train and evaluate machine learning systems. They allow us to compare different techniques and they often foster the development of new methods.

There are many sites with data sets. One example is:
https://archive.ics.uci.edu/ml/datasets.html

Jorge S. Marques, IST, 2017 21/279


One of the oldest: Fisher Iris flower data set
setosa versicolor virginica

Wikimedia

Sepal length  Sepal width  Petal length  Petal width  Species
5.1  3.5  1.4  0.2  I. setosa
4.9  3.0  1.4  0.2  I. setosa
4.7  3.2  1.3  0.2  I. setosa
4.6  3.1  1.5  0.2  I. setosa
5.0  3.6  1.4  0.3  I. setosa
...  ...  ...  ...  ...
7.7  2.6  6.9  2.3  I. virginica
7.9  3.8  6.4  2.0  I. virginica

(150 examples)
Jorge S. Marques, IST, 2017 22/279
One of the oldest: Fisher Iris flower data set

(figures: scatter plots of feature 1 vs 2 and feature 3 vs 4)

Species: setosa (red), versicolor (green), virginica (blue)

Please note that the scale is not the same on both axes.

Jorge S. Marques, IST, 2017 23/279


ImageNet 2012
URL: www.image-net.org
Data set: 10 million images
Classes: 1000+

Jorge S. Marques, IST, 2017 24/279


Nearest neighbor method
Suppose we wish to predict a variable y knowing an input vector x ∈ Rp .

Suppose we also know a collection of training examples (training set)

T = {(x^(i), y^(i)), i = 1, . . . , n}.

A simple strategy to predict y for new values of x consists of finding the training pattern x^(i) nearest to x and approximating y by y^(i).

Let (x_(1), y_(1)), . . . , (x_(n), y_(n)) be a reordering of the training set such that

||x_(1) − x|| ≤ ||x_(2) − x|| ≤ · · · ≤ ||x_(n) − x||.

The nearest neighbor (NN) method assigns to x the outcome of its nearest neighbor:

f(x) = y_(1)

This is valid for both classification and regression problems.

Jorge S. Marques, IST, 2017 25/279


k nearest neighbor

The NN method can be extended to take into account not one but k
nearest neighbors of x.

In classification problems, the predicted class is chosen as the most voted class in the sequence (y_(1), . . . , y_(k)):

f(x) = most voted class in (y_(1), . . . , y_(k)).

In regression problems, the predicted value is chosen as the average of (y_(1), . . . , y_(k)):

f(x) = (1/k) Σ_{i=1}^k y_(i)
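A minimal sketch of both kNN predictors in Python/NumPy (the course language); the array names and the brute-force distance computation are illustrative choices, not part of the slides:

import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=1, task="classification"):
    """k nearest neighbor prediction for a query point x, given training data (X, y)."""
    dist = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every training pattern
    nearest = np.argsort(dist)[:k]         # indices of the k nearest neighbors
    if task == "classification":
        return Counter(y[nearest]).most_common(1)[0][0]   # most voted class
    return y[nearest].mean()               # average outcome (regression)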

Jorge S. Marques, IST, 2017 26/279


Example: supervised classification problem

Consider a binary classification problem. The training set is shown in the figure and was generated by a mixture of Gaussians.

training data

How would you classify this data?

Jorge S. Marques, IST, 2017 27/279


k Nearest neighbor

Decision regions of kNN classifier with k = 1 (left) and k = 10 (right)

What k would you choose?

Jorge S. Marques, IST, 2017 28/279


Recommended Bibliography

I T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2009.
I T. Mitchell, Machine Learning, McGraw Hill, 1997.
I J. S. Marques, Reconhecimento de Padrões: Métodos Estatísticos e Neuronais, IST Press, 2nd ed., 2005.
I R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2nd edition, 2000.
I L. Almeida, Multilayer Perceptrons, in Handbook of Neural Computation, Oxford Press, 1997.
I I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, http://www.deeplearningbook.org, 2016.
I T. Fletcher, Support Vector Machines, UCL, 2008.

Jorge S. Marques, IST, 2017 29/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 30/279


Regression Problem

Consider the data set T = {(x^(i), y^(i)), i = 1, . . . , n} defined in the table.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: scatter plot of price vs. area)

We wish to predict the price of a flat in Lisbon, taking its area into
account.

The simplest prediction model is a straight line

ŷ = f (x) = β0 + β1 x

β0 is called the intercept or offset.


Jorge S. Marques, IST, 2017 31/279
Predictor estimation
How should we estimate the coefficients β0 , β1 ?

If we know one training example (x^(1), y^(1)), we obtain a single equation

ŷ^(1) = β0 + β1 x^(1) → infinitely many (β0, β1) solutions

If we know two training examples, we obtain two equations

ŷ^(1) = β0 + β1 x^(1)
ŷ^(2) = β0 + β1 x^(2) → unique solution, but bad (noisy)

If we know three training examples, we obtain three equations

ŷ^(1) = β0 + β1 x^(1)
ŷ^(2) = β0 + β1 x^(2) → no solution (inconsistent system)
ŷ^(3) = β0 + β1 x^(3)

To solve this problem, we must assume that there is an error between the output of the model ŷ^(i) and the data y^(i).

Jorge S. Marques, IST, 2017 32/279


Prediction loss
We assume that there is a prediction error associated with each training example

e^(i) = y^(i) − ŷ^(i) = y^(i) − f(x^(i)),

and define a quadratic loss (cost)

L(y^(i), ŷ^(i)) = (y^(i) − ŷ^(i))².

The total loss in the training set,

SSE = Σ_{i=1}^n (y^(i) − f(x^(i)))²,

is also known as (aka) the sum of squared errors (SSE), or the least squares (LS) criterion.
Jorge S. Marques, IST, 2017 33/279
Minimization: first order model

Model fit is achieved by minimizing the total loss in the training set

min_{β0,β1} Σ_{i=1}^n (y^(i) − β0 − β1 x^(i))².

The minimum is achieved at a point (β̂0, β̂1) such that the partial derivatives are zero (the gradient vector is the null vector)

∇_β SSE = [∂SSE/∂β0  ∂SSE/∂β1]^T = 0.

This leads to

∂SSE/∂β0 = 0  ⇒  −2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) = 0
∂SSE/∂β1 = 0  ⇒  −2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) x^(i) = 0.

Jorge S. Marques, IST, 2017 34/279


Analytic optimization

−2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) = 0
−2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) x^(i) = 0

Σ_{i=1}^n β̂0 + Σ_{i=1}^n β̂1 x^(i) = Σ_{i=1}^n y^(i)
Σ_{i=1}^n β̂0 x^(i) + Σ_{i=1}^n β̂1 x^(i) x^(i) = Σ_{i=1}^n y^(i) x^(i)

This leads to the normal equations:

[ Σ_{i=1}^n 1        Σ_{i=1}^n x^(i)     ] [ β̂0 ]   [ Σ_{i=1}^n y^(i)       ]
[ Σ_{i=1}^n x^(i)    Σ_{i=1}^n (x^(i))²  ] [ β̂1 ] = [ Σ_{i=1}^n y^(i) x^(i) ]

By solving this system of equations, we obtain (β̂0, β̂1).

Jorge S. Marques, IST, 2017 35/279


Analytic optimization

To guarantee that a minimum is achieved at this point, we should evaluate the matrix of second derivatives (Hessian matrix)

H = [ ∂²SSE/∂β0²     ∂²SSE/∂β0∂β1 ]  =  2 [ Σ_{i=1}^n 1       Σ_{i=1}^n x^(i)    ]
    [ ∂²SSE/∂β1∂β0   ∂²SSE/∂β1²   ]       [ Σ_{i=1}^n x^(i)   Σ_{i=1}^n (x^(i))² ]

and check if it is positive definite.

Jorge S. Marques, IST, 2017 36/279


Reminder: positive definite matrix

A symmetric n × n matrix M is a positive definite matrix if the scalar z^T M z is positive for every non-zero vector z ∈ R^n.

If M is a positive definite matrix, then

I it is a non-singular matrix (det M ≠ 0);
I all the eigenvalues are real and positive;
I all the leading principal minors are positive. The kth leading principal minor of a matrix M is the determinant of its upper-left k by k sub-matrix.

Jorge S. Marques, IST, 2017 37/279


Exercise

Suppose we remove the average value of the feature x and the outcome y,

x' ← x − x̄,   y' ← y − ȳ,

where x̄, ȳ are the average values computed in the training set.

Show that the least squares estimate of β0 is β̂0 = 0.

This result is useful to simplify the estimation of the β coefficients.

Jorge S. Marques, IST, 2017 38/279


Exercise

Minimize

SSE = Σ_{i=1}^n (y'^(i) − β'_0 − β'_1 x'^(i))²

∂SSE/∂β'_0 = 0 ⇒ −2 Σ_{i=1}^n (y'^(i) − β̂'_0 − β̂'_1 x'^(i)) = 0

Σ_{i=1}^n y'^(i) − n β̂'_0 − β̂'_1 Σ_{i=1}^n x'^(i) = 0

0 − n β̂'_0 − 0 = 0

β̂'_0 = 0

Jorge S. Marques, IST, 2017 39/279


Linear regression model (general case)

Let us extend linear regression to the general case in which we have p features x1, x2, . . . , xp ∈ R. The linear regression model is given by

ŷ = β0 + β1 x1 + · · · + βp xp.

Using vector notation, we obtain

ŷ = [1 x1 . . . xp] [β0 β1 . . . βp]^T = [1 x^T] β,

where x = [x1, x2, . . . , xp]^T ∈ R^p, β = [β0, β1, . . . , βp]^T ∈ R^{p+1}, and ŷ ∈ R.

Jorge S. Marques, IST, 2017 40/279


Problem formulation


Consider a training set T = {(x^(i), y^(i)), i = 1, . . . , n}, where x^(i) ∈ R^p and y^(i) ∈ R, i = 1, . . . , n.

The linear model

f(x) = [1 x^T] β

is trained by finding the vector of coefficients β̂ ∈ R^{p+1} that minimizes the total cost

SSE(β) = Σ_{i=1}^n (y^(i) − f(x^(i)))².

Jorge S. Marques, IST, 2017 41/279


Matrix notation

Adopting matrix notation,

X = [ 1  x1^(1)  . . .  xp^(1) ]        y = [ y^(1) ]
    [ 1  x1^(2)  . . .  xp^(2) ]            [ y^(2) ]
    [ .   ...    . . .   ...   ]            [  ...  ]
    [ 1  x1^(n)  . . .  xp^(n) ]            [ y^(n) ]

X is called the design matrix and y ∈ R^n is the vector of outcomes.

Cost function:

SSE(β) = ||y − ŷ||² = ||y − Xβ||²

where ||z|| = √(z^T z) denotes the Euclidean norm.

Jorge S. Marques, IST, 2017 42/279


Normal equations

The minimization of the SSE cost functional leads to a system of equations known as the normal equations:

(X^T X) β̂ = X^T y

The normal equations have a unique solution iff det(X^T X) ≠ 0.

The normal equations can be derived from the stationarity condition (necessary condition)

∇SSE(β) = 0

Jorge S. Marques, IST, 2017 43/279


Gradient properties

The proof of the normal equations requires two properties of the gradient.

Let f(x) be a scalar function, where x = [x1, . . . , xp]^T is a vector.

The gradient of f is defined by

∇_x f(x) = [∂f/∂x1 . . . ∂f/∂xp]^T.

Useful properties:

inner product: ∇_x (b^T x) = b,   b ∈ R^p,
quadratic form: ∇_x (x^T M x) = (M + M^T) x,   M ∈ R^{p×p}.

Jorge S. Marques, IST, 2017 44/279


Proof
Cost function

SSE = ||y − Xβ||² = (y − Xβ)^T (y − Xβ)          (norm definition)
    = y^T y − y^T Xβ − β^T X^T y + β^T X^T Xβ     (distributive prop.)
    = y^T y − 2 y^T Xβ + β^T X^T Xβ               (transpose prop.)

Computing the gradient and making it equal to zero,

∇_β SSE = −2 X^T y + 2 X^T Xβ = 0,

we conclude

(X^T X) β̂ = X^T y.

The inverse of matrix X^T X may not exist for two main reasons:
I small amount of data, e.g., the number of data points is smaller than the number of features;
I redundant features (linearly dependent), e.g., duplicated features.

Jorge S. Marques, IST, 2017 45/279


Summary of linear regression

Model training

normal equations: (X^T X) β̂ = X^T y
parameter estimates: β̂ = (X^T X)^{−1} X^T y

Prediction

new data: f(x0) = [1 x0^T] β̂
training data: ŷ = X β̂ = X (X^T X)^{−1} X^T y

Attention: the inverse of matrix X^T X does not exist if det(X^T X) = 0.
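As a sanity check, a minimal sketch of these formulas in Python/NumPy (the course language), reusing the flat data from the earlier example; np.linalg.solve is used instead of an explicit inverse:

import numpy as np

# flat data from the earlier example: area (m^2) and price (K euros)
area = np.array([130., 60., 87., 125., 147., 180.])
price = np.array([370., 220., 57., 400., 430., 640.])

X = np.column_stack([np.ones_like(area), area])   # design matrix [1 x]
beta = np.linalg.solve(X.T @ X, X.T @ price)      # normal equations (X^T X) beta = X^T y
y_hat = X @ beta                                  # predictions on the training data
print(beta, np.sum((price - y_hat) ** 2))         # estimated coefficients and SSE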

Jorge S. Marques, IST, 2017 46/279


Gauss Markov Theorem

Let y = Xβ + w, where β is unknown, X is known (deterministic) and w is a realization of a random vector with zero mean and covariance σ² I. Then,

I The least squares estimate of β,

β̂ = (X^T X)^{−1} X^T y,

is unbiased, with covariance matrix σ² (X^T X)^{−1}.

I If β̃ = Py is another unbiased estimator of β, it has a covariance matrix that is equal to or larger than Cov{β̂}¹.

Proof: Hastie et al., Elements of Statistical Learning, Springer, 2009

¹ the inequality A ≥ B means that A − B is a positive semi-definite matrix


Jorge S. Marques, IST, 2017 47/279
Example

Figure shows the least squares fit of a linear model (straight line) to the
flat data.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: least squares straight-line fit to the flat data)

Jorge S. Marques, IST, 2017 48/279


Polynomial model
The linear model is often very rigid, especially when the number of
features is small.

An alternative is the polynomial model (x scalar):

f(x) = β0 + β1 x + · · · + βp x^p.

This can be considered as a linear model whose features are the powers of x.

The model is non-linear in x but linear in the parameters β_i. Therefore, the β coefficients can be obtained by the least squares method described before, leading to a linear set of equations (a code sketch follows).

(figure: polynomial fit to the flat data)
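A short sketch of this reduction to ordinary least squares in Python/NumPy, using the flat data again; np.vander builds the powers of x:

import numpy as np

def polyfit_ls(x, y, p):
    """Least squares fit of a degree-p polynomial: linear LS on features 1, x, ..., x^p."""
    X = np.vander(x, p + 1, increasing=True)   # design matrix with polynomial features
    return np.linalg.solve(X.T @ X, X.T @ y)   # normal equations

area = np.array([130., 60., 87., 125., 147., 180.])
price = np.array([370., 220., 57., 400., 430., 640.])
beta = polyfit_ls(area, price, 3)              # third order polynomial model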

Jorge S. Marques, IST, 2017 49/279


Model order
The polynomial model becomes numerically unstable when we increase the order of the polynomial.

The figure shows polynomial fits for p = 1, 3, 4.

(figure: polynomial fits of orders 1, 3 and 4 to the flat data)

How should we choose the best order? Is the SSE a good criterion?

Jorge S. Marques, IST, 2017 50/279


Radial basis functions: model

This model is based on a sum of radial basis functions (Gaussian local functions) defined by

G_k(x) = exp(−||x − c^(k)||² / (2σ²)),   k = 1, . . . , p,

where c^(k) ∈ R^d is a center vector (aka centroid) to be computed from the data.

The radial basis function model approximates the outcome y by a weighted sum of local basis functions:

f(x) = Σ_{k=1}^p w_k G_k(x).

Jorge S. Marques, IST, 2017 51/279


Radial basis functions: training

The model is estimated as follows. First we estimate the p centroids c_k. This is not done by least squares: the centroids c_k are obtained using a clustering algorithm such as k-means (to be presented later in this course).

Then, the coefficients w = [w_1, . . . , w_p]^T are estimated by least squares,

w = (X^T X)^{−1} X^T y,

where X is given by

X = [ G_1(x^(1))  G_2(x^(1))  . . .  G_p(x^(1)) ]
    [    ...         ...      . . .     ...     ]
    [ G_1(x^(n))  G_2(x^(n))  . . .  G_p(x^(n)) ]

σ² is a hyperparameter chosen by the user or estimated from the data.
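A minimal sketch of this two-stage training in Python; the centroids come from scikit-learn's KMeans here (an assumed choice, any clustering method would do) and σ is fixed by hand:

import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, sigma):
    """Matrix of basis values G_k(x^(i)) = exp(-||x - c^(k)||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_train(X, y, p, sigma):
    centers = KMeans(n_clusters=p, n_init=10).fit(X).cluster_centers_   # stage 1: centroids
    G = rbf_design(X, centers, sigma)
    w = np.linalg.solve(G.T @ G, G.T @ y)                               # stage 2: least squares
    return centers, w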

Jorge S. Marques, IST, 2017 52/279


Example - radial basis functions

(figures: the radial basis functions (left) and the regression results (right))

(centroids are equally spaced instead of being computed from the data)

Jorge S. Marques, IST, 2017 53/279


Regression with multiple outputs

Suppose that we have multiple outputs y1, . . . , yK ∈ R, each of them approximated by a linear model with coefficients β_k ∈ R^{p+1}:

y_k = X β_k + w_k,   k = 1, . . . , K.

The SSE for the multiple outputs is the sum of the SSE for each output,

SSE = Σ_{k=1}^K SSE_k(β_k).

Each regression problem can be independently solved, i.e., β̂_k can be obtained by minimizing SSE_k(β_k):

(X^T X) β̂_k = X^T y_k.

Jorge S. Marques, IST, 2017 54/279


Regression with multiple outputs

The problem can also be formulated using matrix notation

Y = Xβ + W,

where Y = [y1 . . . yK] ∈ R^{n×K}, β = [β1 . . . βK] ∈ R^{p×K}, W = [w1 . . . wK] ∈ R^{n×K}.

The minimization of the sum of squared errors criterion

SSE(β) = tr{(Y − Xβ)^T (Y − Xβ)}

leads to

β̂ = (X^T X)^{−1} X^T Y.

This is equivalent to independently solving each of the K least squares problems sharing the same design matrix X.

tr{} denotes the trace of a matrix (sum of the diagonal elements).

Jorge S. Marques, IST, 2017 55/279


Exercises


1. Consider a training set T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}. Write the normal equations for a least squares fit of a second order polynomial model (x scalar) to the training data.

2. Repeat the previous problem, assuming 2D features x ∈ R².

3. What happens to the predicted outcome ŷ, estimated by least squares (without offset), if the observed features are scaled, i.e., x'^(i) = D x^(i), i = 1, . . . , n, where D is a diagonal matrix?

Jorge S. Marques, IST, 2017 56/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 57/279


Motivation

A linear model can be estimated by minimizing the least squares criterion in the training set,

||y − Xβ||²,

where

X = [ x1^(1)  x2^(1)  . . .  xp^(1) ]      y = [ y^(1) ]      β = [ β1 ]
    [ x1^(2)  x2^(2)  . . .  xp^(2) ]          [ y^(2) ]          [ β2 ]
    [  ...     ...    . . .   ...   ]          [  ...  ]          [ ... ]
    [ x1^(n)  x2^(n)  . . .  xp^(n) ]          [ y^(n) ]          [ βp ]

We removed the mean x̄, ȳ from the data in order to make β0 = 0. The vector β does not include β0 and X does not include a column of ones.

Jorge S. Marques, IST, 2017 58/279


Drawbacks

The LS approach leads to the normal equations

(X^T X) β̂ = X^T y.

If (X^T X) is a singular matrix, the least squares estimate is not unique: infinitely many solutions are available.

Example
Suppose we wish to estimate the model from a single example.

ŷ = x1 β1 + x2 β2

x1 x2 y
1  1  3
Jorge S. Marques, IST, 2017 59/279


Example (cont.)

We obtain 2 parameters and 1 constraint, leading to infinitely many solutions

β1 + β2 = 3.

How can we solve this difficulty?

By adding a new constraint: minimizing the squared norm of the coefficients

β1² + β2²

This is a measure of ”model complexity”.

Jorge S. Marques, IST, 2017 60/279


Ridge regression

An alternative criterion is ridge regression,

β̂_ridge = arg min_β ||y − Xβ||² + λ||β||²,

where ||·|| denotes the Euclidean norm. The new term ||β||² penalizes the use of large coefficients and is denoted the regularization term. This criterion aims to represent the data while keeping the coefficients small.

λ represents the trade-off between the two objectives.

Furthermore, we assume that the training data X, y have zero mean. The coefficient β0 is not usually included in the regularization term.

Jorge S. Marques, IST, 2017 61/279


Ridge regression

The ridge problem can be rewritten as a constrained optimization problem

β̂_ridge = arg min_β ||y − Xβ||²,   s.t.   ||β||² ≤ τ.

There is a one-to-one correspondence between the values of τ and λ.

The ridge regression can be solved by computing the gradient vector and making it equal to zero, leading to

β̂_ridge = (X^T X + λI)^{−1} X^T y.

The matrix (X^T X + λI), with λ > 0, is always non-singular, even if (X^T X) is singular.
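A one-function sketch of this closed form in Python/NumPy, assuming X and y have already been centered as discussed above:

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)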

Jorge S. Marques, IST, 2017 62/279


Proof

Cost function

E_ridge = ||y − Xβ||² + λ||β||²
        = (y − Xβ)^T (y − Xβ) + λ β^T β
        = y^T y − 2 y^T Xβ + β^T X^T Xβ + λ β^T β
        = y^T y − 2 y^T Xβ + β^T (X^T X + λI) β.

Computing the gradient and making it equal to zero,

∇_β E_ridge = −2 X^T y + 2 (X^T X + λI) β = 0,

we conclude

(X^T X + λI) β̂_ridge = X^T y.

Jorge S. Marques, IST, 2017 63/279


Exercise

Find the relationship between the eigenvectors and eigenvalues of the LS matrix (X^T X) and the ridge matrix (X^T X + λI).

Try to solve it by yourself.

Jorge S. Marques, IST, 2017 64/279


Tentative solution

LS: (X^T X) v_ls = λ_ls v_ls ⇒ (X^T X − λ_ls I) v_ls = 0

Ridge: (X^T X + λI) v_ridge = λ_ridge v_ridge

(X^T X + λI − λ_ridge I) v_ridge = 0

Comparing,

λ_ridge = λ_ls + λ
v_ridge = v_ls

Conclusion:
The eigenvectors are equal and the eigenvalues are shifted by λ. If λ > 0, the eigenvalues of the ridge matrix are positive and the matrix is non-singular.

Jorge S. Marques, IST, 2017 65/279


The Lasso

Another alternative is the lasso.

Lasso regression aims to minimize the sum of squared errors (with β0 = 0),

min_β ||y − Xβ||²,

with a different constraint on the coefficients, one that penalizes large coefficients less:

Σ_{j=1}^p |β_j| ≤ τ.

This constraint can be expressed in terms of the ℓ1 norm: ||β||_1 ≤ τ.

Since we are dealing with two norms, the ℓ2 (Euclidean) norm will be denoted by ||·||_2 and the ℓ1 norm by ||·||_1.

Jorge S. Marques, IST, 2017 66/279


The Lasso

The Lagrangian formulation is given by

β̂_lasso = arg min_β ||y − Xβ||²_2 + λ||β||_1,

where the last term can be interpreted as a regularization term.

This optimization problem cannot be solved by a linear system of equations as before. In fact, we have to resort to convex optimization methods to numerically solve this problem.
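For illustration, a hedged sketch using scikit-learn's Lasso solver (one option among several; the CVX package on the next slide is another). Note that sklearn scales the data term by 1/(2n), so its alpha is related to, but not numerically identical to, the λ above. The synthetic data mimics the example coefficients β = [1 0.5 0 0 0] used later:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                       # zero-mean features
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1, fit_intercept=False)           # alpha plays the role of lambda
model.fit(X, y)
print(model.coef_)   # coefficients of unimportant features tend to be exactly zero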

Jorge S. Marques, IST, 2017 67/279


CVX software package

Jorge S. Marques, IST, 2017 68/279


Sparse solutions

We often wish to find sparse solutions for β (with some zero coefficients), which corresponds to selecting only a subset of features (feature selection).

The problem can be formulated as

β̂_sparse = arg min_β ||y − Xβ||²_2 + λ · (number of non-zero coefficients),

where the number of non-zero coefficients is often called the ”ℓ0 norm”, ||·||_0, although it does not verify the axioms of a norm.

Regularization with the ℓ0 norm is difficult to solve numerically. However, the solution is often well approximated by lasso regression, which also leads to sparse solutions in many problems: when a feature is not important, the corresponding coefficient is made equal to zero.

Jorge S. Marques, IST, 2017 69/279


Feature Selection

The lasso estimate β̂_lasso is often a sparse vector of coefficients, where less important features receive a zero coefficient.

This can be interpreted as a feature selection operation. Since unimportant features are removed, the other ones are better estimated.

Jorge S. Marques, IST, 2017 70/279


Example: lasso vs ridge regression
Regression problem with a subset of features uncorrelated with the outcome. The training data was generated by y = x^T β + w, where w ∼ N(0, σ²), x ∼ N(0, I) and β = [1 0.5 0 0 0].

Estimates obtained by least squares (horizontal lines), ridge regression and lasso, as a function of λ.

(figures: coefficient paths vs. λ for ridge regression (left) and lasso (right))
Jorge S. Marques, IST, 2017 71/279
Non-centered data

How should we proceed if the training data T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))} are not centered?

1. pre-processing: x'^(i) = x^(i) − x̄, y'^(i) = y^(i) − ȳ (x̄, ȳ are the average values computed in the training set);

2. estimate a linear model without intercept: estimate the model y' = x'^T β', β' ∈ R^p, using the pre-processed data T' = {(x'^(1), y'^(1)), . . . , (x'^(n), y'^(n))} and regularization;

3. invert the pre-processing: β̂ = [β̂0 β̂'^T]^T, where β̂0 = ȳ − x̄^T β̂'.

The Matlab commands ridge and lasso perform all three steps.

Jorge S. Marques, IST, 2017 72/279


Exercises

1. Does a linear regressor, estimated by the least squares method, depend on the scale of the features? What happens if we use ridge regression instead?

2. Suppose we wish to predict a variable y ∈ R using a single feature x ∈ R (without intercept). Given a training set T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))} and assuming that the feature x is normalized,

(1/n) Σ_{i=1}^n (x^(i))² = 1,

find the ridge and lasso coefficients, β̂_ridge, β̂_lasso, as a function of the least squares coefficient, β̂_ls, and plot them.

Try to solve it by yourself.

Jorge S. Marques, IST, 2017 73/279


Tentative solution

1. Suppose we multiply the features by a scale factor, X' = sX, where s is the scale factor. Then, the vector of coefficients becomes

β'_ls = (X'^T X')^{−1} X'^T y = (s² X^T X)^{−1} s X^T y = s^{−2} (X^T X)^{−1} s X^T y

β'_ls = s^{−1} β_ls.

LS predictor: ŷ' = x'^T β' = s x^T s^{−1} β_ls = x^T β_ls = ŷ, i.e., the LS predictor is invariant under scaling.

The ridge coefficients are the solution of (X'^T X' + λI) β'_ridge = X'^T y.

The matrix (X'^T X' + λI) has two terms: one that depends on the scale and another that does not. Therefore, the ridge predictor is not invariant to scale.

Jorge S. Marques, IST, 2017 74/279


Tentative solution
2.

LS: β̂_ls = (X^T X)^{−1} X^T y = (1/n) X^T y

Ridge: β̂_ridge = (X^T X + λI)^{−1} X^T y = (1/(n + λ)) X^T y

β̂_ridge = (n/(n + λ)) β̂_ls

Lasso: min_β ||y − Xβ||²_2 + λ||β||_1

min_β (y^T y − 2 y^T Xβ + β² X^T X) + λ|β|

Hypothesis β̂ > 0: d(. . .)/dβ = 0 ⇒ −2n β̂_ls + 2n β̂_lasso + λ = 0

β̂_lasso = β̂_ls − λ/(2n)

The same should be repeated for β̂_lasso < 0.

These relationships should be graphically represented.
Jorge S. Marques, IST, 2017 75/279
Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 76/279


Optimization
Linear regression boils down to minimizing

SSE = ||y − Xβ||²,

which can be analytically solved.

Most regression / classification problems involve the solution of an optimization problem,

θ̂ = arg min_θ J(θ),

where J : R^p → R is the cost function and θ ∈ R^p denotes the model parameters.

In most cases, this cannot be analytically solved and we must rely on numerical (iterative) optimization algorithms that deliver approximate values for the parameters.

Jorge S. Marques, IST, 2017 77/279


Optimization methods

Optimization methods require different types of information


1. function values: J(θ),
2. first derivatives (gradient vector): ∇θ J,
3. second derivatives (Hessian matrix): H.

Jorge S. Marques, IST, 2017 78/279


Global and Local minima
A function J : R^p → R has a local minimum at θ* ∈ R^p if there is an ε > 0 such that

||θ − θ*|| < ε ⇒ J(θ) ≥ J(θ*)

where
I J(θ*) is called a local minimum,
I θ* is called a local minimizer.

A function J : R^p → R has a global minimum at θ* ∈ R^p if for all θ ∈ R^p

J(θ) ≥ J(θ*)

where
I J(θ*) is called a global minimum,
I θ* is called a global minimizer.

Jorge S. Marques, IST, 2017 79/279


Global and Local minima

We are interested in the global minimum, but most algorithms get trapped in local minima (left), if they exist, and may not converge to the global minimum.

Notice that if the function is convex (right), it has no more than one minimum.

Jorge S. Marques, IST, 2017 80/279


Gradient descent -1D case

If θ is a scalar, the derivative of J(θ) conveys information about the slope of the function to be minimized.

If we move a small amount in the opposite direction of the derivative, the function decreases:

θ^(t+1) = θ^(t) − η (dJ/dθ)(θ^(t)),

where η controls the displacement of the point θ^(t) and is known as the step size or learning step.

The process starts with an initial guess θ^(0).

Jorge S. Marques, IST, 2017 81/279


Gradient descent - vector case

If θ ∈ R^p and J(θ) is a differentiable function in a neighborhood of a point θ^(t), then J(θ) decreases fastest if we move along the opposite direction of the gradient:

θ^(t+1) = θ^(t) − η ∇_θ J(θ^(t)).

This procedure is repeated until the function stops decreasing, meaning that we are in the vicinity of a local minimum or in a plateau. This algorithm is called the gradient descent or steepest descent algorithm.
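A minimal sketch of this update loop in Python; the quadratic J(x1, x2) = x1² + 10x2² from the later examples is used as a test function, and the step size and iteration count are arbitrary illustrative choices:

import numpy as np

def grad_J(theta):
    """Gradient of the quadratic J(x1, x2) = x1^2 + 10*x2^2 used in the later examples."""
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([2.0, 1.0])    # initial guess theta^(0)
eta = 0.05                      # step size (learning step)
for t in range(100):
    theta = theta - eta * grad_J(theta)   # move against the gradient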

Jorge S. Marques, IST, 2017 82/279


Gradient descent - vector case

Another way to motivate the gradient algorithm is based on the first order approximation of the cost function J(θ^(t) + ∆), using a Taylor series expansion,

J(θ^(t) + ∆) ≈ J(θ^(t)) + ∇_θ J(θ^(t))^T ∆,

valid for a small displacement ∆.

If we make ∆ = −η ∇_θ J(θ^(t)), we obtain

J(θ^(t) + ∆) ≈ J(θ^(t)) − η ||∇_θ J(θ^(t))||²,

which corresponds to a decrease of the cost function.

The choice of η is a difficult aspect of the algorithm. The first order approximation becomes invalid if η is too ”large”.

Jorge S. Marques, IST, 2017 83/279


Choice of step size η

The choice of step size involves a trade-off and is often made by trial-and-error. If η is too small, the update process becomes very slow.

(figures: isotropic valley vs. narrow valley)

On the contrary, if η is too large, the algorithm may skip a local minimum or produce an update of θ that increases the objective function J(θ).

Acceleration techniques can be used to speed up convergence, e.g., adaptive step size and the momentum technique.

Jorge S. Marques, IST, 2017 84/279


Momentum technique

This method performs a lowpass filtering of the gradient sequence and updates θ^(t+1) using the filtered gradient v^(t+1):

v^(t+1) = α v^(t) − η ∇_θ J(θ^(t))
θ^(t+1) = θ^(t) + v^(t+1).

The parameter α (pole) typically ranges from 0.5 to 0.95.

This technique improves the convergence rate, especially if the cost function J(θ) exhibits deep valleys, in which the gradient method is slow.

The objective function J is evaluated in each iteration to check if it decreases as expected. If not, the memory of the momentum term is set to zero.

L. Almeida, Multilayer Perceptrons, in Handbook of Neural Computation, 1997.

Jorge S. Marques, IST, 2017 85/279


Nesterov accelerated gradient

This method is similar to the momentum technique, but it computes the gradient at a different position: it computes an approximate position for the parameters in the next iteration and evaluates the gradient there (look-ahead).

v^(t+1) = α v^(t) − η ∇_θ J(θ^(t) + α v^(t))
θ^(t+1) = θ^(t) + v^(t+1).

This algorithm performs better than the momentum technique in many problems.

Sutskever, Martens, Dahl, Hinton, On the importance of initialization and momentum in deep
learning, 2013.
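A sketch of the two update rules in Python on the same quadratic test function as the previous block; switching the gradient evaluation point between θ and θ + αv switches between momentum and Nesterov (α and η values are illustrative):

import numpy as np

grad_J = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # gradient of x1^2 + 10*x2^2

alpha, eta = 0.9, 0.02
theta, v = np.array([2.0, 1.0]), np.zeros(2)
for t in range(100):
    # momentum would use grad_J(theta); Nesterov evaluates the gradient at the
    # look-ahead point theta + alpha * v
    v = alpha * v - eta * grad_J(theta + alpha * v)
    theta = theta + v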

Jorge S. Marques, IST, 2017 86/279


Adaptive step size (Almeida & Silva)
This method assumes that the step size η is different for each component of θ and changes in each iteration. Therefore,

θ_i^(t+1) = θ_i^(t) − η_i^(t) (∂J/∂θ_i)(θ^(t)).

Step size update:

η_i^(t) = u η_i^(t−1)   if (∂J/∂θ_i)(θ^(t)) · (∂J/∂θ_i)(θ^(t−1)) > 0,
η_i^(t) = d η_i^(t−1)   otherwise.

Typical values for the parameters: u = 1.2, d = 0.8. This technique performs very well if the cost function contains valleys aligned with the axes.

The objective function J is evaluated in each iteration to check if it decreases as expected. If not, the previous values of the parameters are kept and the step sizes are reduced.
Jorge S. Marques, IST, 2017 87/279
Newton method

The Newton method assumes that we know not only the gradient vector ∇_θ J(θ) but also the matrix of second derivatives (Hessian matrix):

∇_θ J = [∂J/∂θ1 . . . ∂J/∂θd]^T

H = [ ∂²J/∂θ1²      ∂²J/∂θ1∂θ2   . . .  ∂²J/∂θ1∂θd ]
    [ ∂²J/∂θ2∂θ1    ∂²J/∂θ2²     . . .  ∂²J/∂θ2∂θd ]
    [    ...            ...      . . .     ...     ]
    [ ∂²J/∂θd∂θ1    ∂²J/∂θd∂θ2   . . .  ∂²J/∂θd²   ]

and requires the inversion of H in each iteration.

Jorge S. Marques, IST, 2017 88/279


Newton method

Given a guess θ^(t), we can approximate the cost function J(θ) by the 2nd order Taylor expansion

J(θ^(t) + ∆) ≈ J(θ^(t)) + ∇_θ J(θ^(t))^T ∆ + (1/2) ∆^T H(θ^(t)) ∆,

where ∆ is a small displacement vector.

Minimization of J(θ^(t) + ∆) with respect to the displacement ∆ can be achieved through the necessary condition

∇_∆ J(θ^(t) + ∆) = 0

Jorge S. Marques, IST, 2017 89/279


Newton method

Necessary condition for optimality:

∇_∆ J(θ^(t) + ∆) = 0
∇_∆ [J(θ^(t)) + ∇_θ J(θ^(t))^T ∆ + (1/2) ∆^T H(θ^(t)) ∆] = 0
∇_θ J(θ^(t)) + H(θ^(t)) ∆ = 0
∆ = −[H(θ^(t))]^{−1} ∇_θ J(θ^(t))

Therefore,

θ^(t+1) = θ^(t) − [H(θ^(t))]^{−1} ∇_θ J(θ^(t))

The Newton method gives an exact solution for the parameters if J is a quadratic function.
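A sketch of one Newton step in Python for the quadratic J(x1, x2) = x1² + 10x2² used in the examples; since J is quadratic, a single step lands on the minimizer:

import numpy as np

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # gradient of x1^2 + 10*x2^2
H = np.array([[2.0, 0.0],
              [0.0, 20.0]])                              # Hessian (constant for a quadratic)

theta = np.array([2.0, 1.0])
theta = theta - np.linalg.solve(H, grad(theta))          # one Newton step -> [0, 0]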

Jorge S. Marques, IST, 2017 90/279


Example 1 - gradient vs Newton method

Quadratic function: J(x1, x2) = x1² + x2²

(figures: surface and contour plots)

10 iterations of gradient descent (blue) and 1 iteration of the Newton method (red).

Jorge S. Marques, IST, 2017 91/279


Example 1 - gradient, momentum, Nesterov, Newton

Quadratic function: J(x1, x2) = x1² + x2²

(figures: contour plot (left) and cost function per iteration (right))

Left: 10 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 92/279


Example 2 - gradient vs Newton method

Quadratic function: J(x1, x2) = x1² + 10x2²

(figures: surface and contour plots)

10 iterations of gradient descent (blue) and 1 iteration of the Newton method (red).

Jorge S. Marques, IST, 2017 93/279


Example 2 - gradient, momentum, Nesterov, Newton

Quadratic function: J(x1, x2) = x1² + 10x2²

(figures: contour plot (left) and cost function per iteration (right))

Left: 10 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 94/279


Example 3 - gradient vs Newton method
Rosenbrock function with constant 10, instead of 100, to simplify the problem: J(x1, x2) = (1 − x1)² + 10(x2 − x1²)²

(figures: surface and contour plots)

300 iterations of gradient descent (blue) and 5 iterations of the Newton method (red).

Jorge S. Marques, IST, 2017 95/279


Example 3 - gradient, momentum, Nesterov, Newton

Rosenbrock function with constant 10, instead of 100, to simplify the problem: J(x1, x2) = (1 − x1)² + 10(x2 − x1²)²

(figures: contour plot (left) and cost function per iteration (right))

Left: 100 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 96/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 97/279


Supervised learning

We wish to predict an outcome y, given a vector of features x ∈ R^p.

Predictor (model):

ŷ = f(x, θ).

The parameters of the model, θ, are estimated from a training set

T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}.

But, learned systems are not perfect. The output of a learned system is
not always the desired output.

Learning systems need to be evaluated.

Jorge S. Marques, IST, 2017 98/279


Example: regressor

How do we measure the performance of a regressor?


Polynomial fits (order 1, 3, and 4).

Jorge S. Marques, IST, 2017 99/279


Example: classifier

How do we measure the performance of a classifier?

Training set (left) and predicted classes (right) using the k nearest
neighbor method.

Jorge S. Marques, IST, 2017 100/279


Loss function

If the desired output, y, is different from the predicted outcome, ŷ = f(x), we define a loss L(y, ŷ), e.g.,

regression: L(y, ŷ) = (y − ŷ)²

classification: L(y, ŷ) = 0 if y = ŷ, 1 otherwise,
or
L(y = i, ŷ = j) = L_ij, with the diagonal terms (no error) equal to zero.

In the classification problem, the second loss function is more flexible, since it may assign different penalties to different kinds of errors.

Jorge S. Marques, IST, 2017 101/279


Risk

If x, y are realizations of two random variables, it makes sense to define the expected (average) value of the loss, also known as the risk:

R = E{L(y, ŷ(x))}

In the case of regression, the risk would be

R = ∫∫ L(y, ŷ(x)) p(x, y) dx dy

This requires the joint distribution of the input and output, p(x, y), which is usually unknown.

Jorge S. Marques, IST, 2017 102/279


Risk

In the case of classification problems, the risk would be

R = Σ_y Σ_ŷ L(y, ŷ) P(y, ŷ)

This requires the joint distribution of the true and predicted class, P(y, ŷ), which is usually unknown.

Jorge S. Marques, IST, 2017 103/279


Empirical risk

Since the risk cannot be computed in most problems, we can replace the expected value by an average of the loss computed with the training data,

R_e = (1/n) Σ_{i=1}^n L(y^(i), f(x^(i))).

This is called the empirical risk.

The empirical risk is often used to train the predictor. However, is it a good criterion to evaluate the system?
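As a minimal sketch, this average translates directly into Python; the function names and signatures below are illustrative, not from the slides:

import numpy as np

def empirical_risk(f, loss, X, y):
    """Average loss of predictor f over a data set (X, y)."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

squared_loss = lambda y, y_hat: (y - y_hat) ** 2   # the regression loss from the previous slides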

Jorge S. Marques, IST, 2017 104/279


Example: polynomial fit
Polynomial fits of order 1, 3, 4: which model is the best?

(figure: polynomial fits to the flat data)

The empirical risk of the fourth order polynomial is the smallest. But is this the best model?

The model order is often considered as a hyperparameter.

Jorge S. Marques, IST, 2017 105/279


Generalization
We want to measure the performance of the system with new data. This property is known as generalization.

To evaluate the generalization of a system, we should consider an independent data set

T' = {(x'^(i), y'^(i)), i = 1, . . . , n'}

and evaluate the model on it:

R'_e = (1/n') Σ_{i=1}^{n'} L(y'^(i), f(x'^(i)))

Important questions:
I is R_e (computed in the training set) a good estimate of R'_e (computed in an independent set)?
I can R_e or R'_e be used to choose the model hyperparameters (e.g., the polynomial order)?
Jorge S. Marques, IST, 2017 106/279
Evaluation of polynomial order

Average loss in the training set (blue) and in an independent set (red), as
a function of polynomial degree (hyperparameter).
Conclusions:
I The evaluation in the training set is too optimistic. An independent
data set is mandatory to obtain a reliable evaluation.
I The use of an independent data set allows the choice of model
hyperparameters (polynomial degree).
Jorge S. Marques, IST, 2017 107/279
Evaluation of k Nearest neighbor


Percentage of classification error in the training set (blue) and in an independent set (red), as a function of 1/k.

The conclusions are the same as before.

Jorge S. Marques, IST, 2017 108/279


Overfitting

When there is a large difference between the evaluation of the model in the training set and in an independent set, it means that the model is too specialized in representing the training data and performs much worse with new data.

This phenomenon is known as overfitting.

Jorge S. Marques, IST, 2017 109/279


Summary (until now)

To estimate a model and evaluate it, the use of 2 independent data sets is recommended: a training set and a test set.

To estimate a model, select the hyperparameters and evaluate the selected model, the use of 3 independent data sets is recommended: a training set, a validation set, and a test set.

More sophisticated techniques are available (cross validation, leave-one-out).

Jorge S. Marques, IST, 2017 110/279


Model training and testing (without hyperparameters)

In most learning problems, we need to train and evaluate the model. These operations should be done using two independent data sets, known as the training set and the test set.

This can be written in pseudocode using the functions f = train(T) and P = perform(f, T'):

Data: training set T and test set T'.
f = train(T);
P = perform(f, T');

Algorithm 1: Training and testing of a model

Jorge S. Marques, IST, 2017 111/279


Hyperparameter selection
If we need to choose the values of hyperparameters ξ (e.g., the polynomial degree), this should be done using a third independent set known as the validation set.

This can be written in pseudocode using the functions f = train(T, ξ) and P = perform(f, T'):

Data: training set T, validation set Tv and test set T'.
Result: Select hyperparameters ξ and evaluate the model.
for all values of ξ do
    f = train(T, ξ);
    P(ξ) = perform(f, Tv);
end
ξ̂ = arg min_ξ P(ξ);
f = train(T ∪ Tv, ξ̂);
P = perform(f, T');

Algorithm 2: Training, optimization and testing of a model

This method requires a lot of data.


Jorge S. Marques, IST, 2017 112/279
Cross-validation
Cross validation is a very useful technique when we do not have a large amount of data. The data set is split into K folds T_k (subsets with the same number of examples). One fold is used for testing and the others for training. The test fold then rotates, K times in total.

Data: K folds T_k.
for k = 1, . . . , K do
    f = train(T \ T_k);
    P_k = perform(f, T_k);
end
P = P̄_k

Algorithm 3: Cross validation without hyperparameters. The bar denotes the average over all the folds.

The final score is a combination of the evaluations of K models. The method does not produce a final classifier/regressor (a code sketch of this procedure follows).
Question: if we need one, what should we do?
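A minimal sketch of Algorithm 3 in Python, assuming generic train and perform functions with the signatures used in the pseudocode:

import numpy as np

def cross_validate(folds, train, perform):
    """Rotate the test fold over the K folds and average the scores."""
    scores = []
    for k, test_fold in enumerate(folds):
        train_data = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        f = train(train_data)                  # fit on T \ T_k
        scores.append(perform(f, test_fold))   # evaluate on the held-out fold T_k
    return np.mean(scores)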

Jorge S. Marques, IST, 2017 113/279


Cross-validation with hyperparameters (nested)
Cross validation can be extended to account for the estimation of hyperparameters. The data is again divided into K folds, and two of them are used for validation and testing.

Data: K folds T_k.
for i = 1, . . . , K do
    for all values of ξ do
        for j ≠ i do
            f = train(T \ (T_i ∪ T_j), ξ);
            P(ξ)_j = perform(f, T_j);
        end
        P(ξ) = P̄(ξ)_j
    end
    ξ̂_i = arg min_ξ P(ξ);
    f = train(T \ T_i, ξ̂_i);
    P_i = perform(f, T_i);
end
P = P̄_i

Algorithm 4: Cross validation with hyperparameters (nested).
Jorge S. Marques, IST, 2017 114/279
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 115/279


History

MIT

I The human brain has been a source of inspiration in Computer Science.
I The brain has a large number of processing units (∼ 10^11 neurons) that are slow (∼ 1 ms) but highly connected. Although they are slow, the neurons are able to perform very complex tasks, e.g., visual tasks, in real time and almost effortlessly.
I One of the first models of a neuron was proposed by McCulloch & Pitts in the 40s and was a starting point for an exciting area: artificial neural networks (ANN).
Jorge S. Marques, IST, 2017 116/279
Neuron

Wikipedia

The neuron is a cell consisting of dendrites (inputs), a soma (cell body) and an axon (output).

It receives input signals through its dendrites. These signals are combined in the soma and, from time to time, an electric impulse is generated that travels through the axon and influences other cells.

Jorge S. Marques, IST, 2017 117/279


McCulloch & Pitts model

The neuron model proposed by McCulloch & Pitts (1943) has a linear part, followed by a nonlinearity:

weighted sum of inputs (activation):

s = [1 x^T] w = x̃^T w

output (Heaviside function):

ŷ = z = g(s) = 1 if s ≥ 0, 0 otherwise

The weighted sum s is called the activation, the nonlinear function g : R → R is known as the activation function, and the vector w = [w0 . . . wp]^T is the weight vector.

Jorge S. Marques, IST, 2017 118/279


Rosenblatt algorithm

Rosenblatt proposed an iterative algorithm in the 50s to train the weights of the McCulloch & Pitts unit for the prediction of binary outcomes (a code sketch follows the algorithm).

Rosenblatt algorithm
1. training set: T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, with x^(k) ∈ R^p, y^(k) ∈ {0, 1};
2. initialization: randomly initialize the weights w_i(0), i = 0, . . . , p;
3. new training example: present a new training pattern (x(t), y(t)) to the model and compute the model output ŷ(t) = g(x̃^T(t) w(t − 1));
4. update: update the weights according to

w_i(t) = w_i(t − 1) + η x̃_i(t) ε(t),   ε(t) = y(t) − ŷ(t),

where y(t) is the desired outcome for the input x(t);
5. cycle: return to step 3, until a stop condition is met.
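A minimal sketch of these steps in Python, assuming a Heaviside activation and sweeping the training set in order; the function name and defaults are illustrative:

import numpy as np

def rosenblatt_train(X, y, eta=0.1, epochs=100, seed=0):
    """Rosenblatt training of a single McCulloch & Pitts unit (binary outcomes)."""
    rng = np.random.default_rng(seed)
    Xt = np.column_stack([np.ones(len(X)), X])     # augmented patterns x~ = [1 x^T]
    w = rng.standard_normal(Xt.shape[1])           # step 2: random initialization
    for _ in range(epochs):                        # step 5: cycle until stop condition
        for xt, yt in zip(Xt, y):                  # step 3: present a training pattern
            y_hat = 1.0 if xt @ w >= 0 else 0.0    # Heaviside output
            w = w + eta * xt * (yt - y_hat)        # step 4: update the weights
    return w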

Jorge S. Marques, IST, 2017 119/279


Example: Gaussian data

Two trials of the Rosenblatt algorithm applied to the same linearly separable data.

The training data is the same, but the outcome is different in each experiment (why?)

Jorge S. Marques, IST, 2017 120/279


Pros & cons

Pros
It can be proved that the Rosenblatt algorithm solves any binary problem in a finite number of iterations, provided the training data can be separated by a hyperplane in feature space.

Cons
It does not provide a hint on how to deal with data that cannot be separated by a hyperplane, or with regression problems that are not binary.

Most practical problems are noisy and fit into one of these categories. Therefore, a single unit trained by the Rosenblatt algorithm is seldom useful in practice.

Jorge S. Marques, IST, 2017 121/279


Multilayer perceptron
To overcome the previous limitations of a single unit, three important changes were proposed:
I architectures with multiple units, usually organized in layers, known as the multilayer perceptron (MLP);
I continuous and differentiable activation functions;
I training based on the minimization of a cost function.

For the sake of simplicity, the activation function of each unit is not explicitly represented (but it exists!). Offsets are not shown.
Jorge S. Marques, IST, 2017 122/279
Weights

Each unit i is connected to a unit j of the next layer through a weight w_ij.

unit j (layer ℓ = 1):   s_j = w_0j + Σ_{i∈input} w_ij x_i,   z_j = g(s_j)

unit j (layer ℓ > 1):   s_j = w_0j + Σ_{i∈previous layer} w_ij z_i,   z_j = g(s_j)

g(·) is the activation function and the weight w_0j is called the offset.
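A minimal sketch of these per-layer equations in Python, assuming the weights are stored as one matrix W and one offset vector w0 per layer (an organizational choice, not from the slides):

import numpy as np

def mlp_forward(x, weights, offsets, g):
    """Forward pass: for each layer, s = w0 + W z and z = g(s)."""
    z = x
    for W, w0 in zip(weights, offsets):   # one weight matrix / offset vector per layer
        z = g(w0 + W @ z)
    return z

relu = lambda s: np.maximum(0.0, s)       # an example activation function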

Jorge S. Marques, IST, 2017 123/279


Visible and hidden units

The units of the last layer are considered visible. They are the output of the network, and we denote their outputs by ŷ_i.

The units of the other layers are considered hidden, since we do not know their desired values in the training phase. They are intermediate variables used to compute the network output.
Jorge S. Marques, IST, 2017 124/279


Activation function

The activation functions should be continuous and differentiable to allow the evaluation of the influence of weight changes on the network output.

Some common choices are:

sigmoid (logistic function): g(s) = 1 / (1 + e^{−s})

sigmoid (arctangent function): g(s) = arctan s

(figures: plots of the two sigmoids)

Jorge S. Marques, IST, 2017 125/279


Activation function

linear unit: g(s) = s

ReLU, rectified linear unit (recent): g(s) = max(0, s)

(figures: plots of the two activation functions)

The ReLU is currently the recommended activation function, since it does not saturate, which makes the convergence of the gradient algorithm faster. Other units (linear, softmax) are often used in the output layer.

Jorge S. Marques, IST, 2017 126/279


Exercises
1. Compute the derivative of the activation functions:
I g(s) = 1 / (1 + e^{−s})
I g(s) = arctan s
I g(s) = s
I g(s) = max(0, s)

2. Write the equations for the network.

Jorge S. Marques, IST, 2017 127/279


Architecture & weights

To specify a multilayer perceptron, we need to indicate the architecture:
I number of layers
I number of units per layer

We also need to indicate:
I activation functions
I weights

w = {w_ij},

where w_ij is the weight connecting the output of unit i to unit j.

The network is thus a nonlinear map ŷ = f(x, w) from the input space to the output space, controlled by a set of weights w.

Jorge S. Marques, IST, 2017 128/279


How do we choose the architecture?

How should we choose the number of layers?

Cybenko (1989) proved that a multilayer perceptron with 1 hidden layer is a universal approximator of any continuous function defined on a compact subset of R^p. This is a useful theorem, but it does not explain how many units are needed nor how the weights should be chosen.

I Common practice shows that it is often better to use more layers, since the network can synthesize a wider variety of nonlinear functions with fewer units.
I It also shows that deeper networks (with more layers) are more difficult to train.

Great improvements were achieved in the last 10 years in the training of deep neural networks. The state of the art in many problems (vision, speech, text processing) is now based on neural networks.

Jorge S. Marques, IST, 2017 129/279


Perceptron training
After choosing the NN architecture, we need to learn all the weights, using a training set of labeled patterns T = {(x^(k), y^(k)), k = 1, . . . , n}.

Goal: minimize the total loss (cost)

C = Σ_{k=1}^n L(y^(k), ŷ^(k)) = Σ_{k=1}^n L^(k),

where ŷ^(k) is the network output for the input x^(k). A typical choice for the loss function is the quadratic error

L(y, ŷ) = ||y − ŷ||².

The minimization of C is often achieved by using the gradient algorithm

w_ij(t + 1) = w_ij(t) + ∆w_ij(t),   ∆w_ij(t) = −η ∂C/∂w_ij |_{w(t)},

or a modified version of it; η denotes the learning step.

Jorge S. Marques, IST, 2017 130/279


Training modes: batch, mini-batch and on-line

The gradient vector includes the contribution of all the training patterns. The weight update using all the training patterns in each iteration is called the batch mode:

∆w_ij = −η ∂C/∂w_ij = −η Σ_k ∂L(y^(k), ŷ^(k))/∂w_ij = −η Σ_k ∂L^(k)/∂w_ij.

Another alternative consists of using a single training pattern k, and updating the weights with that information only. This is called the on-line mode or stochastic gradient:

∆w_ij = −η ∂L^(k)/∂w_ij.

A third hypothesis consists of updating the NN weights using a small subset of training patterns. This is known as the mini-batch mode.

Jorge S. Marques, IST, 2017 131/279


Chain rule of differentiation

To train the weights w_ij we need the gradient of the loss function L. This task relies on the chain rule of differentiation:

dz/dx = (dz/dy)(dy/dx)        dz/dx = (∂z/∂w)(dw/dx) + (∂z/∂v)(dv/dx)

Jorge S. Marques, IST, 2017 132/279


Training a single unit
Let us start with a simple problem: a network with a single unit, trained with a single pattern (x, y).

Forward network:

s = w_0 + Σ_{i=1}^p w_i x_i,   ŷ = z = g(s)

Gradient:

∂L/∂w_p = (dL/ds)(∂s/∂w_p) = ε x_p

Therefore, the gradient is given by

∂L/∂w_p = x_p ε,   ε = dL/ds = (dL/dŷ)(dŷ/ds) = g'(s) dL/dŷ

Compare with the Rosenblatt algorithm.


Jorge S. Marques, IST, 2017 133/279
Gradient structure
The structure of the gradient can be extended to more general cases.

If unit q belongs to a layer ℓ higher than 1,

s_q = w_0q + Σ_{i∈previous layer} w_iq z_i.

Using the chain rule, the derivative of L with respect to a weight w_pq can be computed as

∂L/∂w_pq = (∂L/∂s_q)(∂s_q/∂w_pq) = z_p ε_q.

Therefore,

∂L/∂w_pq = z_p ε_q,   ε_q = ∂L/∂s_q,   q ∈ layer higher than 1.

If the unit q belongs to the first layer, z_p is replaced by x_p.

Jorge S. Marques, IST, 2017 134/279


Training the output layer
These ideas can be applied to NN with multiple layers. Let us start by
the output layer.

Forward network (unit j ∈ {6, 7}):

s_j = w_{0j} + Σ_{i ∈ previous layer} w_{ij} z_i ,    z_j = g(s_j)

Gradient (unit q ∈ {6, 7}):

∂L/∂w_{pq} = z_p ε_q

where

ε_q = ∂L/∂s_q = (∂L/∂z_q)(∂z_q/∂s_q) = g′(s_q) ∂L/∂z_q

Jorge S. Marques, IST, 2017 135/279


Training a hidden layer
Let us consider units from a hidden layer.

Forward network (j ∈ {3, 4, 5}):

s_j = w_{0j} + Σ_{i=1}^p w_{ij} z_i ,    z_j = g(s_j)

Gradient (q ∈ {3, 4, 5}):

∂L/∂w_{pq} = z_p ε_q

where

ε_q = ∂L/∂s_q = Σ_{j ∈ next layer} (∂L/∂s_j)(∂s_j/∂z_q)(∂z_q/∂s_q)
    = g′(s_q) Σ_{j ∈ next layer} w_{qj} ε_j

Jorge S. Marques, IST, 2017 136/279


Backpropagation algorithm
The gradient components are given by

∂L/∂w_ij = z_i ε_j ,

where z_i is obtained from the multilayer perceptron and ε_j is obtained
from an auxiliary network called the backpropagation network.

This algorithm for the computation of the gradient using the
backpropagation network is known as the backpropagation algorithm.
Jorge S. Marques, IST, 2017 137/279
Backpropagation network
How do we build the backpropagation network?

The backpropagation network (right) is obtained from the original
network (left) by
I linearizing nonlinear units (activation functions);
I inverting the direction of links, converting sums into derivation
points (and vice versa);
I the outputs of the linearized branches are the variables ε_i ;
I the inputs of the backpropagation network are the derivatives of the
loss with respect to the forward network outputs.
Jorge S. Marques, IST, 2017 138/279
Acceleration techniques

The convergence of the gradient algorithm is often very slow and


acceleration techniques are usually adopted, namely:

I momentum term;
I adaptive weights.

These techniques modify the weight update rule and were discussed
before in the optimization lesson. Next we summarize the steps involved
in the gradient algorithm with momentum term (batch and on-line).

Jorge S. Marques, IST, 2017 139/279


Gradient algorithm (batch) with momentum term
Set t = 1 and ∆w_ij (0) = 0. Repeat steps 1 through 4 below until the
stopping criterion is met
1. Set the variables g_ij to zero. These variables will be used to
accumulate the gradient components.
2. For k = 1, . . . , n, perform steps 2.1 through 2.4
2.1 propagate forward: apply the training pattern x^(k) to the perceptron
and compute the variables z_i and outputs ŷ_j^(k)
2.2 compute the cost derivatives: ∂L^k/∂ŷ_j^(k)
2.3 propagate backwards: apply ∂L^k/∂ŷ_j^(k) to the inputs of the
backpropagation network and compute its internal variables ε_j
2.4 compute and accumulate components: compute the variables
∂L^k/∂w_ij = z_i ε_j and accumulate each of them in the corresponding
variable, i.e., g_ij ← g_ij + z_i ε_j
3. Apply momentum: set ∆w_ij (t) = −η g_ij + α ∆w_ij (t − 1)
4. Update the weights: set w_ij (t + 1) = w_ij (t) + ∆w_ij (t)
adapted from L. Almeida, Handbook of Neural Computation, 1997.

Jorge S. Marques, IST, 2017 140/279


Gradient algorithm (on-line) with momentum term

Set t = 1 and ∆w_ij (0) = 0. Repeat step 1 until the stopping criterion is met
1. For k = 1, . . . , n, perform steps 1.1 through 1.6
1.1 propagate forward: apply the training pattern x^(k) to the perceptron
and compute the variables z_i and outputs ŷ_j^(k)
1.2 compute the cost derivatives: ∂L^k/∂ŷ_j^(k)
1.3 propagate backwards: apply ∂L^k/∂ŷ_j^(k) to the inputs of the
backpropagation network and compute its internal variables ε_j
1.4 compute the gradient components: compute the variables ∂L^k/∂w_ij = z_i ε_j
1.5 Apply momentum: set ∆w_ij (t) = −η z_i ε_j + α ∆w_ij (t − 1)
1.6 Update the weights: set w_ij (t + 1) = w_ij (t) + ∆w_ij (t)

adapted from L. Almeida, Handbook of Neural Computation, 1997.
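The sketch below puts the pieces together for a one-hidden-layer perceptron
with logistic units and quadratic loss, trained on-line with a momentum term
(a possible implementation under these assumptions; all names are illustrative):

import numpy as np

def g(s):                                        # logistic activation
    return 1.0 / (1.0 + np.exp(-s))

def train_mlp_online(X, Y, n_hidden=5, eta=0.1, alpha=0.9, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.normal(0, 0.5, (n_hidden, p + 1))   # hidden layer weights (incl. bias)
    W2 = rng.normal(0, 0.5, (1, n_hidden + 1))   # output layer weights (incl. bias)
    dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
    for _ in range(epochs):
        for k in rng.permutation(n):
            # 1. propagate forward
            x1 = np.concatenate(([1.0], X[k]))
            z = g(W1 @ x1)                       # hidden unit outputs
            z1 = np.concatenate(([1.0], z))
            yhat = g(W2 @ z1)[0]                 # network output
            # 2.-3. cost derivative and backward propagation (quadratic loss)
            eps2 = -2.0 * (Y[k] - yhat) * yhat * (1 - yhat)   # output unit
            eps1 = z * (1 - z) * (W2[0, 1:] * eps2)           # hidden units
            # 4.-5. momentum and weight update: dw = -eta * grad + alpha * dw_old
            dW2 = -eta * eps2 * z1[None, :] + alpha * dW2
            dW1 = -eta * np.outer(eps1, x1) + alpha * dW1
            W2 += dW2; W1 += dW1
    return W1, W2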

Jorge S. Marques, IST, 2017 141/279


Example
Output of a multi layer perceptron trained by the gradient algorithm
using the backpropagation method.
I data: 150 training patterns; binary outcome
I architecture: 2 inputs, 2 hidden layers and one output layer (5-3-1
units)
I activation function: logistic
I training mode: on-line, no acceleration techniques

Jorge S. Marques, IST, 2017 142/279


Regression vs classification

MLPs can be used for regression and for classification tasks.

In regression tasks the output units typically have linear activation


functions and the network is trained with quadratic loss (SSE).

In classification tasks the output units typically have logistic or Softmax


activation functions and the network is often trained with negative
log-likelihood (cross-entropy) loss. This will be discussed later.

Jorge S. Marques, IST, 2017 143/279


Example

Write all the equations required to compute the gradient components,


assuming one training example and the SSE cost

SSE = (y1 − ŷ1 )2 + (y2 − ŷ2 )2


Forward network:
s_1 = w_{01} + w_{11} x_1 ,  z_1 = g(s_1)
s_2 = w_{02} + w_{22} x_2 ,  z_2 = g(s_2)
s_3 = w_{03} + w_{13} z_1 + w_{23} z_2 ,  z_3 = g(s_3)
s_4 = w_{04} + w_{14} z_1 + w_{24} z_2 ,  z_4 = g(s_4)

Backward network:
inputs: 2(ŷ_1 − y_1 ), 2(ŷ_2 − y_2 )
ε_4 = g′(s_4 ) × 2(ŷ_2 − y_2 )
ε_3 = g′(s_3 ) × 2(ŷ_1 − y_1 )
ε_2 = g′(s_2 )[w_{24} ε_4 + w_{23} ε_3 ]
ε_1 = g′(s_1 )[w_{14} ε_4 + w_{13} ε_3 ]

Jorge S. Marques, IST, 2017 144/279


Example (cont)

∆_{01} = 1·ε_1 ,  ∆_{11} = x_1 ε_1 ;   ∆_{03} = 1·ε_3 ,  ∆_{13} = z_1 ε_3 ,  ∆_{23} = z_2 ε_3
∆_{02} = 1·ε_2 ,  ∆_{22} = x_2 ε_2 ;   ∆_{04} = 1·ε_4 ,  ∆_{14} = z_1 ε_4 ,  ∆_{24} = z_2 ε_4

Jorge S. Marques, IST, 2017 145/279


Exercises

1. Consider the multi-layer perceptron sketched in previous slides. Write


the equations for the gradient of the loss function with respect to all the
weights.

2. Write the equations for the forward and backpropagation networks
using matrix notation. Consider vectors s^(ℓ) , y^(ℓ) , ε^(ℓ) containing the s, y
and ε variables associated to layer ℓ.

Jorge S. Marques, IST, 2017 146/279


Exercises

I Consider a MLP with one layer and linear units. Prove that the
input output map performed by the perceptron is a linear (affine)
transformation
f (x) = Ax + b
I prove this statement for the case of a MLP with two layers and
linear units.

This property can be extended to an arbitrary number of layers provided
the units are linear. Matrix A may be rank deficient if one of the hidden
layers has fewer units than the input or the output.

Jorge S. Marques, IST, 2017 147/279


Analysis of images with neural networks

Imagine that you want to distinguish images of horses and cats. How
would you proceed?

Jorge S. Marques, IST, 2017 148/279


Analysis of images with neural networks

Images encode information in very complicated ways (different


viewpoints, shapes, colors, textures, illumination).

Finding a set of rules on low level image features (e.g., color, corners)
seems to be unfeasible!

Jorge S. Marques, IST, 2017 149/279


Analysis of images with neural networks

However, some properties should hold:


I There is a spatial dependence in images.
I There are small regions that convey important information (e.g.,
eyes, ears). Some kind of pattern matching might work.
I The system should be invariant to translation, color, and
illumination changes.

How do we put this information in an algorithm?

Maybe the human brain can inspire us again.

Jorge S. Marques, IST, 2017 150/279


Convolutional neural networks

Convolutional neural networks (CNN) have recently achieved an


enormous success in the analysis of images.

History

I receptive fields (Hubel and Wiesel, 1950s, 60s): individual neurons


in the visual cortex respond to small regions in the field of view.

I neocognitron (Fukushima, 1980): hierarchical model using receptive


fields.

I LeNet-5 (LeCun et al., 1998): convolutional neural network


proposed for digit recognition.

I Alexnet (Krizhevsky et al., 2012): convolutional neural network.


Breakthrough in Imagenet international challenge.

Jorge S. Marques, IST, 2017 151/279


ImageNet - Large Scale Visual Recognition Challenge

LSVRC - Imagenet dataset


I 1K categories
I 1M images (1K images per category)
I annotation: manual annotation using Amazon Mechanical Turk.

Jorge S. Marques, IST, 2017 152/279


Breakthrough 2012 - Alexnet
Alexnet won the ImageNet challenge in 2012 by a large margin.

A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012

Since then, all the winners of the ImageNet challenge are convolutional
neural networks
Jorge S. Marques, IST, 2017 153/279
End-to-end architecture

In Alexnet, no image features (handcrafted features) are defined by the


user.

Alexnet learns to directly compute the image label (class) for the input
image. Of course this requires a large training set (more than 1 million
images).

This strategy is called the end-to-end approach.

The classic blocks (feature extraction and classification) are both learned
from the training data without the use of handcrafted features.

Jorge S. Marques, IST, 2017 154/279


Basic CNN architecture

A convolutional neural network (CNN) receives an input image and


predicts an output image or a label, based on a sequence of internal
representations that extract useful information (features) from the image.

Most of these representations are 3D arrays. Each 3D array can be


viewed as a collection of 2D arrays known as channels or feature maps.

Internal representations of the image are obtained by a concatenation of


layers, including:
I convolutional layers: convolution followed by non-linearity
(activation function)
I pooling layers: dimensionality reduction
I fully connected layers: used in classification problems

Jorge S. Marques, IST, 2017 155/279


Convolution layer
A convolution layer receives a 3D input, convolves it with a set of kernels
(filters) and applies an activation function (typically RELU) to the filter
outputs.

Each kernel has a localized support in the first two (spatial) coordinates
and it is full range in the third (depth) coordinate.

Jorge S. Marques, IST, 2017 156/279


Convolutional layer

3D input: z^(ℓ−1)_{ijk}    (ℓ − 1 is the number of the input layer)

3D kernel: h^(ℓ)_{ijk}

2D output:

s^(ℓ)_{ij} = Σ_p Σ_q Σ_r h^(ℓ)_{pqr} z^(ℓ−1)_{i+p, j+q, r}

z^(ℓ)_{ij} = g(s^(ℓ)_{ij})

Each filter produces a 2D output known as a feature map. Stacking the


feature maps produced by multiple filters leads to a 3D array.
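A minimal NumPy sketch of this layer, assuming stride 1, no padding and a
RELU activation (names are illustrative):

import numpy as np

def conv_layer(z, kernels, g=lambda s: np.maximum(s, 0)):
    # z: input array (H, W, D); kernels: (F, kh, kw, D), one 3D kernel per filter
    # output: (H-kh+1, W-kw+1, F), one feature map per filter
    H, W, D = z.shape
    F, kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, F))
    for f in range(F):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # localized support in the spatial coordinates, full range in depth
                out[i, j, f] = np.sum(kernels[f] * z[i:i+kh, j:j+kw, :])
    return g(out)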

Jorge S. Marques, IST, 2017 157/279


Pooling
Pooling reduces the size of a 3D array.

Each channel is separately processed. First the channel is divided into


non-overlapping cells (e.g., ∆ × ∆). Then, each cell is replaced by a
numeric value (e.g., its maximum, or mean).

Jorge S. Marques, IST, 2017 158/279


Pooling

3D input: z^(ℓ−1)_{ijk}    (ℓ − 1 is the number of the input layer)

3D output:

z^(ℓ)_{ijk} = max_{p,q ∈ {0,...,∆−1}} z^(ℓ−1)_{∆i+p, ∆j+q, k}
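A corresponding sketch of max pooling over non-overlapping ∆ × ∆ cells
(names are illustrative; H and W are assumed multiples of ∆):

import numpy as np

def max_pool(z, delta=2):
    # z: (H, W, D); each channel is processed separately
    H, W, D = z.shape
    out = np.zeros((H // delta, W // delta, D))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            cell = z[delta*i:delta*(i+1), delta*j:delta*(j+1), :]
            out[i, j, :] = cell.max(axis=(0, 1))   # one value per cell and channel
    return out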

Jorge S. Marques, IST, 2017 159/279


Fully connected layer
The fully connected layer is used when the image representation is
converted into a 1D array.

It is often used as an output layer in classification problems.


Jorge S. Marques, IST, 2017 160/279
Alexnet

Layer 1 - convolutional
I maxpooling: No
I input: 224 × 224 × 3
I kernel: 96 × 11 × 11 × 3
I stride: 4
I units: 55 × 55 × 96

Layer 2 - max pooling followed by convolutional
I maxpooling: 2 × 2
I input: 55 × 55 × 96
I kernel: 256 × 5 × 5 × 96
I stride: 1
I units: 27 × 27 × 256

Layers 3, 4, 5 - similar

Layer 6 - pooling + fully connected
I maxpooling: 2 × 2
I input: 13 × 13 × 256
I units: 4096

Layer 7 - fully connected
I input: 4096
I units: 4096

Layer 8 - fully connected
I input: 4096
I units: 1000

Jorge S. Marques, IST, 2017 161/279


Alexnet
Number of weights to be learned

layer expression weights


1 (11 × 11) × 3 × 96 0.03 M
2 (5 × 5) × 96 × 256 0.6 M
3 (3 × 3) × 256 × 384 0.8 M
4 (3 × 3) × 384 × 384 1.3 M
5 (3 × 3) × 384 × 256 0.8 M
6 (6 × 6) × 256 × 4096 37.7 M
7 4096 × 4096 16.7M
8 4096 × 1000 4.1 M

The Alexnet has 60 million weights, almost all of them associated with the
last three layers (the fully connected layers).

This high number of weights leads to overfitting problems in the training


phase. Some kind of regularization must be considered.

Jorge S. Marques, IST, 2017 162/279


Other convolutional neural networks: VGG

I deeper network
I deeper layers
I kernels: smaller spatial dimensions (3 × 3)

Jorge S. Marques, IST, 2017 163/279


Other convolutional neural networks: GoogLeNet

convolution, pooling, softmax, merge

I inception module
I 1x1 convolution

Jorge S. Marques, IST, 2017 164/279


Other convolutional neural networks: ResNet

I very deep network


I shortcut connections

Jorge S. Marques, IST, 2017 165/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 166/279


What is a classifier?
Example: fish classification

length  volume  class
0.36    0.67    tuna
0.82    0.56    tuna
0.46    0.67    sword
0.40    0.30    sword
0.60    0.80    tuna
0.61    0.47    tuna
0.21    0.41    sword

Given an observation x ∈ Rp , we wish to predict its class y ∈ Ω, where

Ω = {ω0 , . . . , ωK −1 } or Ω = {0, . . . , K − 1}.

K is the number of classes and the ith class will be denoted by ωi or


simply by i.

We wish to learn a function f (x) that associates each feature vector


x ∈ Rp with the predicted class ŷ = f (x) ∈ Ω. This function is known as
a classifier.

Jorge S. Marques, IST, 2017 167/279


Discriminant functions

An alternative way to define a classifier is by using K functions

fi : Rp → R , i = 0, . . . , K − 1

such that x is classified in class ωi iff

fi (x) ≥ fj (x) ,  ∀j ≠ i.

These functions fi (x) are called discriminant functions.

Note: if fi (x) = fj (x) the classification is ambiguous.

Jorge S. Marques, IST, 2017 168/279


Decision regions and decision boundary

A classifier f (x) splits the input space Rp into K disjoint regions, Rj ,
each of them associated to a specific class ωj , with j ∈ {0, . . . , K − 1}:

Rj = {x ∈ Rp : f (x) = ωj } .


These regions are known as decision regions.

The boundary points of these decision regions are called decision


boundaries or decision surfaces.

Knowing the decision regions is equivalent to knowing the classifier f (x),
or a set of discriminant functions fi (x), i = 0, . . . , K − 1. In fact, the
indicator functions of the regions are a set of discriminant functions.

Jorge S. Marques, IST, 2017 169/279


Classifier design

The main question is:

Given a classification problem, how do we define the classifier f (x) or a


set of discriminant functions f0 (x), . . . , fK −1 (x) or the decision regions
R0 , . . . , RK −1 ?

The three representations are equivalent.

Two cases will be considered:


I we know the probability distribution of the data (ideal case);
I we only know a data set (training set).

Jorge S. Marques, IST, 2017 170/279


Classifier evaluation - confusion matrix

The confusion matrix P is a K × K matrix whose generic element Pij is


the joint probability of true class i being predicted as class j.

Pij = Pr {y = i, ŷ = j}

Properties:
Pij ∈ [0, 1], ∀i, j
Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} Pij = 1.

Matrix P is a joint probability distribution: the diagonal elements
correspond to true decisions and the off-diagonal elements correspond to
errors.

If we normalize each line i to sum 1, the ith normalized line explains
how the classifier predicts the data from class i: what is the probability of
error in class i and which errors are most probable.

Jorge S. Marques, IST, 2017 171/279


Probability of error

The probability of error can be obtained from the confusion matrix P

P(error) = 1 − Σ_{i=0}^{K−1} Pii

Proof

P(error) = 1 − P(correct decision)
         = 1 − Σ_{i=0}^{K−1} P(correct decision, y = i)
         = 1 − Σ_{i=0}^{K−1} P(y = i, ŷ = i)
         = 1 − Σ_{i=0}^{K−1} Pii

Jorge S. Marques, IST, 2017 172/279


How to compute the confusion matrix

In simple cases, the confusion matrix can be analytically evaluated.

Assuming that a feature vector associated to class i is generated
according to a pdf p(x|y = i), and that class j is chosen by the classifier
when x ∈ Rj ,

Pij = Pr {y = i, x ∈ Rj } = ∫_{Rj} p(x|y = i) P(y = i) dx

When the integral cannot be evaluated, the confusion matrix may be
experimentally obtained, as in the sketch below:
I perform N classification experiments;
I count how many training examples from class i are classified in class
j (Nij );
I estimate Pij using the relative frequency P̂ij = Nij / Σ_p Σ_q Npq .
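A possible NumPy sketch of this relative-frequency estimate (names are
illustrative; classes are assumed to be coded as integers 0, . . . , K − 1):

import numpy as np

def confusion_matrix(y_true, y_pred, K):
    # N[i, j] counts examples of true class i predicted as class j
    N = np.zeros((K, K))
    for i, j in zip(y_true, y_pred):
        N[i, j] += 1
    return N / N.sum()              # estimate of P_ij = Pr{y = i, yhat = j}

# probability of error: 1 minus the sum of the diagonal
# P_err = 1 - np.trace(confusion_matrix(y_true, y_pred, K))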

Jorge S. Marques, IST, 2017 173/279


Example (1)
Compute the confusion matrix and the probability of error, assuming that
I x ∈ [0, 1], y ∈ {0, 1}, p(x|y = 0) = 1, p(x|y = 1) = 2x, ∀x ∈ [0, 1];

I the classifier is characterized by the decision regions


R0 = [0, T [, R1 = [T , 1].

Confusion matrix

P00 = Pr (y = 0, ŷ = 0) = Pr (ŷ = 0|y = 0) Pr (y = 0) =
    = P0 ∫_{R0} p(x|y = 0) dx = P0 ∫_0^T 1 dx = P0 T ,

P01 = P0 − P00 = P0 (1 − T ),

P10 = Pr (y = 1, ŷ = 0) = Pr (ŷ = 0|y = 1) Pr (y = 1) =
    = P1 ∫_{R0} p(x|y = 1) dx = P1 ∫_0^T 2x dx = P1 T² ,

P11 = P1 − P10 = P1 (1 − T²).

Jorge S. Marques, IST, 2017 174/279


Example (2)

Confusion matrix

P = [ P0 T    P0 (1 − T )  ]
    [ P1 T²   P1 (1 − T²)  ]

Probability of error

P(error) = 1 − (P00 + P11 ) = P1 T² − P0 T + 1 − P1

The threshold T is chosen by the user. For example, it can be chosen by
minimizing P(error).
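For instance, minimizing P(error) with respect to T (assuming P1 > 0 and
that the minimizer falls inside [0, 1]):

dP(error)/dT = 2 P1 T − P0 = 0  ⟹  T* = P0 / (2 P1 ),

and since d²P(error)/dT² = 2 P1 > 0, this stationary point is a minimum.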

Jorge S. Marques, IST, 2017 175/279


Loss function
The confusion matrix is enough if all the errors are equally important and
the classes are equally probable.

Sometimes this is not true. We need to define a loss function L(y , ŷ )


that assigns a penalty when the true class of x is y and the predicted
class is ŷ .

Examples:

binary loss: L(y , ŷ ) = 0 if ŷ = y , and 1 otherwise.

general loss: L(y = ωi , ŷ = ωj ) = Lij ,  with Lii = 0 ,  Lij > 0 ,  i ≠ j.

The first case is a binary loss (no error/error).

The second case is a square K × K matrix of penalties with zeros in the
diagonal (decisions without error) and different costs associated to the
different types of errors.
Jorge S. Marques, IST, 2017 176/279
Loss function (cont)

Example: medical diagnosis: ω0 - no tumor , ω1 - tumor

L = [ 0 1 ]        L = [ 0 1 ]
    [ 1 0 ]            [ 5 0 ]

Question: which loss is more appropriate for this problem?

Question: which loss is more appropriate for this problem?

The loss function is not differentiable. We cannot use optimization


algorithms based on the gradient or Hessian matrix to reduce the loss :-(.

Jorge S. Marques, IST, 2017 177/279


Is there an ideal classifier?

The answer is yes, if x, y are realizations of random variables X , Y with
known distribution and we wish to minimize the expected loss, also
known as (aka) the risk (ideal case of known distributions)

R = E {L(y , ŷ (x))} .

If the loss is binary, the optimal classifier is given by

ŷ = arg max_{ω∈Ω} P(y = ω|x) .

This is known as the Bayes classifier and chooses the class with greatest
a posteriori probability (the most probable class, given the observations).

The Bayes classifier with binary loss is optimal in the sense that it
minimizes the probability of decision error.

Jorge S. Marques, IST, 2017 178/279


Is there an ideal classifier? (cont)

If we adopt a general loss function, the optimal classifier is also simple.


We compute the expected cost of choosing class ŷ = ω

cω (x) = Σ_{y∈Ω} L(y , ω) P(y |x),

and choose the class with smallest cost, i.e., the feature vector x should
be classified as follows

f (x) = arg min_{ω∈Ω} cω (x).

This is an optimal classifier in the sense that it minimizes the risk for a
general loss function; it is also known as the Bayes classifier.
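In code the rule is a one-liner; a sketch assuming the a posteriori
probabilities are available as a vector (names are illustrative):

import numpy as np

def bayes_decision(post, L):
    # post: length-K vector with P(y = omega | x); L: K x K loss matrix
    costs = post @ L          # costs[w] = sum_y L(y, w) P(y | x) = c_w(x)
    return int(np.argmin(costs))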

Jorge S. Marques, IST, 2017 179/279


Proof
Risk with a general loss matrix

R = E {L(y , ŷ (x))} = ∫ Σ_{y∈Ω} L(y , ŷ (x)) p(x, y ) dx
  = ∫ [ Σ_{y∈Ω} L(y , ŷ (x)) P(y |x) ] p(x) dx

Minimization can be independently performed at each feature vector x:

f (x) = arg min_{ω∈Ω} Σ_{y∈Ω} L(y , ω) P(y |x) = arg min_{ω∈Ω} cω (x)

If the loss is binary we obtain

f (x) = arg min_{ω∈Ω} [1 − P(ω|x)] = arg max_{ω∈Ω} P(ω|x)

Jorge S. Marques, IST, 2017 180/279


a posteriori distribution of the classes

The a posteriori distribution of the classes P(y = i|x) is the distribution


of the classes after observing the feature vector x.

These probabilities can be obtained by using the Bayes law

P(y = i|x) = p(x|y = i) P(y = i) / p(x) ,
where
I p(x|y = i) - distribution of the feature vector x associated to class i;
I P(y = i) - a priori distribution of the classes (before knowing the
observations);
I p(x) - normalization term that does not influence the decision,
p(x) = Σ_{y∈Ω} p(x|y ) P(y ).

Jorge S. Marques, IST, 2017 181/279


Example: binary classification with Gaussian features (1d)

This example considers two classes (K = 2) with Gaussian 1d features:
x|ωi ∼ N(µi , σi²), i = 0, 1, with equal variances σ0² = σ1² = σ².
A priori distribution: P1 = P(ω1 ), P0 = P(ω0 ) = 1 − P1 .

[Figure: three panels — the conditional distribution of the data p(x|y );
the joint distribution of data and classes p(x, y ); the a posteriori
distribution of the classes P(y |x).]

Data classification can be obtained by maximizing the joint distribution


of data and class p(x, y ) or the a posteriori distribution of classes P(y |x),
with respect to y .
Jorge S. Marques, IST, 2017 182/279
Exercises

1. Consider a discrete feature variable x ∈ {1, 2, 3, 4} and an associated
binary class y ∈ {ω0 , ω1 }. Assume that y is characterized by the a
priori distribution P(ω0 ) = 0.4, P(ω1 ) = 0.6 and the observations x
are characterized by the conditional distribution defined by the table.

p(x|y )    ω0     ω1
x = 1      0.3    0.2
x = 2      0.2    0.3
x = 3      0.1    0.4
x = 4      0.4    0.1

Derive the Bayes classifier assuming a binary loss.

Jorge S. Marques, IST, 2017 183/279


Exercises

2. Consider an observation x ∈ {0, . . . , p} generated by one of the two
binomial distributions

P(x|ωi ) = C(p, x) αi^x (1 − αi )^{p−x} ,  i = 0, 1,

where C(p, x) is the binomial coefficient and p, α1 > α0 are known
parameters. Find the decision regions of the Bayes classifier, assuming
a binary loss matrix and equally probable classes.

3. Assume that x ∈ R0+ is a realization of a random variable with one
of the following density functions

p(x|y = k) = αk e^{−αk x} ,  k = 0, 1,  α1 > α0 > 0.

Find the decision regions associated to both classes. Assume that
P1 = 2 P0 .

Jorge S. Marques, IST, 2017 184/279


Learning the classifier

In practice, we often do not know the joint distribution of the features


and true class p(x, y ), required in the design of the Bayes classifier.

In many practical problems, all we know is a training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} with n realizations of the pair X , Y .

We could learn a classifier by minimizing the empirical risk

R = (1/n) Σ_{i=1}^n L(y^(i) , f (x^(i) ))

However, this is a difficult approach because y (i) and f (x (i) ) are


categorical variables and most optimization algorithms cannot be used.

Jorge S. Marques, IST, 2017 185/279


Learning the classifier

Alternative approaches are required. Some classification techniques try to


approximate the ideal (Bayes) classifier by estimating the a posteriori
probabilities of the classes, P(y |x), as a function of the feature vector x.

This can be done directly by proposing a class of functions for such


probabilities or by estimating the data distribution p(x|y = k),
k = 1, . . . , K and applying the Bayes law.

Other methods try to directly estimate a set of discriminant functions


without trying to estimate the data distribution which is considered to be
a more difficult problem.

This approach is supported by the Vapnik principle.

Vapnik principle: When trying to solve a problem, we should not solve a


more difficult problem as an intermediate step.

Jorge S. Marques, IST, 2017 186/279


Example - digit recognition
Digit recognition aims to recognize handwritten digits in images, in an
automatic way. It involves two steps:
I The first step consists of computing a bounding box for each digit
with, e.g., 20 × 20 pixels.
I The second step involves the classification of each 20 × 20 image.

examples from MNIST data set

If the feature vector, x, contains the intensity of 400 pixels, it is very


difficult to estimate the conditional distribution p(x|ωi ) : R400 → R.

Jorge S. Marques, IST, 2017 187/279


Naı̈ve Bayes classifier
When the feature vector x = [x1 , . . . , xp ]^T contains many features, the
estimation of the conditional distribution p(x|y = k) is a difficult
problem.

The Naı̈ve Bayes classifier simplifies the problem by making a drastic
assumption: it assumes that the features are conditionally independent

p(x1 , . . . , xp |y = k) = Π_{i=1}^p p(xi |x1 , . . . , xi−1 , y = k) = Π_{i=1}^p p(xi |y = k)

This means that we only need to estimate the conditional distribution of
each feature.

In the digit recognition problem this means that we have to estimate the
conditional distribution of each pixel, which is a simple task.

The Naı̈ve Bayes classifier is a suboptimal classifier if the independence
assumption is not true, but it often leads to surprisingly good results.
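A sketch of the resulting decision rule (an assumption-laden illustration:
priors[k] is assumed to hold P(y = k) and cond[k][i] to map each value of
feature i to p(x_i | y = k); sums of logs are used to avoid underflow):

import numpy as np

def naive_bayes_predict(x, priors, cond):
    # score each class by log P(y = k) + sum_i log p(x_i | y = k)
    scores = [np.log(priors[k]) +
              sum(np.log(cond[k][i][x[i]]) for i in range(len(x)))
              for k in range(len(priors))]
    return int(np.argmax(scores))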

Jorge S. Marques, IST, 2017 188/279


Exercises

1. Draw a pair of scatter plots for two features (x1 , x2 ), assuming that
they are
I dependent;
I independent.

2. Discuss the problem of e-mail classification (spam/non-spam).

Jorge S. Marques, IST, 2017 189/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 190/279


Linear methods for classification

We denote by linear classifiers those whose decision boundaries are linear
(hyperplanes) or piece-wise linear.

One example is the class of methods based on affine discriminant functions
fi (x) = [1 x^T ]βi , where i stands for the class.

The decision boundary between two classes ωi , ωj is the set

{x ∈ Rp : [1 x^T ](βi − βj ) = 0} ,

which is (a subset of) a hyperplane in the feature space Rp .
Jorge S. Marques, IST, 2017 191/279


Class coding

Classification problems aim to predict a class label y ∈ {ω0 , . . . , ωK −1 }.

Some classifiers represent the class label by numbers and use regression
methods to predict those numbers.

One idea: ω0 → 0
ω1 → 1
ω2 → 2
This does not make much sense because in most problems there is no
natural order among the class labels.

A more interesting approach is the use of binary indicator variables:

y0 y1 y2
ω0 → 1 0 0
ω1 → 0 1 0
ω2 → 0 0 1

Jorge S. Marques, IST, 2017 192/279


One hot encoding

The indicator variable of class ωi is

yi = 1 if class ωi occurs, and 0 otherwise

The representation of the class labels through a set of indicator variables
is known as one hot encoding.

Classification works as follows. In the training phase, a set of predictors
fi (x) is learned to fit the indicator variables.

In the test phase, new feature vectors are classified by computing the
predictors fi (x) and selecting the one with greatest value

ŷ = arg max_i fi (x)

Jorge S. Marques, IST, 2017 193/279


Training indicator variable with constant input

Consider n realizations of an indicator variable y associated to an
arbitrary class ω: y^(1) , . . . , y^(n) (the input x^(k) is assumed constant).

Let us minimize the sum of squared errors

SSE = Σ_{k=1}^n (y^(k) − ŷ )²

Since y^(k) is a binary variable, the SSE can be split into two terms

SSE = n0 (0 − ŷ )² + n1 (1 − ŷ )²

where n0 , n1 are the numbers of 0s and 1s. Setting the derivative to zero,
−2n1 (1 − ŷ ) + 2n0 ŷ = 0, the minimization of SSE leads to

ŷ = n1 / (n0 + n1 )

This means that the minimization of SSE leads to the estimation of the
class probability P(ω). This idea can be extended as we will see.

Jorge S. Marques, IST, 2017 194/279


Linear regression of indicator variables

Consider a binary classification problem with classes ω0 , ω1 and let us


assume that y is the indicator variable of class ω1

We fit a linear model f (x) = [1 x^T ]β to the training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} using least squares.

The function f (x) can be considered as an estimate of the a posteriori
distribution of class ω1 . Since f (x) is linear, it takes values outside the
interval [0, 1].

After training the model, a new observation x can be classified by
comparing f (x) with a threshold 0.5

ŷ = 1 if f (x) > 0.5, and 0 otherwise.

Jorge S. Marques, IST, 2017 195/279


Example - 1D data
This example discusses two binary classification problems with 1D
features for which we know a training set (see figures). The data was fit
by a straight line and we display the decision boundary (cyan).

The first problem is well solved by linear regression of the indicator


variables. The second is not: all the features are classified in the same
class. Why?

Why does the linear regression with indicator variables fail in the second
example?
Jorge S. Marques, IST, 2017 196/279
Example - 2D data

This slide shows two problems with 2D features and a linear model. Only
the first problem can be solved by linear models. Why?

In the second case all the training features are classified in the same class.

Jorge S. Marques, IST, 2017 197/279


Regression with more flexible models
The previous difficulties can be circumvented by using more flexible
models (e.g. 2nd order polynomials in R and R2 ).


Notice, though, that these are not linear models with respect to x; they
are linear in the parameters, which can be estimated by a linear system
of equations.

Jorge S. Marques, IST, 2017 198/279


Drawbacks & extensions
The regressor function f (x) can be interpreted as an estimate of the a
posteriori probabilty P(ω1 |x) but it is not constrained to be in the
interval [0, 1]. Since the model is linear, it will take all real values.

The decision boundary between two classes is hyperplane. Therefore, the


technique can only be used if the data is well separated by a hyperplane.

The model can be easily extended to more flexible classes of functions


e.g., polynomials, radial basis functions, neural networks.

This approach can be easily extended to more than 2 classes by


considering K indicator variables (one per class) and fit a linear model to
predict these labels (one vs. all). This is often called one hot encoding.
Label prediction is performed by choosing the discriminant function with
the highest value
ŷ (x) = arg max fi (x).
i

Jorge S. Marques, IST, 2017 199/279


Logistic regression

Consider a binary classification problem in which y ∈ {0, 1}. The Bayes
classifier is based on the a posteriori distribution of the classes

P(y = 1|x), P(y = 0|x).

The logistic regression proposes a parametric model for the a posteriori
probabilities

P(y = 1|x) = 1 / (1 + e^{−x^T β}) ,    P(y = 0|x) = e^{−x^T β} / (1 + e^{−x^T β}) .

Where x ∈ Rp+1 is the feature vector and β ∈ Rp+1 the vector of
parameters to be estimated. We have included β0 in the vector β and
extended the feature vector x with a 1.

Jorge S. Marques, IST, 2017 200/279


Logistic regression

This model guarantees that

P(y = 1|x), P(y = 0|x) ∈ [0, 1] ,    P(y = 0|x) + P(y = 1|x) = 1.

It can be rewritten as follows

P(y = 1|x) = g (x^T β) ,    P(y = 0|x) = 1 − g (x^T β)

where

g (s) = 1 / (1 + e^{−s})

What is the relationship between the logistic regression and a perceptron


unit?
What is the meaning of the perceptron output ŷ , in this context?

Jorge S. Marques, IST, 2017 201/279


Logistic regression: learning

Given a training set T = {(x^(i) , y^(i) ), i = 1, . . . , n}, the coefficients β
can be estimated by the maximum likelihood method

β̂ = arg max_β ℓ(β) ,

where ℓ(β) is the conditional log-likelihood function

ℓ(β) = log P(y^(1) , . . . , y^(n) |x^(1) , . . . , x^(n) ; β) .

Since the training examples are independent

ℓ(β) = Σ_{i=1}^n log P(y^(i) |x^(i) )
     = Σ_{i=1}^n { y^(i) log[g (x^(i)T β)] + (1 − y^(i) ) log[1 − g (x^(i)T β)] } .

This function cannot be analytically optimized. We have to use
numerical optimization algorithms, e.g., the gradient ascent method.
Jorge S. Marques, IST, 2017 202/279
Logistic regression - gradient ascent

The gradient of the conditional log-likelihood function ℓ(β) can be easily
computed

∇β ℓ(β) = Σ_{i=1}^n [y^(i) − g (x^(i)T β)] x^(i) .

Therefore, the gradient ascent algorithm is given by

β^(t+1) = β^(t) + γ ∇β ℓ(β^(t) ) ,  γ > 0 .
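A compact NumPy sketch of this training loop (the fixed step and iteration
count are illustrative choices):

import numpy as np

def logistic_fit(X, y, gamma=0.1, iters=1000):
    # X: (n, p+1) with a leading column of ones; y in {0, 1}
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = 1.0 / (1.0 + np.exp(-X @ beta))     # g(x^T beta) for all patterns
        beta = beta + gamma * X.T @ (y - g)     # gradient ascent on l(beta)
    return beta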

Is it possible to train the logistic regressor using the SSE criterion?

Jorge S. Marques, IST, 2017 203/279


Log-likelihood gradient

Proof for 1 pattern (dropping the index i)

∇β ℓ(β) = ∇β { y log g (x^T β) + (1 − y ) log[1 − g (x^T β)] }

= y [g′(x^T β)/g (x^T β)] x − (1 − y ) [g′(x^T β)/(1 − g (x^T β))] x

Using g′(s) = g (s)[1 − g (s)],

= y [1 − g (x^T β)] x − (1 − y ) g (x^T β) x

= [y − g (x^T β)] x

Jorge S. Marques, IST, 2017 204/279


Example - logistic regression with 2D data

Consider two classification problems with 2D features described before.


Figures show the decision boundaries obtained by logistic regression.
Only the first problem can be solved by linear models.


In the second case all the training features are classified in the same
class. The model is too rigid.
Jorge S. Marques, IST, 2017 205/279
Logistic regression with more flexible models

The previous difficulties can be circumvented by using more flexible


models (e.g., 2nd order polynomials).


Notice that these models are not linear models with respect to x.

Jorge S. Marques, IST, 2017 206/279


Softmax
Softmax extends logistic regression to classification problems with an
arbitrary number of classes K .
The true class is expressed using indicator variables, a.k.a. one hot
encoding

y = (y0 , . . . , yK −1 ),  yi ∈ {0, 1},  Σ_{i=0}^{K−1} yi = 1

SOFTMAX proposes a model for the a posteriori probabilities of the
classes

ŷi = P(yi = 1|x, β) = e^{si} / Σ_{c=0}^{K−1} e^{sc}

where

si = Σ_j βji xj

This model guarantees that P(yi = 1|x, β) ∈ [0, 1], i = 0, . . . , K − 1 and
Σ_{i=0}^{K−1} P(yi = 1|x, β) = 1.

Jorge S. Marques, IST, 2017 207/279


Softmax: learning

Given a training set T = {(x^(1) , y^(1) ), . . . , (x^(n) , y^(n) )}, the coefficients β
can be estimated by the maximum likelihood method

β̂ = arg max_β ℓ(β) ,

where ℓ(β) is the conditional log-likelihood function

ℓ(β) = log P(y^(1) , . . . , y^(n) |x^(1) , . . . , x^(n) ; β) .

Since the training examples are independent

ℓ(β) = Σ_{m=1}^n log P(y^(m) |x^(m) ; β) = Σ_{m=1}^n Σ_{i=0}^{K−1} y_i^(m) log(ŷ_i^(m) ) .

This function cannot be analytically optimized. We must resort to
numerical optimization algorithms, e.g., the gradient ascent method.
Jorge S. Marques, IST, 2017 208/279


Softmax: gradient

log-likelihood (1 training example): ℓ = Σ_{i=0}^{K−1} yi log(ŷi )

Derivatives:

∂ℓ/∂ŷi = yi / ŷi

∂ŷi/∂sk = ŷi (1 − ŷi ) if i = k ;  −ŷi ŷk if i ≠ k

∂ℓ/∂si = Σ_k (∂ℓ/∂ŷk )(∂ŷk/∂si ) = (yi/ŷi ) ŷi (1 − ŷi ) − Σ_{k≠i} (yk/ŷk ) ŷk ŷi
       = yi − ŷi    (using Σ_k yk = 1)

∂ℓ/∂βij = (∂ℓ/∂si )(∂si/∂βij ) = (yi − ŷi ) xj

Jorge S. Marques, IST, 2017 209/279


Softmax: update

The previous expressions can be extended to multiple training examples.

log-likelihood (multiple examples):

ℓ(β) = Σ_{m=1}^n Σ_{i=0}^{K−1} y_i^(m) log(ŷ_i^(m) )

gradient:

∂ℓ/∂βij = Σ_{m=1}^n (y_i^(m) − ŷ_i^(m) ) x_j^(m)

Therefore, the gradient ascent algorithm is given by

β_ij^(t+1) = β_ij^(t) + γ ∂ℓ/∂βij ,  γ > 0 .
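A vectorized NumPy sketch of this update (names are illustrative; the scores
are shifted by their maximum before exponentiation, a standard numerical
stability trick that does not change ŷ):

import numpy as np

def softmax_fit(X, Y, gamma=0.1, iters=1000):
    # X: (n, p+1) with a leading column of ones; Y: (n, K) one hot encoded labels
    K, d = Y.shape[1], X.shape[1]
    B = np.zeros((K, d))                        # B[i, j] = beta_ij
    for _ in range(iters):
        S = X @ B.T                             # scores s_i for every pattern
        E = np.exp(S - S.max(axis=1, keepdims=True))
        Yhat = E / E.sum(axis=1, keepdims=True)
        B = B + gamma * (Y - Yhat).T @ X        # dl/dbeta_ij = sum_m (y_i - yhat_i) x_j
    return B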

Jorge S. Marques, IST, 2017 210/279


Linear discriminant analysis
Let us consider a binary classification problem. The a posteriori
probabilities can be obtained from the Bayes law

P(y = i|x) = p(x|y = i) P(y = i) / p(x) ∝ p(x|y = i) Pi ,

where the a priori probabilities Pi are approximated by the relative
frequencies of each class in the training set. The main challenge is the
estimation of p(x|y = i).

Linear discriminant analysis assumes that the distribution of the data
associated to class y = i follows a normal distribution N (µi , Σi ) and the
data from all classes share the same covariance matrix Σi = Σ.
Therefore,

p(x|y = i) = C e^{−(1/2)(x−µi )^T Σ^{−1} (x−µi )} ,

where C = 1/((2π)^{p/2} |Σ|^{1/2} ) is a normalization constant.
Jorge S. Marques, IST, 2017 211/279


Linear discriminant analysis

Under these hypotheses, the decision boundary between classes i, j is
given by

p(x|y = i) Pi = p(x|y = j) Pj

C e^{−(1/2)(x−µi )^T Σ^{−1} (x−µi )} Pi = C e^{−(1/2)(x−µj )^T Σ^{−1} (x−µj )} Pj .

The constants C are equal because the covariance matrices are the same.
Taking logs leads to

(µi − µj )^T Σ^{−1} x = (1/2)(µi + µj )^T Σ^{−1} (µi − µj ) + log(Pj /Pi ) .

This is the equation of a hyperplane with normal vector (µi − µj )^T Σ^{−1} .
Jorge S. Marques, IST, 2017 212/279


Linear discriminant analysis
Proof

log[p(x|y = i) Pi ] = log[p(x|y = j) Pj ]

log C − (1/2)(x−µi )^T Σ^{−1} (x−µi ) + log Pi
    = log C − (1/2)(x−µj )^T Σ^{−1} (x−µj ) + log Pj

(x − µi )^T Σ^{−1} (x − µi ) − 2 log Pi = (x − µj )^T Σ^{−1} (x − µj ) − 2 log Pj

−2µi^T Σ^{−1} x + µi^T Σ^{−1} µi = −2µj^T Σ^{−1} x + µj^T Σ^{−1} µj − 2 log(Pj /Pi )

(µi − µj )^T Σ^{−1} x = (1/2) µi^T Σ^{−1} µi − (1/2) µj^T Σ^{−1} µj + log(Pj /Pi )

(µi − µj )^T Σ^{−1} x = (1/2)(µi + µj )^T Σ^{−1} (µi − µj ) + log(Pj /Pi )

Jorge S. Marques, IST, 2017 213/279


Linear discriminant analysis

In practice, the parameters of the Gaussian distributions are learned from
the training data

P̂(ωk ) = nk / n

µ̂k = (1/nk ) Σ_{i: y^(i) = k} x^(i)

Σ̂ = 1/(n − K ) Σ_{k=1}^K [ Σ_{i: y^(i) = k} (x^(i) − µ̂k )(x^(i) − µ̂k )^T ] .

The covariance matrix has p² entries. If p is large (hundreds or
thousands), Σ̂ may be inaccurate and singular; its inverse, which LDA
requires, then does not exist. It is common practice to enforce additional
constraints on Σ, e.g., to assume Σ is a diagonal matrix.
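A NumPy sketch of these estimates and of the resulting discriminants (names
are illustrative; as noted above, a diagonal or otherwise constrained Σ̂ may
be needed when p is large):

import numpy as np

def lda_fit(X, y, K):
    n, p = X.shape
    P = np.array([(y == k).mean() for k in range(K)])           # priors n_k / n
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # class means
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[y == k] - mu[k]
        Sigma += D.T @ D
    return P, mu, Sigma / (n - K)

def lda_predict(x, P, mu, Sigma):
    # linear discriminants: mu_k' S^-1 x - (1/2) mu_k' S^-1 mu_k + log P_k
    Si = np.linalg.inv(Sigma)
    scores = [mu[k] @ Si @ x - 0.5 * mu[k] @ Si @ mu[k] + np.log(P[k])
              for k in range(len(P))]
    return int(np.argmax(scores))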

Jorge S. Marques, IST, 2017 214/279


Example: LDA with 2D data

Consider two classification problems with 2D features described before.


Only the first can be solved by LDA.

In the second case all the training features are classified in the same
class. The model is too rigid.

Jorge S. Marques, IST, 2017 215/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 216/279


Support vector machines

Support vector machines (SVMs) were proposed by Vapnik and


Chervonenkis in 1963 for binary classification problems with linear
decision boundaries. They were extended later for nonlinear decision
boundaries and for regression problems.

Main idea: separate the cloud of data in two regions, using a carefully
chosen hyperplane.

The SVM classifiers are often described in three steps:


I linear classifiers with hard margin
I linear classifiers with soft margin
I non linear classifiers

Suggested report: Fletcher, Support Vector Machines Explained, UCL,


2008.

Jorge S. Marques, IST, 2017 217/279


Hyperplanes

How do we define a hyperplane in Rp ?

x ·w +b =0

where
I x ∈ Rp - point on the hyperplane
I w ∈ Rp - normal vector to the hyperplane
I b ∈ R - offset
I x · w is the inner product between w , x ∈ Rp

distance to the origin: |b| / ‖w‖

note: the parameters w , b are defined up to a scale factor (they require
some kind of normalization).

Jorge S. Marques, IST, 2017 218/279


Linear classifiers

A linear classifier compares each input vector with an hyperplane decision


boundary

x · w + b > 0 ⇒ ŷ = +1
x · w + b < 0 ⇒ ŷ = −1

For the sake of simplicity we assume that y ∈ {−1, +1}.

Therefore,
ŷ = sign(x · w + b)

Main question: how do we learn the hyperplane parameters from the


training data?

Jorge S. Marques, IST, 2017 219/279


Case I: linearly separable data
Training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} with x^(i) ∈ Rp , y^(i) ∈ {−1, 1}

Let us assume that the training data can be separated without errors by
an hyperplane (linearly separable data). In fact if there is one, there is an
infinite number of separating hyperplanes ...

w ·x +b =0

Problem: which hyperplane should


we choose?

Jorge S. Marques, IST, 2017 220/279


Hard margin

Consider a hyperplane that separates the training data without errors and
is equally distant to the nearest examples of both classes.

The training points closest to the


hyperplane are called support
vectors.

The sum of the two distances from the
support vectors of each class to the
decision hyperplane is the margin.

The hyperplanes parallel to the decision hyperplane, that contain the


support vectors are known as margin hyperplanes.

Jorge S. Marques, IST, 2017 221/279


Margin hyperplanes

Margin hyperplanes: training data on the margin hyperplanes verify

x^(i) · w + b = +1 for y^(i) = +1
x^(i) · w + b = −1 for y^(i) = −1

Margin: 2 / ‖w‖

Constraints: the training data must obey

x^(i) · w + b ≥ +1 for y^(i) = +1
x^(i) · w + b ≤ −1 for y^(i) = −1
⇒ y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i

Jorge S. Marques, IST, 2017 222/279


Maximum margin classifier

The SVM classifier chooses the hyperplane with the maximum margin
(maximum margin classifier).

Difficulty:

The decision hyperplane can be


computed from the support vectors.

However, initially we do not know the support vectors. Their selection


requires the decision hyperplane.

Question: how can we break this tie?

Jorge S. Marques, IST, 2017 223/279


Exercise
Consider the following data sets.

Data set 1:            Data set 2:
x1   x2   y            x1   x2   y
0    3    −1           0    9    −1
0    −3   −1           4    1    −1
4    1    −1           4    4    −1
4    −2   −1           0    0    +1
0    0    −1           0    4    +1
0    −2   +1           1    1    +1
1    1    +1

For each of them,


I plot the data,
I find if it is linearly separable,
I find the support vectors and margin hyperplanes
I find the margin

Jorge S. Marques, IST, 2017 224/279


Exercise (cont)
First data set: it is not linearly separable, i.e., it cannot be separated
by a hyperplane.

Second data set: it is separable by a hyperplane.

Support vectors:
class −1 : (0, 9), (4, 1)
class +1 : (0, 4)

Margin hyperplanes (solving x^(s) · w + b = y^(s) for the support vectors):

[ 0 9 1 ] [ w1 ]   [ −1 ]
[ 4 1 1 ] [ w2 ] = [ −1 ]     ⇒   w = −(1/5) [4, 2]^T ,  b = 13/5 .
[ 0 4 1 ] [ b  ]   [ +1 ]

Margin: 2/‖w‖ = √5

Jorge S. Marques, IST, 2017 225/279


Optimization problem (hard margin)

We break the tie by solving an optimization problem.

We wish to maximize the margin (2/‖w‖) under the constraints described
above. This leads to the following optimization problems

Optimization problem 1: min ‖w‖ , s.t. y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i.

Optimization problem 2: min (1/2)‖w‖² , s.t. y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i.

This is a quadratic optimization problem with linear constraints.

Jorge S. Marques, IST, 2017 226/279


Lagrangian formulation (primary)

Let us adopt a Lagrangian formulation in order to deal with the


constraints on the training points.

Lagrangian function

LP = (1/2)‖w‖² − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1]
   = (1/2)‖w‖² − Σ_{i=1}^n αi y^(i) (x^(i) · w + b) + Σ_{i=1}^n αi ,

where αi ≥ 0 are Lagrange multipliers.

w , b should be chosen to minimize LP , and αi to maximize it. This is


known as the primary Lagrangian problem.

Jorge S. Marques, IST, 2017 227/279


Lagrangian formulation (dual)

Optimization

∂LP /∂w = 0 ⇒ w = Σ_{i=1}^n αi y^(i) x^(i) ,

The normal vector w is obtained by a linear combination of the training
patterns and only the support vectors contribute.

∂LP /∂b = 0 ⇒ Σ_{i=1}^n αi y^(i) = 0 .

Jorge S. Marques, IST, 2017 228/279


Lagrangian formulation (dual)

Replacing these variables, we obtain the dual formulation,

LD = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi y^(i) (x^(i) · x^(j) ) y^(j) αj ,
     s.t. Σ_{i=1}^n αi y^(i) = 0 ,

where αi ≥ 0 are Lagrange multipliers.

The dual formulation depends only on the inner products between input
vectors x (i) · x (j) . This is very important!

Jorge S. Marques, IST, 2017 229/279


Proof

LP = (1/2)‖w‖² − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1]

LP = (1/2)‖w‖² − Σ_{i=1}^n αi y^(i) x^(i) · w − b Σ_{i=1}^n αi y^(i) + Σ_{i=1}^n αi

Since w = Σ_{i=1}^n αi y^(i) x^(i) and Σ_{i=1}^n αi y^(i) = 0,

LP = Σ_{i=1}^n αi − (1/2)‖w‖²

LP = Σ_{i=1}^n αi − (1/2) ( Σ_{i=1}^n αi y^(i) x^(i) ) · ( Σ_{j=1}^n αj y^(j) x^(j) )

LP = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi y^(i) (x^(i) · x^(j) ) y^(j) αj

Jorge S. Marques, IST, 2017 230/279


Dual Lagrangian problem

max_α Σ_{i=1}^n αi − (1/2) α^T Hα ,  s.t. αi ≥ 0 ∀i ,  Σ_{i=1}^n αi y^(i) = 0 ,

where α = [α1 . . . αn ]^T and Hij = y^(i) (x^(i) · x^(j) ) y^(j) .

This is a convex quadratic programming (QP) problem that can be
solved by standard QP algorithms and provides all the αi .

From the α’s we may obtain:

1 support vectors (S): all x^(s) such that αs > 0;
2 normal vector: w = Σ_{s∈S} αs y^(s) x^(s) ;
3 offset: b = (1/Ns ) Σ_{s∈S} [ y^(s) − Σ_{m∈S} αm y^(m) (x^(m) · x^(s) ) ] ;
4 classification of data: f (x) = sign(x · w + b).

Jorge S. Marques, IST, 2017 231/279


Comments

We note that only support vectors contribute to the estimation of w , b.

Matrix H does not require the training patterns themselves, x (i) , but only
inner products of training vectors x (i) · x (j) .

The SVM algorithm provides not only a decision but also a score
f (x) = x · w + b.

Jorge S. Marques, IST, 2017 232/279


Exercises

1. Prove that if we know αi , i = 1, . . . , n, we can obtain the offset
from the support vectors by

b = (1/Ns ) Σ_{s∈S} [ y^(s) − Σ_{m∈S} αm y^(m) (x^(m) · x^(s) ) ]

2. Which formulation (primary or dual) has more parameters to


optimize?

Jorge S. Marques, IST, 2017 233/279


Case II: data that cannot be separated by an hyperplane

SVMs can be extended to deal with data that is not linearly separable. In
this case it is not possible to classify all the training vectors without
errors, using an hyperplane.

Idea: allow data points on the wrong


side of the margin hyperplane,
provided that they suffer a penalty.
This is known as soft margin.

Jorge S. Marques, IST, 2017 234/279


Soft margin

The idea is to assign a slack (folga) variable ξi to each data point x^(i) ,
defined in such a way that ξi = 0 if no margin violation occurs and ξi > 0
if the ith point is on the wrong side of the margin hyperplane.

Soft margin penalty: C Σ_{i=1}^n ξi

All the training examples on the wrong side of the margin are
considered as support vectors since they influence the decision boundary.

The constraints can be written as follows

x^(i) · w + b ≥ +1 − ξi for y^(i) = +1
x^(i) · w + b ≤ −1 + ξi for y^(i) = −1
⇒ y^(i) (x^(i) · w + b) − 1 + ξi ≥ 0, ∀i

with ξi ≥ 0.

Jorge S. Marques, IST, 2017 235/279


Optimization problem (soft margin)

Optimization problem:

min (1/2)‖w‖² + C Σ_{i=1}^n ξi  s.t. y^(i) (x^(i) · w + b) − 1 + ξi ≥ 0, ∀i.

Lagrangian function

LP = (1/2)‖w‖² + C Σ_{i=1}^n ξi − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1 + ξi ]
     − Σ_{i=1}^n µi ξi

where αi , µi ≥ 0 are Lagrange multipliers.

w , b and ξi should be chosen to minimize LP , and αi , µi to maximize it.
Jorge S. Marques, IST, 2017 236/279


Dual Lagrangian problem

max_α Σ_{i=1}^n αi − (1/2) α^T Hα ,  s.t. 0 ≤ αi ≤ C ∀i ,  Σ_{i=1}^n αi y^(i) = 0

where α = [α1 . . . αn ]^T and Hij = y^(i) (x^(i) · x^(j) ) y^(j) .

This is a convex quadratic programming (QP) problem that can be
solved by standard QP algorithms and provides all the αi .

The classifier parameters, w , b, are obtained the same way as before.

Jorge S. Marques, IST, 2017 237/279


Example - linearly separable data
This example shows a separable data set classified with hard margin (left)
and soft margin (center, right).

[Figure: hard margin; soft margin (C=10); soft margin (C=0.1).
Support vectors are identified with a circle.]

The soft margin classifier with a large C is equal to the hard margin classifier.
Jorge S. Marques, IST, 2017 238/279
Example - data not linearly separable

This example shows data not linearly separable, classified with soft
margin.

[Figure: soft margin (C=10); soft margin (C=0.1).]

The choice of C controls the margin width.

Jorge S. Marques, IST, 2017 239/279


Hinge loss

The slack variables can be obtained by using the hinge loss

ξi = max( 0, 1 − y^(i) (x^(i) · w + b) )

Therefore, the linear SVM with soft margin minimizes

Σ_{i=1}^n max( 0, 1 − y^(i) (x^(i) · w + b) ) + λ‖w‖²

with λ = 1/(2C ).
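This unconstrained form can be minimized directly, e.g., by subgradient
descent; a minimal sketch (the per-pattern treatment of the regularizer is an
illustrative simplification):

import numpy as np

def linear_svm_sgd(X, y, lam=0.01, eta=0.01, epochs=100, seed=0):
    # subgradient descent on sum_i max(0, 1 - y_i (x_i . w + b)) + lam ||w||^2
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p); b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated: hinge term active
                w -= eta * (2 * lam * w - y[i] * X[i])
                b -= eta * (-y[i])
            else:                              # only the regularizer contributes
                w -= eta * 2 * lam * w
    return w, b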

Jorge S. Marques, IST, 2017 240/279


Case III: non linear SVM

Linear SVMs classify data using an hyperplane trained with hard margin
or soft margin. This is too restrictive, especially when the dimension of
input space is low.

Jorge S. Marques, IST, 2017 241/279


Transformed space

Many problems require a decision boundary with curvature which cannot


be synthesized by a linear SVM.

Linear classifiers work better in higher dimensional spaces. Therefore, one


strategy consists of mapping the data from the original input space into a
high dimension space (feature space)

x̃ = φ(x)

where the data can be separated by an hyperplane. φ is a nonlinear map.

Questions: can SVMs be extended to these high dimension feature


spaces? or will they become unfeasible?

Jorge S. Marques, IST, 2017 242/279


The kernel trick

The linear SVM algorithm (dual formulation) does not require the input
vectors x (i) but only inner products between them x (i) · x (j) .

This means that we do not need to compute the feature vectors


(transformed input vectors) x̃ (i) = φ(x (i) ) but only their inner products
φ(x (i) ) · φ(x (j) )

The good news is that we can compute these inner products using a
kernel function

k(x^(i) , x^(j) ) = φ(x^(i) ) · φ(x^(j) ),

that can be computed in the low dimension input space.

The non linear SVM can be trained and tested using low dimension data,
by replacing the inner products by the kernel.

Jorge S. Marques, IST, 2017 243/279


Typical kernels

The most common choices are:

linear: k(x^(i) , x^(j) ) = x^(i)T x^(j)

rbf: k(x^(i) , x^(j) ) = e^{−(1/2σ²)‖x^(i) − x^(j)‖²}

polynomial: k(x^(i) , x^(j) ) = (x^(i)T x^(j) + a)^b

The linear kernel is the one adopted in linear SVM.

We note that some kernels depend on hyperparameters that have to be


specified or learned during the training phase e.g., typically by ad hoc
procedures or by cross validation.
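For instance, with scikit-learn (assuming that library is available; the data
below is an illustrative toy problem):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)                 # two Gaussian blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# rbf kernel: gamma plays the role of 1/(2 sigma^2); C is the soft margin penalty
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print(len(clf.support_vectors_))               # support vectors found by the QP solver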

Jorge S. Marques, IST, 2017 244/279


Examples - SVM
Two examples (training data) solved by Matlab (function svmtrain) with
rbf kernel. Support vectors are identified with a circle.
[Figure: two scatter plots (classes 0 and 1) with the learned decision
boundaries; support vectors are marked with circles.]

Jorge S. Marques, IST, 2017 245/279


Extension

How can we solve multi-class classification problems with SVMs?

Jorge S. Marques, IST, 2017 246/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 247/279


Decision trees

Decision trees are popular classifiers since they make it possible to
understand why the input pattern is classified in a specific class.

This is important in several applications (e.g., medical diagnosis).

Decision trees are often used when the features are categorical, although
they have been extended to numerical features as well.

The next slides address:
I how is categorical data classified by a tree?
I how is the tree trained?
I how can trees be extended to numerical features?

Jorge S. Marques, IST, 2017 248/279


Example: Good days to play tennis

Suppose we wish to predict what are the good days to play tennis and we
have the following dataset associated to a player (John).

Day Outlook Humidity Wind Play


1 Sunny High Weak No
2 Sunny High Strong No
3 Overcast High Weak Yes
4 Rain High Weak Yes
5 Rain Normal Weak Yes
6 Rain Normal Strong No
7 Overcast Normal Strong Yes
8 Sunny High Weak No
9 Sunny Normal Weak Yes
10 Rain Normal Weak Yes
11 Sunny Normal Strong Yes
12 Overcast High Strong Yes
13 Overcast Normal Weak Yes
14 Rain High Strong No
adapted from Quinlan, 1986

Days with the same attributes may have different outcomes (noisy labels).
Jorge S. Marques, IST, 2017 249/279
A decision tree
A decision tree that solves the problem is

[Figure: root tests Outlook; Sunny → test Humidity (High → No, Normal → Yes);
Overcast → Yes; Rain → test Wind (Weak → Yes, Strong → No).]

adapted from Quinlan, 1986

This tree contains three types of nodes:
I one root node;
I splitting nodes;
I leaf nodes (associated to labels).
Each splitting node is associated with a question.
Jorge S. Marques, IST, 2017 250/279
What training data is associated to each node?

Each node has a subset of training examples associated to it.

node 3
Day Outlook Humidity Wind Play
3 Overcast High Weak Yes
7 Overcast Normal Strong Yes pure subset!
12 Overcast High Strong Yes
13 Overcast Normal Weak Yes
node 2 node 4
Day Outlook Humidity Wind Play Day Outlook Humidity Wind Play
1 Sunny High Weak No 4 Rain High Weak Yes
2 Sunny High Strong No 5 Rain Normal Weak Yes
8 Sunny High Weak No 6 Rain Normal Strong No
9 Sunny Normal Weak Yes 10 Rain Normal Weak Yes
11 Sunny Normal Strong Yes 14 Rain High Strong No

If the training examples associated to a node have the same label, the
node is called pure. Pure nodes are not split anymore and receive a label.
Jorge S. Marques, IST, 2017 251/279
What training data is associated to each node? (2)

node 5 node 7
Day Outlook Humidity Wind Play Day Outlook Humidity Wind Play
9 Sunny Normal Weak Yes 4 Rain High Weak Yes
11 Sunny Normal Strong Yes 5 Rain Normal Weak Yes
10 Rain Normal Weak Yes
node 6
Day Outlook Humidity Wind Play node 8
1 Sunny High Weak No Day Outlook Humidity Wind Play
2 Sunny High Strong No 6 Rain Normal Strong No
8 Sunny High Weak No 14 Rain High Strong No
both subsets are pure! both subsets are pure!

Jorge S. Marques, IST, 2017 252/279


Classification of new data

Classification of new data is done in the same way. Given a feature vector
x, we travel along the tree based on the feature values until we reach a
leaf (with a label).

There will be classification errors not only in the test set but also in the
training set, if there are examples with the same attributes and different
labels. This is known as noisy labels.

Jorge S. Marques, IST, 2017 253/279


Posterior distribution of classes at each node

Consider a training set T = {(x^(1) , y^(1) ), . . . , (x^(n) , y^(n) )}. A decision
tree associates each training pattern x^(i) to a node m through a sequence
of questions.

We can estimate the a posteriori distribution of the labels associated to
each tree node m, using the training data

P(k|m) = (1/#Tm ) Σ_{x^(i) ∈ Tm} I (y^(i) = k)

where Tm is the set of training patterns associated to node m and I (.) is
the indicator function (the indicator function is 1 if the argument is true
and 0 otherwise).

If node m is a leaf, the most probable label is

k̂(m) = arg max_k P(k|m)

Jorge S. Marques, IST, 2017 254/279


Exercise

Consider the tennis data set and the decision tree shown above. Find the
probability of each label (Yes or No), at each node.

Jorge S. Marques, IST, 2017 255/279


Exercise (cont.)

node m  #Yes  #No  P(Yes|m)  k̂(m)
1       9     5    9/14      Yes
2       2     3    2/5       No
3       4     0    1         Yes
4       3     2    3/5       Yes
5       0     3    0         No
6       2     0    1         Yes
7       0     2    0         No
8       3     0    1         Yes

Jorge S. Marques, IST, 2017 256/279


Node impurity
Ideally, each leaf m should be pure i.e., all the training examples arriving
at node m should have the same label (class). Since this is not always
true we need a measure of impurity.

Several impurity measures have been proposed. They all achieve a


minimum if all the data associated to a leaf comes from a single class.

I Misclassification error:

i(m) = 1 − max_k P(k|m) = 1 − P(k̂(m)|m)

I Entropy:

i(m) = − Σ_{k=1}^K P(k|m) log2 P(k|m)

I Gini index:

i(m) = Σ_{k=1}^K P(k|m) (1 − P(k|m))

Jorge S. Marques, IST, 2017 257/279


Node impurity (binary case)
In binary classification problems (2 labels), the impurity depends on a
single statistic P(k = 1|m). Figure shows the misclassification error
(red), the Gini index (green), and the entropy (blue) as a function of
P(k = 1|m). The vertical scale was modified.

The entropy and Gini index are smoother and they are usually preferred
in model training.

Jorge S. Marques, IST, 2017 258/279


Tree training

Given a data set, we wish to learn a tree T . Each splitting node


corresponds to a question and each leaf corresponds to a label.

Training a tree amounts to minimizing the tree impurity

I (T ) = Σ_{m∈T̃} P(m) i(m)

where P(m) is the fraction of training patterns associated to leaf m and
T̃ is the set of all the leaf nodes.

The tree impurity is an average impurity of the leaf nodes.

Training a tree amounts to finding a tree that minimizes I (T ). We


should generate all the tree configurations, compute the tree impurity for
each configuration, and choose the one with smallest impurity.



Drawbacks

This approach has two drawbacks:

1. The optimal solution cannot be found: exhaustive search of all tree
configurations is not feasible, so greedy approaches are used instead.

2. The criterion I(T) optimizes the performance on the training set, but
this is highly optimistic and leads to overfitting.



Tree growing

To overcome the first difficulty, we start with a single node m (the root)
and choose the best attribute for splitting it. This is done as follows.

For each attribute X_j, we split m and create child nodes s ∈ S, each
of them associated to a different value of the attribute. We compute the
impurity of each child and the impurity drop with respect to the impurity
of node m

$$\Delta I = i(m) - \sum_{s \in S} \frac{p(s)}{p(m)}\, i(s)$$

The attribute that achieves the greatest drop is selected.

The splitting process is repeated for another leaf node until a stop
condition is met, for example, until all the leaves are pure or all the
attributes have been tested.
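A sketch of this computation for a single categorical attribute, with the
entropy as impurity (the helper names and the encoding of the tennis data
are ours):

    import math
    from collections import Counter

    def H(labels):
        """Entropy of a list of labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def impurity_drop(values, labels):
        """Delta I = i(m) - sum_s p(s)/p(m) i(s) for one categorical attribute."""
        n = len(labels)
        drop = H(labels)
        for v in set(values):
            subset = [y for x, y in zip(values, labels) if x == v]
            drop -= (len(subset) / n) * H(subset)
        return drop

    # tennis example at the root, attribute Wind (days 1..14)
    wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
            "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    print(impurity_drop(wind, play))   # ~0.048, a small drop for Wind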



ID3 algorithm

The ID3 algorithm is a basic tree learning method for categorical data.

It has the following features:


I impurity criterion: entropy;
I stop criterion: stop when each leaf is pure or, if not, all the attributes
have been tested along the path from the root to the impure leaf.

When the data is noisy (noisy labels or noisy attributes), the ID3
algorithm may overfit the training data, leading to poor performance on
independent data sets (test sets).

This drawback can be alleviated by early stop or by post-processing (tree
pruning).
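For concreteness, a compact recursive sketch of ID3 for categorical data
(our own minimal implementation, not the original code: entropy criterion,
majority-vote leaves when the attributes run out):

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def id3(rows, labels, attributes):
        """rows: list of dicts {attribute: value}. Returns a nested-dict tree."""
        if len(set(labels)) == 1:              # pure node -> leaf
            return labels[0]
        if not attributes:                     # attributes exhausted -> majority leaf
            return Counter(labels).most_common(1)[0][0]

        def gain(a):                           # impurity drop for attribute a
            g = H(labels)
            for v in set(r[a] for r in rows):
                sub = [y for r, y in zip(rows, labels) if r[a] == v]
                g -= (len(sub) / len(labels)) * H(sub)
            return g

        best = max(attributes, key=gain)       # greedy choice of the question
        rest = [a for a in attributes if a != best]
        tree = {best: {}}
        for v in set(r[best] for r in rows):
            sub_rows = [r for r in rows if r[best] == v]
            sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
            tree[best][v] = id3(sub_rows, sub_labels, rest)
        return tree

Applied to the tennis data, this reproduces the tree obtained above
(Outlook at the root, then Humidity and Wind).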



Exercises

1. Apply the ID3 algorithm to the tennis data set.

2. Consider the following data set (vertebrates). Check which of the
attributes is chosen by the ID3 algorithm for the root node (don't
consider the name and the skin cover).

#   name           body temperature  skin cover  gives birth  aquatic creature  aerial creature  has legs  hibernates  class label
1   human          warm-blooded      hair        Yes          No                No               yes       no          mammal
2   python         cold-blooded      scales      No           No                No               no        yes         non-mam
3   salmon         cold-blooded      scales      No           Yes               No               no        no          non-mam
4   whale          warm-blooded      hair        Yes          Yes               No               no        no          mammal
5   frog           cold-blooded      none        No           Semi              No               yes       yes         non-mam
6   Komodo dragon  cold-blooded      scales      No           No                No               yes       no          non-mam
7   bat            warm-blooded      hair        Yes          No                yes              yes       yes         mammal
8   pigeon         warm-blooded      feathers    No           No                yes              yes       no          non-mam
9   cat            warm-blooded      fur         Yes          No                No               yes       no          mammal
10  leopard        warm-blooded      fur         Yes          No                No               yes       no          mammal
11  turtle         cold-blooded      scales      No           semi              No               yes       no          non-mam
12  penguin        warm-blooded      feathers    No           semi              No               yes       no          non-mam
13  porcupine      warm-blooded      quills      Yes          No                No               yes       yes         mammal
14  eel            cold-blooded      scales      No           Yes               No               yes       no          non-mam
15  salamander     cold-blooded      none        No           semi              No               no        yes         non-mam

adapted from Kumar, Introduction to Data Mining, 2014.



Solution of first exercise
Test root

O H W  P
S H W  N
S H S  N
O H W  Y
R H W  Y
R N W  Y
R N S  N
O N S  Y
S H W  N
S N W  Y
R N W  Y
S N S  Y
O H S  Y
O N W  Y
R H S  N

Outlook   N  Y
R         2  3
S         3  2
O         0  4
i(O) = 5/14 × 0.97 + 5/14 × 0.97 + 4/14 × 0 = 0.69

Humidity  N  Y
H         4  3
N         1  6
i(H) = 7/14 × 0.98 + 7/14 × 0.59 = 0.78

Wind      N  Y
S         3  3
W         2  6
i(W) = 6/14 × 1 + 8/14 × 0.81 = 0.89

Best choice for the root is Outlook, and Outlook=Overcast is a pure node with
label Yes.
Solution (cont.)
Node: Outlook=Sunny

H W  P
H W  N
H S  N
H W  N
N W  Y
N S  Y

Humidity  N  Y
H         3  0
N         0  2
Two pure nodes: i(H) = 0

Node: Outlook=Rain

H W  P
H W  Y
N W  Y
N S  N
N W  Y
H S  N

Humidity  N  Y
H         1  1
N         1  2
i(H) = 2/5 × 1 + 3/5 × 0.91 = 0.95

Wind      N  Y
S         2  0
W         0  3
Two pure nodes: i(W) = 0

Best choice for the node Outlook=Sunny is Humidity, which leads to two pure
nodes. Best choice for Outlook=Rain is Wind, which also leads to two pure nodes.
Solution (cont.)

The decision tree we have obtained: Outlook at the root, a pure leaf (Yes)
for Outlook=Overcast, a Humidity test under Outlook=Sunny, and a Wind test
under Outlook=Rain. [Figure: the final decision tree]



When to stop? early stop

I The impurity of the tree drops or remains constant every time a
node is split. In the limit, we can grow the tree until each leaf is pure
or all the attributes along that path have been used. This approach
leads to overfitting.

I A second approach consists of using a validation technique. The tree
is grown using a subset of the training data (70%) and evaluated using
the remaining patterns (30%) (the validation set).

I Another strategy consists of growing the tree while the impurity
drop is above a threshold, ∆I > β.

I Another approach is based on a regularization criterion

$$I(T) + \alpha \tilde{N}$$

where Ñ is the number of leaf nodes.


Example - exclusive OR

Consider a toy problem:

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0

I would you be able to predict y with a tree classifier?

I compute the impurity drop for the splitting of the root node, using
the entropy criterion. What do you conclude?
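As a quick numerical check for the second question (a sketch with our own
entropy helper):

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    x1 = [0, 0, 1, 1]
    x2 = [0, 1, 0, 1]
    y  = [0, 1, 1, 0]

    for name, x in [("x1", x1), ("x2", x2)]:
        drop = H(y)
        for v in set(x):
            sub = [yi for xi, yi in zip(x, y) if xi == v]
            drop -= (len(sub) / len(y)) * H(sub)
        print(name, drop)   # 0.0 for both attributes

The impurity drop is zero for both splits, although a depth-2 tree predicts
y perfectly: greedy growing has no look-ahead.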



Drawbacks

I Early stop is not a good strategy to train the model because it
suffers from a lack of sufficient look-ahead.

I It is better to grow the tree until the leaves are pure or all attributes
have been used, and then prune the tree.

I Instability: a small change in the training patterns may lead to big
changes of the decision boundaries.



Tree pruning

Tree pruning is a tree simplification that aims to improve the performance
of the tree. The performance is usually measured as the number of
classification errors in the validation set.

Pruning usually involves three steps:

I try several simplifications
I evaluate each of them
I choose the best

The process is repeated until no further improvement can be achieved.



Subtree replacement

Test all the splitting nodes in a bottom-up way. For each tested node,
remove its descendants (subtree) and replace the splitting node by a leaf.
The change is accepted if the modified tree has a better or equal
performance (number of errors in the validation set).
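A sketch of this procedure on the nested-dict trees of the ID3 sketch above
(a leaf is a label; a splitting node is {attribute: {value: subtree}}). The
recursion routes the validation examples down the tree; taking the
replacement leaf as the majority validation label is a simplification of ours:

    from collections import Counter

    def predict(tree, row, default):
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(row.get(attr), default)
        return tree

    def n_errors(tree, rows, labels, default):
        return sum(predict(tree, r, default) != y for r, y in zip(rows, labels))

    def prune(tree, rows, labels, default):
        """Bottom-up subtree replacement, evaluated on validation data."""
        if not isinstance(tree, dict) or not labels:
            return tree
        attr = next(iter(tree))
        for v in tree[attr]:                   # prune the descendants first
            idx = [i for i, r in enumerate(rows) if r.get(attr) == v]
            tree[attr][v] = prune(tree[attr][v],
                                  [rows[i] for i in idx],
                                  [labels[i] for i in idx], default)
        leaf = Counter(labels).most_common(1)[0][0]
        if n_errors(leaf, rows, labels, default) <= n_errors(tree, rows, labels, default):
            return leaf                        # accept: equal or better performance
        return tree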
Error estimation

When there is no validation set (too few training examples), the error in
the validation set is predicted by using a pessimistic estimator given by

$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
1+ N
where
I z - parameter that depends on the confidence degree c (if c = 25%,
z = 0.69)
I f - percentage of error in the training set
I N - number of training examples in the leaf



Can trees be applied with numerical features?

[Figure: two-dimensional data with three classes (colors)]



Yes! using thresholds

Threshold values have to be estimated during the tree growing process,
usually by exhaustive search: all threshold values are considered for each
feature and the best impurity drop is selected.
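A sketch of this search for one numerical feature, scoring each candidate
threshold (the midpoint between consecutive sorted values) by its entropy
drop; the names are ours:

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Exhaustive search over the candidate thresholds of one feature."""
        pairs = sorted(zip(values, labels))
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        n = len(ys)
        best_t, best_drop = None, -1.0
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue                       # no threshold between equal values
            t = (xs[i] + xs[i - 1]) / 2
            drop = H(ys) - (i / n) * H(ys[:i]) - ((n - i) / n) * H(ys[i:])
            if drop > best_drop:
                best_t, best_drop = t, drop
        return best_t, best_drop

    print(best_threshold([1.0, 3.0, 2.0, 4.0], ["a", "b", "a", "b"]))  # (2.5, 1.0)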



Bootstrap aggregation (bagging) for regression

Bootstrap aggregation, also called bagging, is an ensemble method that
can be used to improve the performance of regressors and classifiers.

Consider a regression problem with a training set of n independent and
identically distributed (iid) patterns, drawn from a distribution P

$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$

Using the previous methods, we can learn a function f(x) under uncertainty.

One way to improve the estimate f(x) would be to consider multiple
training sets.



Multiple training sets

Consider B training sets generated from the same (ideal) distribution P

$$T^{(1)} = \{(x^{(1,1)}, y^{(1,1)}), \ldots, (x^{(1,n)}, y^{(1,n)})\}$$
$$\vdots$$
$$T^{(B)} = \{(x^{(B,1)}, y^{(B,1)}), \ldots, (x^{(B,n)}, y^{(B,n)})\}$$

This allows us to estimate B regression functions f^(1)(x), ..., f^(B)(x),
each from a different training set. These functions can be combined
(aggregated) by averaging

$$\hat{f}(x) = \frac{1}{B} \sum_{i=1}^{B} f^{(i)}(x)$$

to reduce the uncertainty.



Bootstrap

There is only one difficulty: we do not know the ideal distribution P. All
we know is the first data set T and the empirical distribution computed
from it.

The trick consists of generating the multiple data sets T^(i) using the
Bootstrap method, i.e., by sampling the set T, n times, with replacement.

Of course, we can no longer claim that the T^(i) are statistically
independent, but the technique still improves the estimation of f.
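Putting the pieces together for regression: a sketch with numpy for the
bootstrap sampling and sklearn decision-tree regressors as base models
(the base model and the toy data are our choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(200)

    B = 50
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # sample T with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

    x_new = np.array([[1.0]])
    f_hat = np.mean([m.predict(x_new)[0] for m in models])  # aggregate by averaging
    print(f_hat)   # close to sin(1) ~ 0.84, with reduced variance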



Bagging in classification problems
Let us consider a classification problem with a training set

$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$

where the outcomes y^(i) belong to a finite set of labels {0, ..., K − 1}.

Given a new input pattern x, the trained classifier produces a vector of
probability estimates [P(y = 0|x), ..., P(y = K − 1|x)], subject to
estimation errors.

The Bagging algorithm is as follows:

1. generate B training sets T^(i) from T by bootstrap, i.e., by sampling
T with replacement.
2. train a classifier from each set T^(i) and compute the a posteriori
distributions [P^(i)(y = 0|x), ..., P^(i)(y = K − 1|x)].
3. aggregate all the estimates

$$\hat{P}(y = k \mid x) = \frac{1}{B} \sum_{i=1}^{B} P^{(i)}(y = k \mid x)$$
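A sketch of the three steps, aggregating the predict_proba outputs of
sklearn decision trees on the iris data (both choices are ours):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)

    B = 25
    probas = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))           # 1. bootstrap T^(i)
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])   # 2. train on T^(i)
        probas.append(clf.predict_proba(X[:5]))              #    P^(i)(y = k|x)
    p_hat = np.mean(probas, axis=0)                          # 3. aggregate
    print(p_hat)                  # bagged estimates of P(y = k|x)
    print(p_hat.argmax(axis=1))   # the corresponding class decisions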


Random forest
Random forest is a very simple and yet very powerful classifier. It
achieves state-of-the-art results in many problems.

The algorithm is based on an ensemble of tree classifiers trained with
bagging.

The Random forest algorithm:

1. generate B training sets T^(i) from T by bootstrap, i.e., by sampling
T with replacement.
2. train a tree classifier from each set T^(i) with the following
modification: randomly select a subset of the features at each node, and
only those features are candidates for splitting. This procedure is
known as random subspace. The percentage of feature candidates at
each node is a parameter of the algorithm.
3. compute the a posteriori distributions
[P^(i)(y = 0|x), ..., P^(i)(y = K − 1|x)] for each tree.
4. aggregate all the a posteriori distributions

$$\hat{P}(y = k \mid x) = \frac{1}{B} \sum_{i=1}^{B} P^{(i)}(y = k \mid x)$$
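In practice the algorithm is available off the shelf. A minimal usage sketch
with scikit-learn, where max_features sets the size of the random subspace
tried at each node:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # B = 100 bootstrapped trees; sqrt(p) features are candidates at each split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0).fit(X_tr, y_tr)
    print(forest.score(X_te, y_te))           # accuracy on the test set
    print(forest.predict_proba(X_te[:3]))     # aggregated a posteriori estimates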