
Machine Learning Slides

draft - under construction

Jorge S. Marques

July 22, 2022

Jorge S. Marques, IST, 2017 1/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 2/279


What is machine learning?

Many engineering problems can be solved by using models that depend on a small number of variables.

Examples:
I motion of a rocket → Newton's law: mẍ(t) = F(t)
I electromagnetic waves → Maxwell equations

... but other problems are more complex and cannot be tackled with
closed form expressions.

Jorge S. Marques, IST, 2017 3/279


Hospital problem
Suppose a patient enters a hospital and we wish to predict whether he/she is going to live or die.

There is no general principle that can be used to solve this problem.

Maybe we have a data set of previous examples, including information about the patient status (medical tests / symptoms) and the outcome (live/die).

T1 T2 . . . Tp y

Ti : ith medical test / symptom
y : outcome

How can we use this information to predict the outcome for a new
patient?
Jorge S. Marques, IST, 2017 4/279
What is Machine Learning?

”the field of study that gives computers the ability to learn without being
explicitly programmed.” (Arthur Samuel, 1959)

Arthur Samuel was a pioneer in the area of Machine Learning.

”A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience.” (Tom Mitchell, 1998)

Tom Mitchell is a professor of Computer Science at CMU.

Jorge S. Marques, IST, 2017 5/279


Data growth on the internet

The Economist
Jorge S. Marques, IST, 2017 6/279
Applications

I prediction
I time series analysis
I speech recognition - conversion of the speech signal into text
I machine translation
I detection of failures
I image denoising
I human activity recognition
I medical image analysis - e.g., cancer detection in images
I robot navigation
I self driving car

Some of these are amongst the most difficult problems in engineering.

Jorge S. Marques, IST, 2017 7/279


Amazing progress in image recognition

Xuedong Huang - Deep Learning and Intelligent Applications

Current machine learning methods perform better than humans in this task.

Jorge S. Marques, IST, 2017 8/279


Visual recognition - AlexNet (2012)

Alexnet 2012 - Krizhevsky, Sutskever, Hinton

Jorge S. Marques, IST, 2017 9/279
Visual recognition - AlexNet (2012)

Alexnet 2012 - Krizhevsky, Sutskever, Hinton

Jorge S. Marques, IST, 2017 10/279


Image description (2015)

Karpathy, Fei-Fei - CVPR 2015

Jorge S. Marques, IST, 2017 11/279


Course overview

Structure: lectures (4h/week) + lab (1.5h/week) + problem sessions (1.5h/week).

Lab: students, organized in groups of 2, should carry out a project and write a 10-page report. The project includes 2 parts: a regression problem and a classification problem. Lab enrolment is done through fenix in the first week.

Programming: Python. An introduction is provided in the 1st week (problem sessions).

Grading: exam (50%) + Lab (50%).

Jorge S. Marques, IST, 2017 12/279


Learning problems
There are several learning problems. The major categories are:

I Supervised learning - the computer receives a set of inputs and desired outputs and aims to find the map between them.

I Unsupervised learning - the computer receives a set of inputs but no desired outputs. The goal is to find the structure of data (probability distribution, groups).

I Reinforcement learning - aims to learn the behavior of software agents or robots based on feedback from the environment.

This course is focused on the first learning category.

Jorge S. Marques, IST, 2017 13/279


Example 1 (supervised learning)
Suppose we want to predict the price of a flat in Lisbon (in K euros), knowing its area (in m²). Fortunately, we know some examples.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: scatter plot of price vs. area)

The area is known as a feature and the price is the outcome.

Question: how can we predict the price of a flat?

Jorge S. Marques, IST, 2017 14/279


Example 2 (supervised learning)
A fishing boat has a sonar system that measures the length and volume of each fish (features).

Given the table with the length and volume of tuna and swordfish (class),
we wish to design a system that predicts the class.

length volume class
0.36  0.67  tuna
0.82  0.56  tuna
0.46  0.67  sword
0.40  0.30  sword
0.60  0.80  tuna
0.61  0.47  tuna
0.21  0.41  sword

(figure: scatter plot of the two features, colored by class)

Question: how can we predict the type of fish?


Jorge S. Marques, IST, 2017 15/279
Key concepts

Identify the key concepts in previous examples:

I features

I outcome

I predictor

Jorge S. Marques, IST, 2017 16/279


Problem formulation: supervised learning

Given an input variable x ∈ R^p (vector of features), we wish to predict an output variable y (outcome), i.e., we wish to find a map between the input space and the output space, assuming we know a set of input-output pairs.

This operation is known as model learning.

Problem: Given a set of examples (training set)

T = {(x^(i), y^(i)), i = 1, . . . , n},   x^(i) ∈ R^p,

we wish to estimate a function (predictor)

ŷ = f(x)

such that ŷ is, in some sense, close to y.

Jorge S. Marques, IST, 2017 17/279


Regression vs classification
If the output y is a scalar or a vector (y ∈ R or R^p), the problem is known as a regression problem.

(figure: scatter plot of price vs. area)

If the output y is a label (categorical variable)

y ∈ Ω,   Ω = {ω_0, . . . , ω_{K−1}},

the problem is known as a classification problem.

(figure: 2D scatter plot; each color represents a different class)

Jorge S. Marques, IST, 2017 18/279


System architecture

The design of a machine learning system comprises a training phase, to learn a model, and a testing phase, to predict new data and eventually assess the model performance.

This diagram does not consider the choice of features and their extraction, e.g., if we are dealing with an image or speech analysis problem, what features do we extract from the signal? This issue is application dependent and will not be considered.
Jorge S. Marques, IST, 2017 19/279
Main questions

The block diagram suggests three main questions:

I what class of functions should we consider?

I how do we fit the function f to the training data, i.e., how do we select the function?

I how do we evaluate the predictor?

These questions have multiple answers, which we will discuss throughout this course.

Jorge S. Marques, IST, 2017 20/279


Data sets

Data sets are important tools to train and evaluate machine learning systems. They allow us to compare different techniques and they often foster the development of new methods.

There are many sites with data sets. One example is:
https://archive.ics.uci.edu/ml/datasets.html

Jorge S. Marques, IST, 2017 21/279


One of the oldest: Fisher Iris flower data set
setosa versicolor virginica

Wikimedia

Sepal length  Sepal width  Petal length  Petal width  Species
5.1  3.5  1.4  0.2  I. setosa
4.9  3.0  1.4  0.2  I. setosa
4.7  3.2  1.3  0.2  I. setosa
4.6  3.1  1.5  0.2  I. setosa
5.0  3.6  1.4  0.3  I. setosa
...  ...  ...  ...  ...
7.7  2.6  6.9  2.3  I. virginica
7.9  3.8  6.4  2.0  I. virginica

(150 examples)
Jorge S. Marques, IST, 2017 22/279
One of the oldest: Fisher Iris flower data set

(figures: scatter plots of feature 1 vs 2 and feature 3 vs 4)

Species: setosa (red), versicolor (green), virginica (blue)

Please note that the scale is not the same on both axes.

Jorge S. Marques, IST, 2017 23/279


ImageNet 2012
URL: www.image-net.org
Data set: 10 million images
Classes: 1000+

Jorge S. Marques, IST, 2017 24/279


Nearest neighbor method
Suppose we wish to predict a variable y knowing an input vector x ∈ Rp .

Suppose we also know a collection of training examples (training set)

T = {(x^(i), y^(i)), i = 1, . . . , n}.

A simple strategy to predict y for new values of x consists of finding the training pattern x^(i) nearest to x and approximating y by y^(i).

Let (x_(1), y_(1)), . . . , (x_(n), y_(n)) be a reordering of the training set such that

||x_(1) − x|| ≤ ||x_(2) − x|| ≤ · · · ≤ ||x_(n) − x||.

The nearest neighbor (NN) method assigns to x the outcome of its nearest neighbor:

f(x) = y_(1)

This is valid for both classification and regression problems.

Jorge S. Marques, IST, 2017 25/279


k nearest neighbor

The NN method can be extended to take into account not one but k
nearest neighbors of x.

In classification problems, the predicted class is chosen as the most voted class in the sequence (y_(1), . . . , y_(k)):

f(x) = most voted class in (y_(1), . . . , y_(k)).

In regression problems, the predicted value is chosen as the average of (y_(1), . . . , y_(k)):

f(x) = (1/k) Σ_{i=1}^k y_(i)
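A minimal sketch of both kNN predictors in Python/NumPy (the course language); the array names and the brute-force distance computation are illustrative choices, not part of the slides:

import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=1, task="classification"):
    """k nearest neighbor prediction for a query point x, given training data (X, y)."""
    dist = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every training pattern
    nearest = np.argsort(dist)[:k]         # indices of the k nearest neighbors
    if task == "classification":
        return Counter(y[nearest]).most_common(1)[0][0]   # most voted class
    return y[nearest].mean()               # average outcome (regression)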

Jorge S. Marques, IST, 2017 26/279


Example: supervised classification problem

Consider a binary classification problem. The training set is shown in the figure and was generated by a mixture of Gaussians.

training data

How would you classify this data?

Jorge S. Marques, IST, 2017 27/279


k Nearest neighbor

Decision regions of kNN classifier with k = 1 (left) and k = 10 (right)

What k would you choose?

Jorge S. Marques, IST, 2017 28/279


Recommended Bibliography

I T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2009.
I T. Mitchell, Machine Learning, McGraw Hill, 1997.
I J. S. Marques, Reconhecimento de Padrões: Métodos Estatísticos e Neuronais, IST Press, 2nd ed., 2005.
I R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2nd edition, 2000.
I L. Almeida, Multilayer Perceptrons, in Handbook of Neural Computation, Oxford Press, 1997.
I I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, http://www.deeplearningbook.org, 2016.
I T. Fletcher, Support Vector Machines, UCL, 2008.

Jorge S. Marques, IST, 2017 29/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 30/279


Regression Problem

Consider the data set T = {(x^(i), y^(i)), i = 1, . . . , n} defined in the table.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: scatter plot of price vs. area)

We wish to predict the price of a flat in Lisbon, taking its area into
account.

The simplest prediction model is a straight line

ŷ = f (x) = β0 + β1 x

β0 is called the intercept or offset.


Jorge S. Marques, IST, 2017 31/279
Predictor estimation
How should we estimate the coefficients β0 , β1 ?

If we know one training example (x^(1), y^(1)), we obtain a single equation

ŷ^(1) = β0 + β1 x^(1) → infinitely many (β0, β1) solutions

If we know two training examples, we obtain two equations

ŷ^(1) = β0 + β1 x^(1)
ŷ^(2) = β0 + β1 x^(2) → unique solution, but bad (noisy)

If we know three training examples, we obtain three equations

ŷ^(1) = β0 + β1 x^(1)
ŷ^(2) = β0 + β1 x^(2) → no solution (inconsistent system)
ŷ^(3) = β0 + β1 x^(3)

To solve this problem, we must assume that there is an error between the output of the model ŷ^(i) and the data y^(i).

Jorge S. Marques, IST, 2017 32/279


Prediction loss
We assume that there is a prediction error associated with each training example

e^(i) = y^(i) − ŷ^(i) = y^(i) − f(x^(i)),

and define a quadratic loss (cost)

L(y^(i), ŷ^(i)) = (y^(i) − ŷ^(i))².

The total loss in the training set,

SSE = Σ_{i=1}^n (y^(i) − f(x^(i)))²,

is also known as (aka) the sum of squared errors (SSE), or the least squares (LS) criterion.
Jorge S. Marques, IST, 2017 33/279
Minimization: first order model

Model fit is achieved by minimizing the total loss in the training set

min_{β0,β1} Σ_{i=1}^n (y^(i) − β0 − β1 x^(i))².

The minimum is achieved at a point (β̂0, β̂1) such that the partial derivatives are zero (the gradient vector is the null vector)

∇_β SSE = [∂SSE/∂β0  ∂SSE/∂β1]^T = 0.

This leads to

∂SSE/∂β0 = 0  ⇒  −2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) = 0
∂SSE/∂β1 = 0  ⇒  −2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) x^(i) = 0.

Jorge S. Marques, IST, 2017 34/279


Analytic optimization

−2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) = 0
−2 Σ_{i=1}^n (y^(i) − β̂0 − β̂1 x^(i)) x^(i) = 0

Σ_{i=1}^n β̂0 + Σ_{i=1}^n β̂1 x^(i) = Σ_{i=1}^n y^(i)
Σ_{i=1}^n β̂0 x^(i) + Σ_{i=1}^n β̂1 x^(i) x^(i) = Σ_{i=1}^n y^(i) x^(i)

This leads to the normal equations:

[ Σ_{i=1}^n 1        Σ_{i=1}^n x^(i)     ] [ β̂0 ]   [ Σ_{i=1}^n y^(i)       ]
[ Σ_{i=1}^n x^(i)    Σ_{i=1}^n (x^(i))²  ] [ β̂1 ] = [ Σ_{i=1}^n y^(i) x^(i) ]

By solving this system of equations, we obtain (β̂0, β̂1).

Jorge S. Marques, IST, 2017 35/279


Analytic optimization

To guarantee that a minimum is achieved at this point, we should evaluate the matrix of second derivatives (Hessian matrix)

H = [ ∂²SSE/∂β0²     ∂²SSE/∂β0∂β1 ]  =  2 [ Σ_{i=1}^n 1       Σ_{i=1}^n x^(i)    ]
    [ ∂²SSE/∂β1∂β0   ∂²SSE/∂β1²   ]       [ Σ_{i=1}^n x^(i)   Σ_{i=1}^n (x^(i))² ]

and check if it is positive definite.

Jorge S. Marques, IST, 2017 36/279


Reminder: positive definite matrix

A symmetric n × n matrix M is a positive definite matrix if the scalar z^T M z is positive for every non-zero vector z ∈ R^n.

If M is a positive definite matrix, then

I it is a non-singular matrix (det M ≠ 0);
I all the eigenvalues are real and positive;
I all the leading principal minors are positive. The kth leading principal minor of a matrix M is the determinant of its upper-left k by k sub-matrix.

Jorge S. Marques, IST, 2017 37/279


Exercise

Suppose we remove the average value of the feature x and the outcome y,

x' ← x − x̄,   y' ← y − ȳ,

where x̄, ȳ are the average values computed in the training set.

Show that the least squares estimate of β0 is β̂0 = 0.

This result is useful to simplify the estimation of the β coefficients.

Jorge S. Marques, IST, 2017 38/279


Exercise

Minimize

SSE = Σ_{i=1}^n (y'^(i) − β'_0 − β'_1 x'^(i))²

∂SSE/∂β'_0 = 0 ⇒ −2 Σ_{i=1}^n (y'^(i) − β̂'_0 − β̂'_1 x'^(i)) = 0

Σ_{i=1}^n y'^(i) − n β̂'_0 − β̂'_1 Σ_{i=1}^n x'^(i) = 0

0 − n β̂'_0 − 0 = 0

β̂'_0 = 0

Jorge S. Marques, IST, 2017 39/279


Linear regression model (general case)

Let us extend linear regression to the general case in which we have p features x1, x2, . . . , xp ∈ R. The linear regression model is given by

ŷ = β0 + β1 x1 + · · · + βp xp.

Using vector notation, we obtain

ŷ = [1 x1 . . . xp] [β0 β1 . . . βp]^T = [1 x^T] β,

where x = [x1, x2, . . . , xp]^T ∈ R^p, β = [β0, β1, . . . , βp]^T ∈ R^{p+1}, and ŷ ∈ R.

Jorge S. Marques, IST, 2017 40/279


Problem formulation


Consider a training set T = {(x^(i), y^(i)), i = 1, . . . , n}, where x^(i) ∈ R^p and y^(i) ∈ R, i = 1, . . . , n.

The linear model

f(x) = [1 x^T] β

is trained by finding the vector of coefficients β̂ ∈ R^{p+1} that minimizes the total cost

SSE(β) = Σ_{i=1}^n (y^(i) − f(x^(i)))².

Jorge S. Marques, IST, 2017 41/279


Matrix notation

Adopting matrix notation,

X = [ 1  x1^(1)  . . .  xp^(1) ]        y = [ y^(1) ]
    [ 1  x1^(2)  . . .  xp^(2) ]            [ y^(2) ]
    [ .   ...    . . .   ...   ]            [  ...  ]
    [ 1  x1^(n)  . . .  xp^(n) ]            [ y^(n) ]

X is called the design matrix and y ∈ R^n is the vector of outcomes.

Cost function:

SSE(β) = ||y − ŷ||² = ||y − Xβ||²

where ||z|| = √(z^T z) denotes the Euclidean norm.

Jorge S. Marques, IST, 2017 42/279


Normal equations

The minimization of the SSE cost functional leads to a system of equations known as the normal equations:

(X^T X) β̂ = X^T y

The normal equations have a unique solution iff det(X^T X) ≠ 0.

The normal equations can be derived from the stationarity condition (necessary condition)

∇SSE(β) = 0

Jorge S. Marques, IST, 2017 43/279


Gradient properties

The proof of the normal equations requires two properties of the gradient.

Let f(x) be a scalar function, where x = [x1, . . . , xp]^T is a vector.

The gradient of f is defined by

∇_x f(x) = [∂f/∂x1 . . . ∂f/∂xp]^T.

Useful properties:

inner product: ∇_x (b^T x) = b,   b ∈ R^p,
quadratic form: ∇_x (x^T M x) = (M + M^T) x,   M ∈ R^{p×p}.

Jorge S. Marques, IST, 2017 44/279


Proof
Cost function

SSE = ||y − Xβ||² = (y − Xβ)^T (y − Xβ)          (norm definition)
    = y^T y − y^T Xβ − β^T X^T y + β^T X^T Xβ     (distributive prop.)
    = y^T y − 2 y^T Xβ + β^T X^T Xβ               (transpose prop.)

Computing the gradient and making it equal to zero,

∇_β SSE = −2 X^T y + 2 X^T Xβ = 0,

we conclude

(X^T X) β̂ = X^T y.

The inverse of matrix X^T X may not exist for two main reasons:
I small amount of data, e.g., the number of data points is smaller than the number of features;
I redundant features (linearly dependent), e.g., duplicated features.

Jorge S. Marques, IST, 2017 45/279


Summary of linear regression

Model training

normal equations: (X^T X) β̂ = X^T y
parameter estimates: β̂ = (X^T X)^{−1} X^T y

Prediction

new data: f(x0) = [1 x0^T] β̂
training data: ŷ = X β̂ = X (X^T X)^{−1} X^T y

Attention: the inverse of matrix X^T X does not exist if det(X^T X) = 0.
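As a sanity check, a minimal sketch of these formulas in Python/NumPy (the course language), reusing the flat data from the earlier example; np.linalg.solve is used instead of an explicit inverse:

import numpy as np

# flat data from the earlier example: area (m^2) and price (K euros)
area = np.array([130., 60., 87., 125., 147., 180.])
price = np.array([370., 220., 57., 400., 430., 640.])

X = np.column_stack([np.ones_like(area), area])   # design matrix [1 x]
beta = np.linalg.solve(X.T @ X, X.T @ price)      # normal equations (X^T X) beta = X^T y
y_hat = X @ beta                                  # predictions on the training data
print(beta, np.sum((price - y_hat) ** 2))         # estimated coefficients and SSE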

Jorge S. Marques, IST, 2017 46/279


Gauss Markov Theorem

Let y = Xβ + w, where β is unknown, X is known (deterministic) and w is a realization of a random vector with zero mean and covariance σ² I. Then,

I The least squares estimate of β,

β̂ = (X^T X)^{−1} X^T y,

is unbiased, with covariance matrix σ² (X^T X)^{−1}.

I If β̃ = Py is another unbiased estimator of β, it has a covariance matrix that is equal to or larger than Cov{β̂}¹.

Proof: Hastie et al., Elements of Statistical Learning, Springer, 2009

¹ the inequality A ≥ B means that A − B is a positive semi-definite matrix


Jorge S. Marques, IST, 2017 47/279
Example

Figure shows the least squares fit of a linear model (straight line) to the
flat data.

area price
130  370
60   220
87   57
125  400
147  430
180  640

(figure: least squares straight-line fit to the flat data)

Jorge S. Marques, IST, 2017 48/279


Polynomial model
The linear model is often very rigid, especially when the number of
features is small.

An alternative is the polynomial model (x scalar):

f(x) = β0 + β1 x + · · · + βp x^p.

This can be considered as a linear model whose features are the powers of x.

The model is non-linear in x but linear in the parameters β_i. Therefore, the β coefficients can be obtained by the least squares method described before, leading to a linear set of equations (a code sketch follows).

(figure: polynomial fit to the flat data)
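A short sketch of this reduction to ordinary least squares in Python/NumPy, using the flat data again; np.vander builds the powers of x:

import numpy as np

def polyfit_ls(x, y, p):
    """Least squares fit of a degree-p polynomial: linear LS on features 1, x, ..., x^p."""
    X = np.vander(x, p + 1, increasing=True)   # design matrix with polynomial features
    return np.linalg.solve(X.T @ X, X.T @ y)   # normal equations

area = np.array([130., 60., 87., 125., 147., 180.])
price = np.array([370., 220., 57., 400., 430., 640.])
beta = polyfit_ls(area, price, 3)              # third order polynomial model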

Jorge S. Marques, IST, 2017 49/279


Model order
The polynomial model becomes numerically unstable when we increase the order of the polynomial.

The figure shows polynomial fits for p = 1, 3, 4.

(figure: polynomial fits of orders 1, 3 and 4 to the flat data)

How should we choose the best order? Is the SSE a good criterion?

Jorge S. Marques, IST, 2017 50/279


Radial basis functions: model

This model is based on a sum of radial basis functions (Gaussian local functions) defined by

G_k(x) = exp(−||x − c^(k)||² / (2σ²)),   k = 1, . . . , p,

where c^(k) ∈ R^d is a center vector (aka centroid) to be computed from the data.

The radial basis function model approximates the outcome y by a weighted sum of local basis functions:

f(x) = Σ_{k=1}^p w_k G_k(x).

Jorge S. Marques, IST, 2017 51/279


Radial basis functions: training

The model is estimated as follows. First we estimate the p centroids c_k. This is not done by least squares: the centroids c_k are obtained using a clustering algorithm such as k-means (to be presented later in this course).

Then, the coefficients w = [w_1, . . . , w_p]^T are estimated by least squares,

w = (X^T X)^{−1} X^T y,

where X is given by

X = [ G_1(x^(1))  G_2(x^(1))  . . .  G_p(x^(1)) ]
    [    ...         ...      . . .     ...     ]
    [ G_1(x^(n))  G_2(x^(n))  . . .  G_p(x^(n)) ]

σ² is a hyperparameter chosen by the user or estimated from the data.
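A minimal sketch of this two-stage training in Python; the centroids come from scikit-learn's KMeans here (an assumed choice, any clustering method would do) and σ is fixed by hand:

import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, sigma):
    """Matrix of basis values G_k(x^(i)) = exp(-||x - c^(k)||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_train(X, y, p, sigma):
    centers = KMeans(n_clusters=p, n_init=10).fit(X).cluster_centers_   # stage 1: centroids
    G = rbf_design(X, centers, sigma)
    w = np.linalg.solve(G.T @ G, G.T @ y)                               # stage 2: least squares
    return centers, w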

Jorge S. Marques, IST, 2017 52/279


Example - radial basis functions

(figures: the radial basis functions (left) and the regression results (right))

(centroids are equally spaced instead of being computed from the data)

Jorge S. Marques, IST, 2017 53/279


Regression with multiple outputs

Suppose that we have multiple outputs y1, . . . , yK ∈ R, each of them approximated by a linear model with coefficients β_k ∈ R^{p+1}:

y_k = X β_k + w_k,   k = 1, . . . , K.

The SSE for the multiple outputs is the sum of the SSE for each output,

SSE = Σ_{k=1}^K SSE_k(β_k).

Each regression problem can be independently solved, i.e., β̂_k can be obtained by minimizing SSE_k(β_k):

(X^T X) β̂_k = X^T y_k.

Jorge S. Marques, IST, 2017 54/279


Regression with multiple outputs

The problem can also be formulated using matrix notation

Y = Xβ + W,

where Y = [y1 . . . yK] ∈ R^{n×K}, β = [β1 . . . βK] ∈ R^{p×K}, W = [w1 . . . wK] ∈ R^{n×K}.

The minimization of the sum of squared errors criterion

SSE(β) = tr{(Y − Xβ)^T (Y − Xβ)}

leads to

β̂ = (X^T X)^{−1} X^T Y.

This is equivalent to independently solving each of the K least squares problems sharing the same design matrix X.

tr{} denotes the trace of a matrix (sum of the diagonal elements).

Jorge S. Marques, IST, 2017 55/279


Exercises


1. Consider a training set T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}. Write the normal equations for a least squares fit of a second order polynomial model (x scalar) to the training data.

2. Repeat the previous problem, assuming 2D features x ∈ R².

3. What happens to the predicted outcome ŷ, estimated by least squares (without offset), if the observed features are scaled, i.e., x'^(i) = D x^(i), i = 1, . . . , n, where D is a diagonal matrix?

Jorge S. Marques, IST, 2017 56/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 57/279


Motivation

A linear model can be estimated by minimizing the least squares criterion in the training set,

||y − Xβ||²,

where

X = [ x1^(1)  x2^(1)  . . .  xp^(1) ]      y = [ y^(1) ]      β = [ β1 ]
    [ x1^(2)  x2^(2)  . . .  xp^(2) ]          [ y^(2) ]          [ β2 ]
    [  ...     ...    . . .   ...   ]          [  ...  ]          [ ... ]
    [ x1^(n)  x2^(n)  . . .  xp^(n) ]          [ y^(n) ]          [ βp ]

We removed the mean x̄, ȳ from the data in order to make β0 = 0. The vector β does not include β0 and X does not include a column of ones.

Jorge S. Marques, IST, 2017 58/279


Drawbacks

The LS approach leads to the normal equations

(X^T X) β̂ = X^T y.

If (X^T X) is a singular matrix, the least squares estimate is not unique: infinitely many solutions are available.

Example
Suppose we wish to estimate the model from a single example.

ŷ = x1 β1 + x2 β2

x1 x2 y
1  1  3
Jorge S. Marques, IST, 2017 59/279


Example (cont.)

We obtain 2 parameters and 1 constraint, leading to infinitely many solutions

β1 + β2 = 3.

How can we solve this difficulty?

By adding a new constraint: minimizing the squared norm of the coefficients

β1² + β2²

This is a measure of ”model complexity”.

Jorge S. Marques, IST, 2017 60/279


Ridge regression

An alternative criterion is ridge regression,

β̂_ridge = arg min_β ||y − Xβ||² + λ||β||²,

where ||·|| denotes the Euclidean norm. The new term ||β||² penalizes the use of large coefficients and is denoted the regularization term. This criterion aims to represent the data while keeping the coefficients small.

λ represents the trade-off between the two objectives.

Furthermore, we assume that the training data X, y have zero mean. The coefficient β0 is not usually included in the regularization term.

Jorge S. Marques, IST, 2017 61/279


Ridge regression

The ridge problem can be rewritten as a constrained optimization problem

β̂_ridge = arg min_β ||y − Xβ||²,   s.t.   ||β||² ≤ τ.

There is a one-to-one correspondence between the values of τ and λ.

The ridge regression can be solved by computing the gradient vector and making it equal to zero, leading to

β̂_ridge = (X^T X + λI)^{−1} X^T y.

The matrix (X^T X + λI), with λ > 0, is always non-singular, even if (X^T X) is singular.
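A one-function sketch of this closed form in Python/NumPy, assuming X and y have already been centered as discussed above:

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)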

Jorge S. Marques, IST, 2017 62/279


Proof

Cost function

E_ridge = ||y − Xβ||² + λ||β||²
        = (y − Xβ)^T (y − Xβ) + λ β^T β
        = y^T y − 2 y^T Xβ + β^T X^T Xβ + λ β^T β
        = y^T y − 2 y^T Xβ + β^T (X^T X + λI) β.

Computing the gradient and making it equal to zero,

∇_β E_ridge = −2 X^T y + 2 (X^T X + λI) β = 0,

we conclude

(X^T X + λI) β̂_ridge = X^T y.

Jorge S. Marques, IST, 2017 63/279


Exercise

Find the relationship between the eigenvectors and eigenvalues of the LS matrix (X^T X) and the ridge matrix (X^T X + λI).

Try to solve it by yourself.

Jorge S. Marques, IST, 2017 64/279


Tentative solution

LS: (X^T X) v_ls = λ_ls v_ls ⇒ (X^T X − λ_ls I) v_ls = 0

Ridge: (X^T X + λI) v_ridge = λ_ridge v_ridge

(X^T X + λI − λ_ridge I) v_ridge = 0

Comparing,

λ_ridge = λ_ls + λ
v_ridge = v_ls

Conclusion:
The eigenvectors are equal and the eigenvalues are shifted by λ. If λ > 0, the eigenvalues of the ridge matrix are positive and the matrix is non-singular.

Jorge S. Marques, IST, 2017 65/279


The Lasso

Another alternative is the lasso.

Lasso regression aims to minimize the sum of squared errors (with β0 = 0),

min_β ||y − Xβ||²,

with a different constraint on the coefficients, one that penalizes large coefficients less:

Σ_{j=1}^p |β_j| ≤ τ.

This constraint can be expressed in terms of the ℓ1 norm: ||β||_1 ≤ τ.

Since we are dealing with two norms, the ℓ2 (Euclidean) norm will be denoted by ||·||_2 and the ℓ1 norm by ||·||_1.

Jorge S. Marques, IST, 2017 66/279


The Lasso

The Lagrangian formulation is given by

β̂_lasso = arg min_β ||y − Xβ||²_2 + λ||β||_1,

where the last term can be interpreted as a regularization term.

This optimization problem cannot be solved by a linear system of equations as before. In fact, we have to resort to convex optimization methods to numerically solve this problem.
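For illustration, a hedged sketch using scikit-learn's Lasso solver (one option among several; the CVX package on the next slide is another). Note that sklearn scales the data term by 1/(2n), so its alpha is related to, but not numerically identical to, the λ above. The synthetic data mimics the example coefficients β = [1 0.5 0 0 0] used later:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                       # zero-mean features
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1, fit_intercept=False)           # alpha plays the role of lambda
model.fit(X, y)
print(model.coef_)   # coefficients of unimportant features tend to be exactly zero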

Jorge S. Marques, IST, 2017 67/279


CVX software package

Jorge S. Marques, IST, 2017 68/279


Sparse solutions

We often wish to find sparse solutions for β (with some zero coefficients), which corresponds to selecting only a subset of features (feature selection).

The problem can be formulated as

β̂_sparse = arg min_β ||y − Xβ||²_2 + λ · (number of non-zero coefficients),

where the number of non-zero coefficients is often called the ”ℓ0 norm”, ||·||_0, although it does not verify the axioms of a norm.

Regularization with the ℓ0 norm is difficult to solve numerically. However, the solution is often well approximated by lasso regression, which also leads to sparse solutions in many problems: when a feature is not important, the corresponding coefficient is made equal to zero.

Jorge S. Marques, IST, 2017 69/279


Feature Selection

The lasso estimate β̂_lasso is often a sparse vector of coefficients, where less important features receive a zero coefficient.

This can be interpreted as a feature selection operation. Since unimportant features are removed, the other ones are better estimated.

Jorge S. Marques, IST, 2017 70/279


Example: lasso vs ridge regression
Regression problem with a subset of features uncorrelated with the outcome. The training data was generated by y = x^T β + w, where w ∼ N(0, σ²), x ∼ N(0, I) and β = [1 0.5 0 0 0].

Estimates obtained by least squares (horizontal lines), ridge regression and lasso, as a function of λ.

(figures: coefficient paths vs. λ for ridge regression (left) and lasso (right))
Jorge S. Marques, IST, 2017 71/279
Non-centered data

How should we proceed if the training data T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))} are not centered?

1. pre-processing: x'^(i) = x^(i) − x̄, y'^(i) = y^(i) − ȳ (x̄, ȳ are the average values computed in the training set);

2. estimate a linear model without intercept: estimate the model y' = x'^T β', β' ∈ R^p, using the pre-processed data T' = {(x'^(1), y'^(1)), . . . , (x'^(n), y'^(n))} and regularization;

3. invert the pre-processing: β̂ = [β̂0 β̂'^T]^T, where β̂0 = ȳ − x̄^T β̂'.

The Matlab commands ridge and lasso perform all three steps.

Jorge S. Marques, IST, 2017 72/279


Exercises

1. Does a linear regressor, estimated by the least squares method, depend on the scale of the features? What happens if we use ridge regression instead?

2. Suppose we wish to predict a variable y ∈ R using a single feature x ∈ R (without intercept). Given a training set T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))} and assuming that the feature x is normalized,

(1/n) Σ_{i=1}^n (x^(i))² = 1,

find the ridge and lasso coefficients, β̂_ridge, β̂_lasso, as a function of the least squares coefficient, β̂_ls, and plot them.

Try to solve it by yourself.

Jorge S. Marques, IST, 2017 73/279


Tentative solution

1. Suppose we multiply the features by a scale factor, X' = sX, where s is the scale factor. Then, the vector of coefficients becomes

β'_ls = (X'^T X')^{−1} X'^T y = (s² X^T X)^{−1} s X^T y = s^{−2} (X^T X)^{−1} s X^T y

β'_ls = s^{−1} β_ls.

LS predictor: ŷ' = x'^T β' = s x^T s^{−1} β_ls = x^T β_ls = ŷ, i.e., the LS predictor is invariant under scaling.

The ridge coefficients are the solution of (X'^T X' + λI) β'_ridge = X'^T y.

The matrix (X'^T X' + λI) has two terms: one that depends on the scale and another that does not. Therefore, the ridge predictor is not invariant to scale.

Jorge S. Marques, IST, 2017 74/279


Tentative solution
2.

LS: β̂_ls = (X^T X)^{−1} X^T y = (1/n) X^T y

Ridge: β̂_ridge = (X^T X + λI)^{−1} X^T y = (1/(n + λ)) X^T y

β̂_ridge = (n/(n + λ)) β̂_ls

Lasso: min_β ||y − Xβ||²_2 + λ||β||_1

min_β (y^T y − 2 y^T Xβ + β² X^T X) + λ|β|

Hypothesis β̂ > 0: d(. . .)/dβ = 0 ⇒ −2n β̂_ls + 2n β̂_lasso + λ = 0

β̂_lasso = β̂_ls − λ/(2n)

The same should be repeated for β̂_lasso < 0.

These relationships should be graphically represented.
Jorge S. Marques, IST, 2017 75/279
Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 76/279


Optimization
Linear regression boils down to minimizing

SSE = ||y − Xβ||²,

which can be analytically solved.

Most regression / classification problems involve the solution of an optimization problem,

θ̂ = arg min_θ J(θ),

where J : R^p → R is the cost function and θ ∈ R^p denotes the model parameters.

In most cases, this cannot be analytically solved and we must rely on numerical (iterative) optimization algorithms that deliver approximate values for the parameters.

Jorge S. Marques, IST, 2017 77/279


Optimization methods

Optimization methods require different types of information


1. function values: J(θ),
2. first derivatives (gradient vector): ∇θ J,
3. second derivatives (Hessian matrix): H.

Jorge S. Marques, IST, 2017 78/279


Global and Local minima
A function J : R^p → R has a local minimum at θ* ∈ R^p if there is an ε > 0 such that

||θ − θ*|| < ε ⇒ J(θ) ≥ J(θ*)

where
I J(θ*) is called a local minimum,
I θ* is called a local minimizer.

A function J : R^p → R has a global minimum at θ* ∈ R^p if for all θ ∈ R^p

J(θ) ≥ J(θ*)

where
I J(θ*) is called a global minimum,
I θ* is called a global minimizer.

Jorge S. Marques, IST, 2017 79/279


Global and Local minima

We are interested in the global minimum, but most algorithms get trapped in local minima (left), if they exist, and may not converge to the global minimum.

Notice that if the function is convex (right), it has no more than one minimum.

Jorge S. Marques, IST, 2017 80/279


Gradient descent -1D case

If θ is a scalar, the derivative of J(θ) conveys information about the slope of the function to be minimized.

If we move a small amount in the opposite direction of the derivative, the function decreases:

θ^(t+1) = θ^(t) − η (dJ/dθ)(θ^(t)),

where η controls the displacement of the point θ^(t) and is known as the step size or learning step.

The process starts with an initial guess θ^(0).

Jorge S. Marques, IST, 2017 81/279


Gradient descent - vector case

If θ ∈ R^p and J(θ) is a differentiable function in a neighborhood of a point θ^(t), then J(θ) decreases fastest if we move along the opposite direction of the gradient:

θ^(t+1) = θ^(t) − η ∇_θ J(θ^(t)).

This procedure is repeated until the function stops decreasing, meaning that we are in the vicinity of a local minimum or in a plateau. This algorithm is called the gradient descent or steepest descent algorithm.
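A minimal sketch of this update loop in Python; the quadratic J(x1, x2) = x1² + 10x2² from the later examples is used as a test function, and the step size and iteration count are arbitrary illustrative choices:

import numpy as np

def grad_J(theta):
    """Gradient of the quadratic J(x1, x2) = x1^2 + 10*x2^2 used in the later examples."""
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([2.0, 1.0])    # initial guess theta^(0)
eta = 0.05                      # step size (learning step)
for t in range(100):
    theta = theta - eta * grad_J(theta)   # move against the gradient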

Jorge S. Marques, IST, 2017 82/279


Gradient descent - vector case

Another way to motivate the gradient algorithm is based on the first order approximation of the cost function J(θ^(t) + ∆), using a Taylor series expansion,

J(θ^(t) + ∆) ≈ J(θ^(t)) + ∇_θ J(θ^(t))^T ∆,

valid for a small displacement ∆.

If we make ∆ = −η ∇_θ J(θ^(t)), we obtain

J(θ^(t) + ∆) ≈ J(θ^(t)) − η ||∇_θ J(θ^(t))||²,

which corresponds to a decrease of the cost function.

The choice of η is a difficult aspect of the algorithm. The first order approximation becomes invalid if η is too ”large”.

Jorge S. Marques, IST, 2017 83/279


Choice of step size η

The choice of step size involves a trade-off and is often made by trial-and-error. If η is too small, the update process becomes very slow.

(figures: isotropic valley vs. narrow valley)

On the contrary, if η is too large, the algorithm may skip a local minimum or produce an update of θ that increases the objective function J(θ).

Acceleration techniques can be used to speed up convergence, e.g., adaptive step size and the momentum technique.

Jorge S. Marques, IST, 2017 84/279


Momentum technique

This method performs a lowpass filtering of the gradient sequence and updates θ^(t+1) using the filtered gradient v^(t+1):

v^(t+1) = α v^(t) − η ∇_θ J(θ^(t))
θ^(t+1) = θ^(t) + v^(t+1).

The parameter α (pole) typically ranges from 0.5 to 0.95.

This technique improves the convergence rate, especially if the cost function J(θ) exhibits deep valleys, in which the gradient method is slow.

The objective function J is evaluated in each iteration to check if it decreases as expected. If not, the memory of the momentum term is set to zero.

L. Almeida, Multilayer Perceptrons, in Handbook of Neural Computation, 1997.

Jorge S. Marques, IST, 2017 85/279


Nesterov accelerated gradient

This method is similar to the momentum technique, but it computes the gradient at a different position: it computes an approximate position for the parameters in the next iteration and evaluates the gradient there (look-ahead).

v^(t+1) = α v^(t) − η ∇_θ J(θ^(t) + α v^(t))
θ^(t+1) = θ^(t) + v^(t+1).

This algorithm performs better than the momentum technique in many problems.

Sutskever, Martens, Dahl, Hinton, On the importance of initialization and momentum in deep
learning, 2013.
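A sketch of the two update rules in Python on the same quadratic test function as the previous block; switching the gradient evaluation point between θ and θ + αv switches between momentum and Nesterov (α and η values are illustrative):

import numpy as np

grad_J = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # gradient of x1^2 + 10*x2^2

alpha, eta = 0.9, 0.02
theta, v = np.array([2.0, 1.0]), np.zeros(2)
for t in range(100):
    # momentum would use grad_J(theta); Nesterov evaluates the gradient at the
    # look-ahead point theta + alpha * v
    v = alpha * v - eta * grad_J(theta + alpha * v)
    theta = theta + v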

Jorge S. Marques, IST, 2017 86/279


Adaptive step size (Almeida & Silva)
This method assumes that the step size η is different for each component of θ and changes in each iteration. Therefore,

θ_i^(t+1) = θ_i^(t) − η_i^(t) (∂J/∂θ_i)(θ^(t)).

Step size update:

η_i^(t) = u η_i^(t−1)   if (∂J/∂θ_i)(θ^(t)) · (∂J/∂θ_i)(θ^(t−1)) > 0,
η_i^(t) = d η_i^(t−1)   otherwise.

Typical values for the parameters: u = 1.2, d = 0.8. This technique performs very well if the cost function contains valleys aligned with the axes.

The objective function J is evaluated in each iteration to check if it decreases as expected. If not, the previous values of the parameters are kept and the step sizes are reduced.
Jorge S. Marques, IST, 2017 87/279
Newton method

The Newton method assumes that we know not only the gradient vector ∇_θ J(θ) but also the matrix of second derivatives (Hessian matrix):

∇_θ J = [∂J/∂θ1 . . . ∂J/∂θd]^T

H = [ ∂²J/∂θ1²      ∂²J/∂θ1∂θ2   . . .  ∂²J/∂θ1∂θd ]
    [ ∂²J/∂θ2∂θ1    ∂²J/∂θ2²     . . .  ∂²J/∂θ2∂θd ]
    [    ...            ...      . . .     ...     ]
    [ ∂²J/∂θd∂θ1    ∂²J/∂θd∂θ2   . . .  ∂²J/∂θd²   ]

and requires the inversion of H in each iteration.

Jorge S. Marques, IST, 2017 88/279


Newton method

Given a guess θ^(t), we can approximate the cost function J(θ) by the 2nd order Taylor expansion

J(θ^(t) + ∆) ≈ J(θ^(t)) + ∇_θ J(θ^(t))^T ∆ + (1/2) ∆^T H(θ^(t)) ∆,

where ∆ is a small displacement vector.

Minimization of J(θ^(t) + ∆) with respect to the displacement ∆ can be achieved through the necessary condition

∇_∆ J(θ^(t) + ∆) = 0

Jorge S. Marques, IST, 2017 89/279


Newton method

Necessary condition for optimality:

∇_∆ J(θ^(t) + ∆) = 0
∇_∆ [J(θ^(t)) + ∇_θ J(θ^(t))^T ∆ + (1/2) ∆^T H(θ^(t)) ∆] = 0
∇_θ J(θ^(t)) + H(θ^(t)) ∆ = 0
∆ = −[H(θ^(t))]^{−1} ∇_θ J(θ^(t))

Therefore,

θ^(t+1) = θ^(t) − [H(θ^(t))]^{−1} ∇_θ J(θ^(t))

The Newton method gives an exact solution for the parameters if J is a quadratic function.
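A sketch of one Newton step in Python for the quadratic J(x1, x2) = x1² + 10x2² used in the examples; since J is quadratic, a single step lands on the minimizer:

import numpy as np

grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])  # gradient of x1^2 + 10*x2^2
H = np.array([[2.0, 0.0],
              [0.0, 20.0]])                              # Hessian (constant for a quadratic)

theta = np.array([2.0, 1.0])
theta = theta - np.linalg.solve(H, grad(theta))          # one Newton step -> [0, 0]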

Jorge S. Marques, IST, 2017 90/279


Example 1 - gradient vs Newton method

Quadratic function: J(x1, x2) = x1² + x2²

(figures: surface and contour plots)

10 iterations of gradient descent (blue) and 1 iteration of the Newton method (red).

Jorge S. Marques, IST, 2017 91/279


Example 1 - gradient, momentum, Nesterov, Newton

Quadratic function: J(x1, x2) = x1² + x2²

(figures: contour plot (left) and cost function per iteration (right))

Left: 10 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 92/279


Example 2 - gradient vs Newton method

Quadratic function: J(x1, x2) = x1² + 10x2²

(figures: surface and contour plots)

10 iterations of gradient descent (blue) and 1 iteration of the Newton method (red).

Jorge S. Marques, IST, 2017 93/279


Example 2 - gradient, momentum, Nesterov, Newton

Quadratic function: J(x1, x2) = x1² + 10x2²

(figures: contour plot (left) and cost function per iteration (right))

Left: 10 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 94/279


Example 3 - gradient vs Newton method
Rosenbrock function with constant 10, instead of 100, to simplify the problem: J(x1, x2) = (1 − x1)² + 10(x2 − x1²)²

(figures: surface and contour plots)

300 iterations of gradient descent (blue) and 5 iterations of the Newton method (red).

Jorge S. Marques, IST, 2017 95/279


Example 3 - gradient, momentum, Nesterov, Newton

Rosenbrock function with constant 10, instead of 100, to simplify the problem: J(x1, x2) = (1 − x1)² + 10(x2 − x1²)²

(figures: contour plot (left) and cost function per iteration (right))

Left: 100 iterations of gradient descent (green), gradient with momentum (blue), Nesterov accelerated gradient (red) and Newton method (cyan). Right: cost function. Newton outside scope.

Jorge S. Marques, IST, 2017 96/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 97/279


Supervised learning

We wish to predict an outcome y, given a vector of features x ∈ R^p.

Predictor (model):

ŷ = f(x, θ).

The parameters of the model, θ, are estimated from a training set

T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}.

But, learned systems are not perfect. The output of a learned system is
not always the desired output.

Learning systems need to be evaluated.

Jorge S. Marques, IST, 2017 98/279


Example: regressor

How do we measure the performance of a regressor?


Polynomial fits (order 1, 3, and 4).

Jorge S. Marques, IST, 2017 99/279


Example: classifier

How do we measure the performance of a classifier?

Training set (left) and predicted classes (right) using the k nearest
neighbor method.

Jorge S. Marques, IST, 2017 100/279


Loss function

If the desired output, y, is different from the predicted outcome, ŷ = f(x), we define a loss L(y, ŷ), e.g.,

regression: L(y, ŷ) = (y − ŷ)²

classification: L(y, ŷ) = 0 if y = ŷ, 1 otherwise,
or
L(y = i, ŷ = j) = L_ij, with the diagonal terms (no error) equal to zero.

In the classification problem, the second loss function is more flexible, since it may assign different penalties to different kinds of errors.

Jorge S. Marques, IST, 2017 101/279


Risk

If x, y are realizations of two random variables, it makes sense to define the expected (average) value of the loss, also known as the risk:

R = E{L(y, ŷ(x))}

In the case of regression, the risk would be

R = ∫∫ L(y, ŷ(x)) p(x, y) dx dy

This requires the joint distribution of the input and output, p(x, y), which is usually unknown.

Jorge S. Marques, IST, 2017 102/279


Risk

In the case of classification problems, the risk would be

R = Σ_y Σ_ŷ L(y, ŷ) P(y, ŷ)

This requires the joint distribution of the true and predicted class, P(y, ŷ), which is usually unknown.

Jorge S. Marques, IST, 2017 103/279


Empirical risk

Since the risk cannot be computed in most problems, we can replace the expected value by an average of the loss computed with the training data,

R_e = (1/n) Σ_{i=1}^n L(y^(i), f(x^(i))).

This is called the empirical risk.

The empirical risk is often used to train the predictor. However, is it a good criterion to evaluate the system?
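As a minimal sketch, this average translates directly into Python; the function names and signatures below are illustrative, not from the slides:

import numpy as np

def empirical_risk(f, loss, X, y):
    """Average loss of predictor f over a data set (X, y)."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

squared_loss = lambda y, y_hat: (y - y_hat) ** 2   # the regression loss from the previous slides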

Jorge S. Marques, IST, 2017 104/279


Example: polynomial fit
Polynomial fits of order 1, 3, 4: which model is the best?

(figure: polynomial fits to the flat data)

The empirical risk of the fourth order polynomial is the smallest. But is this the best model?

The model order is often considered as a hyperparameter.

Jorge S. Marques, IST, 2017 105/279


Generalization
We want to measure the performance of the system with new data. This property is known as generalization.

To evaluate the generalization of a system, we should consider an independent data set

T' = {(x'^(i), y'^(i)), i = 1, . . . , n'}

and evaluate the model on it:

R'_e = (1/n') Σ_{i=1}^{n'} L(y'^(i), f(x'^(i)))

Important questions:
I is R_e (computed in the training set) a good estimate of R'_e (computed in an independent set)?
I can R_e or R'_e be used to choose the model hyperparameters (e.g., the polynomial order)?
Jorge S. Marques, IST, 2017 106/279
Evaluation of polynomial order

Average loss in the training set (blue) and in an independent set (red), as
a function of polynomial degree (hyperparameter).
Conclusions:
I The evaluation in the training set is too optimistic. An independent
data set is mandatory to obtain a reliable evaluation.
I The use of an independent data set allows the choice of model
hyperparameters (polynomial degree).
Jorge S. Marques, IST, 2017 107/279
Evaluation of k Nearest neighbor


Percentage of classification error in the training set (blue) and in an independent set (red), as a function of 1/k.

The conclusions are the same as before.

Jorge S. Marques, IST, 2017 108/279


Overfitting

When there is a large difference between the evaluation of the model in the training set and in an independent set, it means that the model is too specialized in representing the training data and performs much worse with new data.

This phenomenon is known as overfitting.

Jorge S. Marques, IST, 2017 109/279


Summary (until now)

To estimate a model and evaluate it, the use of 2 independent data sets is recommended: a training set and a test set.

To estimate a model, select the hyperparameters and evaluate the selected model, the use of 3 independent data sets is recommended: a training set, a validation set, and a test set.

More sophisticated techniques are available (cross validation, leave-one-out).

Jorge S. Marques, IST, 2017 110/279


Model training and testing (without hyperparameters)

In most learning problems, we need to train and evaluate the model. These operations should be done using two independent data sets, known as the training set and the test set.

This can be written in pseudocode using the functions f = train(T) and P = perform(f, T'):

Data: training set T and test set T'.
f = train(T);
P = perform(f, T');

Algorithm 1: Training and testing of a model

Jorge S. Marques, IST, 2017 111/279


Hyperparameter selection
If we need to choose the values of hyperparameters ξ (e.g., the polynomial degree), this should be done using a third independent set known as the validation set.

This can be written in pseudocode using the functions f = train(T, ξ) and P = perform(f, T'):

Data: training set T, validation set Tv and test set T'.
Result: Select hyperparameters ξ and evaluate the model.
for all values of ξ do
    f = train(T, ξ);
    P(ξ) = perform(f, Tv);
end
ξ̂ = arg min_ξ P(ξ);
f = train(T ∪ Tv, ξ̂);
P = perform(f, T');

Algorithm 2: Training, optimization and testing of a model

This method requires a lot of data.


Jorge S. Marques, IST, 2017 112/279
Cross-validation
Cross validation is a very useful technique when we do not have a large amount of data. The data set is split into K folds T_k (subsets with the same number of examples). One fold is used for testing and the others for training. The test fold then rotates, K times in total.

Data: K folds T_k.
for k = 1, . . . , K do
    f = train(T \ T_k);
    P_k = perform(f, T_k);
end
P = P̄_k

Algorithm 3: Cross validation without hyperparameters. The bar denotes the average over all the folds.

The final score is a combination of the evaluations of K models. The method does not produce a final classifier/regressor (a code sketch of this procedure follows).
Question: if we need one, what should we do?
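A minimal sketch of Algorithm 3 in Python, assuming generic train and perform functions with the signatures used in the pseudocode:

import numpy as np

def cross_validate(folds, train, perform):
    """Rotate the test fold over the K folds and average the scores."""
    scores = []
    for k, test_fold in enumerate(folds):
        train_data = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        f = train(train_data)                  # fit on T \ T_k
        scores.append(perform(f, test_fold))   # evaluate on the held-out fold T_k
    return np.mean(scores)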

Jorge S. Marques, IST, 2017 113/279


Cross-validation with hyperparameters (nested)
Cross validation can be extended to account for the estimation of hyperparameters. The data is again divided into K folds, and two of them are used for validation and testing.

Data: K folds T_k.
for i = 1, . . . , K do
    for all values of ξ do
        for j ≠ i do
            f = train(T \ (T_i ∪ T_j), ξ);
            P(ξ)_j = perform(f, T_j);
        end
        P(ξ) = P̄(ξ)_j
    end
    ξ̂_i = arg min_ξ P(ξ);
    f = train(T \ T_i, ξ̂_i);
    P_i = perform(f, T_i);
end
P = P̄_i

Algorithm 4: Cross validation with hyperparameters (nested).
Jorge S. Marques, IST, 2017 114/279
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 115/279


History

MIT

I The human brain has been a source of inspiration in Computer Science.
I The brain has a large number of processing units (∼ 10^11 neurons) that are slow (∼ 1 ms) but highly connected. Although they are slow, the neurons are able to perform very complex tasks, e.g., visual tasks, in real time and almost effortlessly.
I One of the first models of a neuron was proposed by McCulloch & Pitts in the 40s and was a starting point for an exciting area: artificial neural networks (ANN).
Jorge S. Marques, IST, 2017 116/279
Neuron

Wikipedia

The neuron is a cell consisting of dendrites (inputs), a soma (cell body) and an axon (output).

It receives input signals through its dendrites. These signals are combined in the soma and, from time to time, an electric impulse is generated that travels through the axon and influences other cells.

Jorge S. Marques, IST, 2017 117/279


McCulloch & Pitts model

The neuron model proposed by McCulloch & Pitts (1943) has a linear part, followed by a nonlinearity:

weighted sum of inputs (activation):

s = [1 x^T] w = x̃^T w

output (Heaviside function):

ŷ = z = g(s) = 1 if s ≥ 0, 0 otherwise

The weighted sum s is called the activation, the nonlinear function g : R → R is known as the activation function, and the vector w = [w0 . . . wp]^T is the weight vector.

Jorge S. Marques, IST, 2017 118/279


Rosenblatt algorithm

Rosenblatt proposed an iterative algorithm in the 50s to train the weights of the McCulloch & Pitts unit for the prediction of binary outcomes (a code sketch follows the algorithm).

Rosenblatt algorithm
1. training set: T = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, with x^(k) ∈ R^p, y^(k) ∈ {0, 1};
2. initialization: randomly initialize the weights w_i(0), i = 0, . . . , p;
3. new training example: present a new training pattern (x(t), y(t)) to the model and compute the model output ŷ(t) = g(x̃^T(t) w(t − 1));
4. update: update the weights according to

w_i(t) = w_i(t − 1) + η x̃_i(t) ε(t),   ε(t) = y(t) − ŷ(t),

where y(t) is the desired outcome for the input x(t);
5. cycle: return to step 3, until a stop condition is met.
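A minimal sketch of these steps in Python, assuming a Heaviside activation and sweeping the training set in order; the function name and defaults are illustrative:

import numpy as np

def rosenblatt_train(X, y, eta=0.1, epochs=100, seed=0):
    """Rosenblatt training of a single McCulloch & Pitts unit (binary outcomes)."""
    rng = np.random.default_rng(seed)
    Xt = np.column_stack([np.ones(len(X)), X])     # augmented patterns x~ = [1 x^T]
    w = rng.standard_normal(Xt.shape[1])           # step 2: random initialization
    for _ in range(epochs):                        # step 5: cycle until stop condition
        for xt, yt in zip(Xt, y):                  # step 3: present a training pattern
            y_hat = 1.0 if xt @ w >= 0 else 0.0    # Heaviside output
            w = w + eta * xt * (yt - y_hat)        # step 4: update the weights
    return w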

Jorge S. Marques, IST, 2017 119/279


Example: Gaussian data

Two trials of the Rosenblatt algorithm applied to the same linearly separable data.

The training data is the same, but the outcome is different in each experiment (why?)

Jorge S. Marques, IST, 2017 120/279


Pros & cons

Pros
It can be proved that the Rosenblatt algorithm solves any binary problem in a finite number of iterations, provided the training data can be separated by a hyperplane in feature space.

Cons
It does not provide a hint on how to deal with data that cannot be separated by a hyperplane, or with regression problems that are not binary.

Most practical problems are noisy and fit into one of these categories. Therefore, a single unit trained by the Rosenblatt algorithm is seldom useful in practice.

Jorge S. Marques, IST, 2017 121/279


Multilayer perceptron
To overcome the previous limitations of a single unit, three important changes were proposed:
I architectures with multiple units, usually organized in layers, known as the multilayer perceptron (MLP);
I continuous and differentiable activation functions;
I training based on the minimization of a cost function.

For the sake of simplicity, the activation function of each unit is not explicitly represented (but it exists!). Offsets are not shown.
Jorge S. Marques, IST, 2017 122/279
Weights

Each unit i is connected to a unit j of the next layer through a weight w_ij.

unit j (layer ℓ = 1):   s_j = w_0j + Σ_{i∈input} w_ij x_i,   z_j = g(s_j)

unit j (layer ℓ > 1):   s_j = w_0j + Σ_{i∈previous layer} w_ij z_i,   z_j = g(s_j)

g(·) is the activation function and the weight w_0j is called the offset.
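A minimal sketch of these per-layer equations in Python, assuming the weights are stored as one matrix W and one offset vector w0 per layer (an organizational choice, not from the slides):

import numpy as np

def mlp_forward(x, weights, offsets, g):
    """Forward pass: for each layer, s = w0 + W z and z = g(s)."""
    z = x
    for W, w0 in zip(weights, offsets):   # one weight matrix / offset vector per layer
        z = g(w0 + W @ z)
    return z

relu = lambda s: np.maximum(0.0, s)       # an example activation function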

Jorge S. Marques, IST, 2017 123/279


Visible and hidden units

The units of the last layer are considered visible. They are the output of the network, and we denote their outputs by ŷ_i.

The units of the other layers are considered hidden, since we do not know their desired values in the training phase. They are intermediate variables used to compute the network output.
Jorge S. Marques, IST, 2017 124/279


Activation function

The activation functions should be continuous and differentiable to allow the evaluation of the influence of weight changes on the network output.

Some common choices are:

sigmoid (logistic function): g(s) = 1 / (1 + e^{−s})

sigmoid (arctangent function): g(s) = arctan s

(figures: plots of the two sigmoids)

Jorge S. Marques, IST, 2017 125/279


Activation function

linear unit: g(s) = s

ReLU, rectified linear unit (recent): g(s) = max(0, s)

(figures: plots of the two activation functions)

The ReLU is currently the recommended activation function, since it does not saturate, which makes the convergence of the gradient algorithm faster. Other units (linear, softmax) are often used in the output layer.

Jorge S. Marques, IST, 2017 126/279


Exercises
1. Compute the derivative of the activation functions:
I g(s) = 1 / (1 + e^{−s})
I g(s) = arctan s
I g(s) = s
I g(s) = max(0, s)

2. Write the equations for the network.

Jorge S. Marques, IST, 2017 127/279


Architecture & weights

To specify a multilayer perceptron, we need to indicate the architecture:
I number of layers
I number of units per layer

We also need to indicate:
I activation functions
I weights

w = {w_ij},

where w_ij is the weight connecting the output of unit i to unit j.

The network is thus a nonlinear map ŷ = f(x, w) from the input space to the output space, controlled by a set of weights w.

Jorge S. Marques, IST, 2017 128/279


How do we choose the architecture?

How should we choose the number of layers?

Cybenko (1989) proved that a multilayer perceptron with 1 hidden layer is a universal approximator of any continuous function defined on a compact subset of R^p. This is a useful theorem, but it does not explain how many units are needed nor how the weights should be chosen.

I Common practice shows that it is often better to use more layers, since the network can synthesize a wider variety of nonlinear functions with fewer units.
I It also shows that deeper networks (with more layers) are more difficult to train.

Great improvements were achieved in the last 10 years in the training of deep neural networks. The state of the art in many problems (vision, speech, text processing) is now based on neural networks.

Jorge S. Marques, IST, 2017 129/279


Perceptron training
After choosing the NN architecture, we need to learn all the weights, using a training set of labeled patterns T = {(x^(k), y^(k)), k = 1, . . . , n}.

Goal: minimize the total loss (cost)

C = Σ_{k=1}^n L(y^(k), ŷ^(k)) = Σ_{k=1}^n L^(k),

where ŷ^(k) is the network output for the input x^(k). A typical choice for the loss function is the quadratic error

L(y, ŷ) = ||y − ŷ||².

The minimization of C is often achieved by using the gradient algorithm

w_ij(t + 1) = w_ij(t) + ∆w_ij(t),   ∆w_ij(t) = −η ∂C/∂w_ij |_{w(t)},

or a modified version of it; η denotes the learning step.

Jorge S. Marques, IST, 2017 130/279


Training modes: batch, mini-batch and on-line

The gradient vector includes the contribution of all the training patterns. The weight update using all the training patterns in each iteration is called the batch mode:

∆w_ij = −η ∂C/∂w_ij = −η Σ_k ∂L(y^(k), ŷ^(k))/∂w_ij = −η Σ_k ∂L^(k)/∂w_ij.

Another alternative consists of using a single training pattern k, and updating the weights with that information only. This is called the on-line mode or stochastic gradient:

∆w_ij = −η ∂L^(k)/∂w_ij.

A third hypothesis consists of updating the NN weights using a small subset of training patterns. This is known as the mini-batch mode.

Jorge S. Marques, IST, 2017 131/279


Chain rule of differentiation

To train the weights w_ij we need the gradient of the loss function L. This task relies on the chain rule of differentiation:

dz/dx = (dz/dy)(dy/dx)        dz/dx = (∂z/∂w)(dw/dx) + (∂z/∂v)(dv/dx)

Jorge S. Marques, IST, 2017 132/279


Training a single unit
Let us start with a simple problem: a network with a single unit, trained with a single pattern (x, y).

Forward network:

s = w_0 + Σ_{i=1}^p w_i x_i,   ŷ = z = g(s)

Gradient:

∂L/∂w_p = (dL/ds)(∂s/∂w_p) = ε x_p

Therefore, the gradient is given by

∂L/∂w_p = x_p ε,   ε = dL/ds = (dL/dŷ)(dŷ/ds) = g'(s) dL/dŷ

Compare with the Rosenblatt algorithm.


Jorge S. Marques, IST, 2017 133/279
Gradient structure
The structure of the gradient can be extended to more general cases.

If unit q belongs to a layer ℓ higher than 1,

s_q = w_0q + Σ_{i∈previous layer} w_iq z_i.

Using the chain rule, the derivative of L with respect to a weight w_pq can be computed as

∂L/∂w_pq = (∂L/∂s_q)(∂s_q/∂w_pq) = z_p ε_q.

Therefore,

∂L/∂w_pq = z_p ε_q,   ε_q = ∂L/∂s_q,   q ∈ layer higher than 1.

If the unit q belongs to the first layer, z_p is replaced by x_p.

Jorge S. Marques, IST, 2017 134/279


Training the output layer
These ideas can be applied to NN with multiple layers. Let us start by
the output layer.

Forward network (unit j ∈ {6, 7}):

s_j = w_{0j} + Σ_{i ∈ previous layer} w_{ij} z_i ,    z_j = g(s_j)

Gradient (unit q ∈ {6, 7}):

∂L/∂w_{pq} = z_p ε_q

where

ε_q = ∂L/∂s_q = (∂L/∂z_q)(∂z_q/∂s_q) = g′(s_q) ∂L/∂z_q

Jorge S. Marques, IST, 2017 135/279


Training a hidden layer
Let us consider units from a hidden layer.

Forward network (j ∈ {3, 4, 5}):

s_j = w_{0j} + Σ_{i=1}^p w_{ij} z_i ,    z_j = g(s_j)

Gradient (q ∈ {3, 4, 5}):

∂L/∂w_{pq} = z_p ε_q

where

ε_q = ∂L/∂s_q = Σ_{j ∈ next layer} (∂L/∂s_j)(∂s_j/∂z_q)(∂z_q/∂s_q)
    = g′(s_q) Σ_{j ∈ next layer} w_{qj} ε_j

Jorge S. Marques, IST, 2017 136/279


Backpropagation algorithm
The gradient components are given by

∂L/∂w_ij = z_i ε_j ,

where z_i is obtained from the multilayer perceptron and ε_j is obtained
from an auxiliary network called the backpropagation network.

This algorithm for the computation of the gradient using the
backpropagation network is known as the backpropagation algorithm.
Jorge S. Marques, IST, 2017 137/279
Backpropagation network
How do we build the backpropagation network?

The backpropagation network (right) is obtained from the original
network (left) by
I linearizing nonlinear units (activation functions);
I inverting the direction of links, converting sums into derivation
points (and vice versa);
I the outputs of the linearized branches are the variables ε_i ;
I the inputs of the backpropagation network are the derivatives of the
loss with respect to the forward network outputs.
Jorge S. Marques, IST, 2017 138/279
Acceleration techniques

The convergence of the gradient algorithm is often very slow and


acceleration techniques are usually adopted, namely:

I momentum term;
I adaptive weights.

These techniques modify the weight update rule and were discussed
before in the optimization lesson. Next we summarize the steps involved
in the gradient algorithm with momentum term (batch and on-line).

Jorge S. Marques, IST, 2017 139/279


Gradient algorithm (batch) with momentum term
Set t = 1 and ∆w_ij (0) = 0. Repeat steps 1 through 4 below until the
stopping criterion is met
1. Set the variables g_ij to zero. These variables will be used to
accumulate the gradient components.
2. For k = 1, . . . , n, perform steps 2.1 through 2.4
2.1 propagate forward: apply the training pattern x^(k) to the perceptron
and compute the variables z_i and outputs ŷ_j^(k)
2.2 compute the cost derivatives: ∂L^k/∂ŷ_j^(k)
2.3 propagate backwards: apply ∂L^k/∂ŷ_j^(k) to the inputs of the
backpropagation network and compute its internal variables ε_j
2.4 compute and accumulate components: compute the variables
∂L^k/∂w_ij = z_i ε_j and accumulate each of them in the corresponding
variable, i.e., g_ij ← g_ij + z_i ε_j
3. Apply momentum: set ∆w_ij (t) = −η g_ij + α ∆w_ij (t − 1)
4. Update the weights: set w_ij (t + 1) = w_ij (t) + ∆w_ij (t)
adapted from L. Almeida, Handbook of Neural Computation, 1997.

Jorge S. Marques, IST, 2017 140/279


Gradient algorithm (on-line) with momentum term

Set t = 1 and ∆w_ij (0) = 0. Repeat step 1 until the stopping criterion is met
1. For k = 1, . . . , n, perform steps 1.1 through 1.6
1.1 propagate forward: apply the training pattern x^(k) to the perceptron
and compute the variables z_i and outputs ŷ_j^(k)
1.2 compute the cost derivatives: ∂L^k/∂ŷ_j^(k)
1.3 propagate backwards: apply ∂L^k/∂ŷ_j^(k) to the inputs of the
backpropagation network and compute its internal variables ε_j
1.4 compute the gradient components: compute the variables ∂L^k/∂w_ij = z_i ε_j
1.5 Apply momentum: set ∆w_ij (t) = −η z_i ε_j + α ∆w_ij (t − 1)
1.6 Update the weights: set w_ij (t + 1) = w_ij (t) + ∆w_ij (t)

adapted from L. Almeida, Handbook of Neural Computation, 1997.
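The sketch below puts the pieces together for a one-hidden-layer perceptron
with logistic units and quadratic loss, trained on-line with a momentum term
(a possible implementation under these assumptions; all names are illustrative):

import numpy as np

def g(s):                                        # logistic activation
    return 1.0 / (1.0 + np.exp(-s))

def train_mlp_online(X, Y, n_hidden=5, eta=0.1, alpha=0.9, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.normal(0, 0.5, (n_hidden, p + 1))   # hidden layer weights (incl. bias)
    W2 = rng.normal(0, 0.5, (1, n_hidden + 1))   # output layer weights (incl. bias)
    dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
    for _ in range(epochs):
        for k in rng.permutation(n):
            # 1. propagate forward
            x1 = np.concatenate(([1.0], X[k]))
            z = g(W1 @ x1)                       # hidden unit outputs
            z1 = np.concatenate(([1.0], z))
            yhat = g(W2 @ z1)[0]                 # network output
            # 2.-3. cost derivative and backward propagation (quadratic loss)
            eps2 = -2.0 * (Y[k] - yhat) * yhat * (1 - yhat)   # output unit
            eps1 = z * (1 - z) * (W2[0, 1:] * eps2)           # hidden units
            # 4.-5. momentum and weight update: dw = -eta * grad + alpha * dw_old
            dW2 = -eta * eps2 * z1[None, :] + alpha * dW2
            dW1 = -eta * np.outer(eps1, x1) + alpha * dW1
            W2 += dW2; W1 += dW1
    return W1, W2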

Jorge S. Marques, IST, 2017 141/279


Example
Output of a multi layer perceptron trained by the gradient algorithm
using the backpropagation method.
I data: 150 training patterns; binary outcome
I architecture: 2 inputs, 2 hidden layers and one output layer (5-3-1
units)
I activation function: logistic
I training mode: on-line, no acceleration techniques

Jorge S. Marques, IST, 2017 142/279


Regression vs classification

MLPs can be used for regression and for classification tasks.

In regression tasks the output units typically have linear activation


functions and the network is trained with quadratic loss (SSE).

In classification tasks the output units typically have logistic or Softmax


activation functions and the network is often trained with negative
log-likelihood (cross-entropy) loss. This will be discussed later.

Jorge S. Marques, IST, 2017 143/279


Example

Write all the equations required to compute the gradient components,


assuming one training example and the SSE cost

SSE = (y1 − ŷ1 )2 + (y2 − ŷ2 )2


Forward network:
s_1 = w_{01} + w_{11} x_1 ,  z_1 = g(s_1)
s_2 = w_{02} + w_{22} x_2 ,  z_2 = g(s_2)
s_3 = w_{03} + w_{13} z_1 + w_{23} z_2 ,  z_3 = g(s_3)
s_4 = w_{04} + w_{14} z_1 + w_{24} z_2 ,  z_4 = g(s_4)

Backward network:
inputs: 2(ŷ_1 − y_1 ), 2(ŷ_2 − y_2 )
ε_4 = g′(s_4 ) × 2(ŷ_2 − y_2 )
ε_3 = g′(s_3 ) × 2(ŷ_1 − y_1 )
ε_2 = g′(s_2 )[w_{24} ε_4 + w_{23} ε_3 ]
ε_1 = g′(s_1 )[w_{14} ε_4 + w_{13} ε_3 ]

Jorge S. Marques, IST, 2017 144/279


Example (cont)

∆_{01} = 1·ε_1 ,  ∆_{11} = x_1 ε_1 ;   ∆_{03} = 1·ε_3 ,  ∆_{13} = z_1 ε_3 ,  ∆_{23} = z_2 ε_3
∆_{02} = 1·ε_2 ,  ∆_{22} = x_2 ε_2 ;   ∆_{04} = 1·ε_4 ,  ∆_{14} = z_1 ε_4 ,  ∆_{24} = z_2 ε_4

Jorge S. Marques, IST, 2017 145/279


Exercises

1. Consider the multi-layer perceptron sketched in previous slides. Write


the equations for the gradient of the loss function with respect to all the
weights.

2. Write the equations for the forward and backpropagation networks
using matrix notation. Consider vectors s^(ℓ) , y^(ℓ) , ε^(ℓ) containing the s, y
and ε variables associated to layer ℓ.

Jorge S. Marques, IST, 2017 146/279


Exercises

I Consider a MLP with one layer and linear units. Prove that the
input output map performed by the perceptron is a linear (affine)
transformation
f (x) = Ax + b
I prove this statement for the case of a MLP with two layers and
linear units.

This property can be extended to an arbitrary number of layers provided
the units are linear. Matrix A may be rank deficient if one of the hidden
layers has fewer units than the input or the output.

Jorge S. Marques, IST, 2017 147/279


Analysis of images with neural networks

Imagine that you want to distinguish images of horses and cats. How
would you proceed?

Jorge S. Marques, IST, 2017 148/279


Analysis of images with neural networks

Images encode information in very complicated ways (different


viewpoints, shapes, colors, textures, illumination).

Finding a set of rules on low level image features (e.g., color, corners)
seems to be unfeasible!

Jorge S. Marques, IST, 2017 149/279


Analysis of images with neural networks

However, some properties should hold:


I There is a spatial dependence in images.
I There are small regions that convey important information (e.g.,
eyes, ears). Some kind of pattern matching might work.
I The system should be invariant to translation, color, and
illumination changes.

How do we put this information in an algorithm?

Maybe the human brain can inspire us again.

Jorge S. Marques, IST, 2017 150/279


Convolutional neural networks

Convolutional neural networks (CNN) have recently achieved an


enormous success in the analysis of images.

History

I receptive fields (Hubel and Wiesel, 1950s, 60s): individual neurons


in the visual cortex respond to small regions in the field of view.

I neocognitron (Fukushima, 1980): hierarchical model using receptive


fields.

I LeNet-5 (LeCun et al., 1998): convolutional neural network


proposed for digit recognition.

I Alexnet (Krizhevsky et al., 2012): convolutional neural network.


Breakthrough in Imagenet international challenge.

Jorge S. Marques, IST, 2017 151/279


ImageNet - Large Scale Visual Recognition Challenge

LSVRC - Imagenet dataset


I 1K categories
I 1M images (1K images per category)
I annotation: manual annotation using Amazon Mechanical Turk.

Jorge S. Marques, IST, 2017 152/279


Breakthrough 2012 - Alexnet
Alexnet won the ImageNet challenge in 2012 by a large margin.

A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012

Since then, all the winners of the ImageNet challenge are convolutional
neural networks
Jorge S. Marques, IST, 2017 153/279
End-to-end architecture

In Alexnet, no image features (handcrafted features) are defined by the


user.

Alexnet learns to directly compute the image label (class) for the input
image. Of course this requires a large training set (more than 1 million
images).

This strategy is called the end-to-end approach.

The classic blocks (feature extraction and classification) are both learned
from the training data without the use of handcrafted features.

Jorge S. Marques, IST, 2017 154/279


Basic CNN architecture

A convolutional neural network (CNN) receives an input image and


predicts an output image or a label, based on a sequence of internal
representations that extract useful information (features) from the image.

Most of these representations are 3D arrays. Each 3D array can be


viewed as a collection of 2D arrays known as channels or feature maps.

Internal representations of the image are obtained by a concatenation of


layers, including:
I convolutional layers: convolution followed by non-linearity
(activation function)
I pooling layers: dimensionality reduction
I fully connected layers: used in classification problems

Jorge S. Marques, IST, 2017 155/279


Convolution layer
A convolution layer receives a 3D input, convolves it with a set of kernels
(filters) and applies an activation function (typically RELU) to the filter
outputs.

Each kernel has a localized support in the first two (spatial) coordinates
and it is full range in the third (depth) coordinate.

Jorge S. Marques, IST, 2017 156/279


Convolutional layer

3D input: z^(ℓ−1)_{ijk}    (ℓ − 1 is the number of the input layer)

3D kernel: h^(ℓ)_{ijk}

2D output:

s^(ℓ)_{ij} = Σ_p Σ_q Σ_r h^(ℓ)_{pqr} z^(ℓ−1)_{i+p, j+q, r}

z^(ℓ)_{ij} = g(s^(ℓ)_{ij})

Each filter produces a 2D output known as a feature map. Stacking the


feature maps produced by multiple filters leads to a 3D array.
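A minimal NumPy sketch of this layer, assuming stride 1, no padding and a
RELU activation (names are illustrative):

import numpy as np

def conv_layer(z, kernels, g=lambda s: np.maximum(s, 0)):
    # z: input array (H, W, D); kernels: (F, kh, kw, D), one 3D kernel per filter
    # output: (H-kh+1, W-kw+1, F), one feature map per filter
    H, W, D = z.shape
    F, kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, F))
    for f in range(F):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # localized support in the spatial coordinates, full range in depth
                out[i, j, f] = np.sum(kernels[f] * z[i:i+kh, j:j+kw, :])
    return g(out)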

Jorge S. Marques, IST, 2017 157/279


Pooling
Pooling reduces the size of a 3D array.

Each channel is separately processed. First the channel is divided into


non-overlapping cells (e.g., ∆ × ∆). Then, each cell is replaced by a
numeric value (e.g., its maximum, or mean).

Jorge S. Marques, IST, 2017 158/279


Pooling

3D input: z^(ℓ−1)_{ijk}    (ℓ − 1 is the number of the input layer)

3D output:

z^(ℓ)_{ijk} = max_{p,q ∈ {0,...,∆−1}} z^(ℓ−1)_{∆i+p, ∆j+q, k}
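A corresponding sketch of max pooling over non-overlapping ∆ × ∆ cells
(names are illustrative; H and W are assumed multiples of ∆):

import numpy as np

def max_pool(z, delta=2):
    # z: (H, W, D); each channel is processed separately
    H, W, D = z.shape
    out = np.zeros((H // delta, W // delta, D))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            cell = z[delta*i:delta*(i+1), delta*j:delta*(j+1), :]
            out[i, j, :] = cell.max(axis=(0, 1))   # one value per cell and channel
    return out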

Jorge S. Marques, IST, 2017 159/279


Fully connected layer
The fully connected layer is used when the image representation is
converted into a 1D array.

It is often used as an output layer in classification problems.


Jorge S. Marques, IST, 2017 160/279
Alexnet

Layer 1 - convolutional
I maxpooling: No
I input: 224 × 224 × 3
I kernel: 96 × 11 × 11 × 3
I stride: 4
I units: 55 × 55 × 96

Layer 2 - max pooling followed by convolutional
I maxpooling: 2 × 2
I input: 55 × 55 × 96
I kernel: 256 × 5 × 5 × 96
I stride: 1
I units: 27 × 27 × 256

Layers 3, 4, 5 - similar

Layer 6 - pooling + fully connected
I maxpooling: 2 × 2
I input: 13 × 13 × 256
I units: 4096

Layer 7 - fully connected
I input: 4096
I units: 4096

Layer 8 - fully connected
I input: 4096
I units: 1000

Jorge S. Marques, IST, 2017 161/279


Alexnet
Number of weights to be learned

layer expression weights


1 (11 × 11) × 3 × 96 0.03 M
2 (5 × 5) × 96 × 256 0.6 M
3 (3 × 3) × 256 × 384 0.8 M
4 (3 × 3) × 384 × 384 1.3 M
5 (3 × 3) × 384 × 256 0.8 M
6 (6 × 6) × 256 × 4096 37.7 M
7 4096 × 4096 16.7M
8 4096 × 1000 4.1 M

The Alexnet has 60 million weights, almost all of them associated with the
last three layers (the fully connected layers).

This high number of weights leads to overfitting problems in the training


phase. Some kind of regularization must be considered.

Jorge S. Marques, IST, 2017 162/279


Other convolutional neural networks: VGG

I deeper network
I deeper layers
I kernels: smaller spatial dimensions (3 × 3)

Jorge S. Marques, IST, 2017 163/279


Other convolutional neural networks: GoogLeNet

convolution, pooling, softmax, merge

I inception module
I 1x1 convolution

Jorge S. Marques, IST, 2017 164/279


Other convolutional neural networks: ResNet

I very deep network


I shortcut connections

Jorge S. Marques, IST, 2017 165/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 166/279


What is a classifier?
Example: fish classification

length  volume  class
0.36    0.67    tuna
0.82    0.56    tuna
0.46    0.67    sword
0.40    0.30    sword
0.60    0.80    tuna
0.61    0.47    tuna
0.21    0.41    sword

Given an observation x ∈ Rp , we wish to predict its class y ∈ Ω, where

Ω = {ω0 , . . . , ωK −1 } or Ω = {0, . . . , K − 1}.

K is the number of classes and the ith class will be denoted by ωi or


simply by i.

We wish to learn a function f (x) that associates each feature vector


x ∈ Rp with the predicted class ŷ = f (x) ∈ Ω. This function is known as
a classifier.

Jorge S. Marques, IST, 2017 167/279


Discriminant functions

An alternative way to define a classifier is by using K functions

fi : Rp → R , i = 0, . . . , K − 1

such that x is classified in class ωi iff

fi (x) ≥ fj (x) ,  ∀j ≠ i.

These functions fi (x) are called discriminant functions.

Note: if fi (x) = fj (x) the classification is ambiguous.

Jorge S. Marques, IST, 2017 168/279


Decision regions and decision boundary

A classifier f (x) splits the input space Rp into K disjoint regions, Rj ,
each of them associated to a specific class ωj , with j ∈ {0, . . . , K − 1}:

Rj = {x ∈ Rp : f (x) = ωj } .


These regions are known as decision regions.

The boundary points of these decision regions are called decision


boundaries or decision surfaces.

Knowing the decision regions is equivalent to knowing the classifier f (x),
or a set of discriminant functions fi (x), i = 0, . . . , K − 1. In fact, the
indicator functions of the regions are a set of discriminant functions.

Jorge S. Marques, IST, 2017 169/279


Classifier design

The main question is:

Given a classification problem, how do we define the classifier f (x) or a


set of discriminant functions f0 (x), . . . , fK −1 (x) or the decision regions
R0 , . . . , RK −1 ?

The three representations are equivalent.

Two cases will be considered:


I we know the probability distribution of the data (ideal case);
I we only know a data set (training set).

Jorge S. Marques, IST, 2017 170/279


Classifier evaluation - confusion matrix

The confusion matrix P is a K × K matrix whose generic element Pij is


the joint probability of true class i being predicted as class j.

Pij = Pr {y = i, ŷ = j}

Properties:
Pij ∈ [0, 1], ∀i, j
Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} Pij = 1.

Matrix P is a joint probability distribution: the diagonal elements
correspond to true decisions and the off-diagonal elements correspond to
errors.

If we normalize each line i to sum 1, the ith normalized line explains
how the classifier predicts the data from class i: what is the probability of
error in class i and which errors are most probable.

Jorge S. Marques, IST, 2017 171/279


Probability of error

The probability of error can be obtained from the confusion matrix P

P(error) = 1 − Σ_{i=0}^{K−1} Pii

Proof

P(error) = 1 − P(correct decision)
         = 1 − Σ_{i=0}^{K−1} P(correct decision, y = i)
         = 1 − Σ_{i=0}^{K−1} P(y = i, ŷ = i)
         = 1 − Σ_{i=0}^{K−1} Pii

Jorge S. Marques, IST, 2017 172/279


How to compute the confusion matrix

In simple cases, the confusion matrix can be analytically evaluated.

Assuming that a feature vector associated to class i is generated
according to a pdf p(x|y = i), and that class j is chosen by the classifier
when x ∈ Rj ,

Pij = Pr {y = i, x ∈ Rj } = ∫_{Rj} p(x|y = i) P(y = i) dx

When the integral cannot be evaluated, the confusion matrix may be
experimentally obtained, as in the sketch below:
I perform N classification experiments;
I count how many training examples from class i are classified in class
j (Nij );
I estimate Pij using the relative frequency P̂ij = Nij / Σ_p Σ_q Npq .
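A possible NumPy sketch of this relative-frequency estimate (names are
illustrative; classes are assumed to be coded as integers 0, . . . , K − 1):

import numpy as np

def confusion_matrix(y_true, y_pred, K):
    # N[i, j] counts examples of true class i predicted as class j
    N = np.zeros((K, K))
    for i, j in zip(y_true, y_pred):
        N[i, j] += 1
    return N / N.sum()              # estimate of P_ij = Pr{y = i, yhat = j}

# probability of error: 1 minus the sum of the diagonal
# P_err = 1 - np.trace(confusion_matrix(y_true, y_pred, K))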

Jorge S. Marques, IST, 2017 173/279


Example (1)
Compute the confusion matrix and the probability of error, assuming that
I x ∈ [0, 1], y ∈ {0, 1}, p(x|y = 0) = 1, p(x|y = 1) = 2x, ∀x ∈ [0, 1];

I the classifier is characterized by the decision regions


R0 = [0, T [, R1 = [T , 1].

Confusion matrix

P00 = Pr (y = 0, ŷ = 0) = Pr (ŷ = 0|y = 0) Pr (y = 0) =
    = P0 ∫_{R0} p(x|y = 0) dx = P0 ∫_0^T 1 dx = P0 T ,

P01 = P0 − P00 = P0 (1 − T ),

P10 = Pr (y = 1, ŷ = 0) = Pr (ŷ = 0|y = 1) Pr (y = 1) =
    = P1 ∫_{R0} p(x|y = 1) dx = P1 ∫_0^T 2x dx = P1 T² ,

P11 = P1 − P10 = P1 (1 − T²).

Jorge S. Marques, IST, 2017 174/279


Example (2)

Confusion matrix

P = [ P0 T    P0 (1 − T )  ]
    [ P1 T²   P1 (1 − T²)  ]

Probability of error

P(error) = 1 − (P00 + P11 ) = P1 T² − P0 T + 1 − P1

The threshold T is chosen by the user. For example, it can be chosen by
minimizing P(error).
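For instance, minimizing P(error) with respect to T (assuming P1 > 0 and
that the minimizer falls inside [0, 1]):

dP(error)/dT = 2 P1 T − P0 = 0  ⟹  T* = P0 / (2 P1 ),

and since d²P(error)/dT² = 2 P1 > 0, this stationary point is a minimum.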

Jorge S. Marques, IST, 2017 175/279


Loss function
The confusion matrix is enough if all the errors are equally important and
the classes are equally probable.

Sometimes this is not true. We need to define a loss function L(y , ŷ )


that assigns a penalty when the true class of x is y and the predicted
class is ŷ .

Examples:

binary loss: L(y , ŷ ) = 0 if ŷ = y , and 1 otherwise.

general loss: L(y = ωi , ŷ = ωj ) = Lij ,  with Lii = 0 ,  Lij > 0 ,  i ≠ j.

The first case is a binary loss (no error/error).

The second case is a square K × K matrix of penalties with zeros in the
diagonal (decisions without error) and different costs associated to the
different types of errors.
Jorge S. Marques, IST, 2017 176/279
Loss function (cont)

Example: medical diagnosis: ω0 - no tumor , ω1 - tumor

L = [ 0 1 ]        L = [ 0 1 ]
    [ 1 0 ]            [ 5 0 ]

Question: which loss is more appropriate for this problem?

Question: which loss is more appropriate for this problem?

The loss function is not differentiable. We cannot use optimization


algorithms based on the gradient or Hessian matrix to reduce the loss :-(.

Jorge S. Marques, IST, 2017 177/279


Is there an ideal classifier?

The answer is yes, if x, y are realizations of random variables X , Y with
known distribution and we wish to minimize the expected loss, also
known as (aka) the risk (ideal case of known distributions)

R = E {L(y , ŷ (x))} .

If the loss is binary, the optimal classifier is given by

ŷ = arg max_{ω∈Ω} P(y = ω|x) .

This is known as the Bayes classifier and chooses the class with greatest
a posteriori probability (the most probable class, given the observations).

The Bayes classifier with binary loss is optimal in the sense that it
minimizes the probability of decision error.

Jorge S. Marques, IST, 2017 178/279


Is there an ideal classifier? (cont)

If we adopt a general loss function, the optimal classifier is also simple.


We compute the expected cost of choosing class ŷ = ω

cω (x) = Σ_{y∈Ω} L(y , ω) P(y |x),

and choose the class with smallest cost, i.e., the feature vector x should
be classified as follows

f (x) = arg min_{ω∈Ω} cω (x).

This is an optimal classifier in the sense that it minimizes the risk for a
general loss function; it is also known as the Bayes classifier.
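In code the rule is a one-liner; a sketch assuming the a posteriori
probabilities are available as a vector (names are illustrative):

import numpy as np

def bayes_decision(post, L):
    # post: length-K vector with P(y = omega | x); L: K x K loss matrix
    costs = post @ L          # costs[w] = sum_y L(y, w) P(y | x) = c_w(x)
    return int(np.argmin(costs))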

Jorge S. Marques, IST, 2017 179/279


Proof
Risk with a general loss matrix

R = E {L(y , ŷ (x))} = ∫ Σ_{y∈Ω} L(y , ŷ (x)) p(x, y ) dx
  = ∫ [ Σ_{y∈Ω} L(y , ŷ (x)) P(y |x) ] p(x) dx

Minimization can be independently performed at each feature vector x:

f (x) = arg min_{ω∈Ω} Σ_{y∈Ω} L(y , ω) P(y |x) = arg min_{ω∈Ω} cω (x)

If the loss is binary we obtain

f (x) = arg min_{ω∈Ω} [1 − P(ω|x)] = arg max_{ω∈Ω} P(ω|x)

Jorge S. Marques, IST, 2017 180/279


a posteriori distribution of the classes

The a posteriori distribution of the classes P(y = i|x) is the distribution


of the classes after observing the feature vector x.

These probabilities can be obtained by using the Bayes law

P(y = i|x) = p(x|y = i) P(y = i) / p(x) ,
where
I p(x|y = i) - distribution of the feature vector x associated to class i;
I P(y = i) - a priori distribution of the classes (before knowing the
observations);
I p(x) - normalization term that does not influence the decision,
p(x) = Σ_{y∈Ω} p(x|y ) P(y ).

Jorge S. Marques, IST, 2017 181/279


Example: binary classification with Gaussian features (1d)

This example considers two classes (K = 2) with Gaussian 1d features:
x|ωi ∼ N(µi , σi²), i = 0, 1, with equal variances σ0² = σ1² = σ².
A priori distribution: P1 = P(ω1 ), P0 = P(ω0 ) = 1 − P1 .

[Figure: three panels — the conditional distribution of the data p(x|y );
the joint distribution of data and classes p(x, y ); the a posteriori
distribution of the classes P(y |x).]

Data classification can be obtained by maximizing the joint distribution


of data and class p(x, y ) or the a posteriori distribution of classes P(y |x),
with respect to y .
Jorge S. Marques, IST, 2017 182/279
Exercises

1. Consider a discrete feature variable x ∈ {1, 2, 3, 4} and an associated
binary class y ∈ {ω0 , ω1 }. Assume that y is characterized by the a
priori distribution P(ω0 ) = 0.4, P(ω1 ) = 0.6 and the observations x
are characterized by the conditional distribution defined by the table.

p(x|y )    ω0     ω1
x = 1      0.3    0.2
x = 2      0.2    0.3
x = 3      0.1    0.4
x = 4      0.4    0.1

Derive the Bayes classifier assuming a binary loss.

Jorge S. Marques, IST, 2017 183/279


Exercises

2. Consider an observation x ∈ {0, . . . , p} generated by one of the two
binomial distributions

P(x|ωi ) = C(p, x) αi^x (1 − αi )^{p−x} ,  i = 0, 1,

where C(p, x) is the binomial coefficient and p, α1 > α0 are known
parameters. Find the decision regions of the Bayes classifier, assuming
a binary loss matrix and equally probable classes.

3. Assume that x ∈ R0+ is a realization of a random variable with one
of the following density functions

p(x|y = k) = αk e^{−αk x} ,  k = 0, 1,  α1 > α0 > 0.

Find the decision regions associated to both classes. Assume that
P1 = 2 P0 .

Jorge S. Marques, IST, 2017 184/279


Learning the classifier

In practice, we often do not know the joint distribution of the features


and true class p(x, y ), required in the design of the Bayes classifier.

In many practical problems, all we know is a training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} with n realizations of the pair X , Y .

We could learn a classifier by minimizing the empirical risk

R = (1/n) Σ_{i=1}^n L(y^(i) , f (x^(i) ))

However, this is a difficult approach because y (i) and f (x (i) ) are


categorical variables and most optimization algorithms cannot be used.

Jorge S. Marques, IST, 2017 185/279


Learning the classifier

Alternative approaches are required. Some classification techniques try to


approximate the ideal (Bayes) classifier by estimating the a posteriori
probabilities of the classes, P(y |x), as a function of the feature vector x.

This can be done directly by proposing a class of functions for such


probabilities or by estimating the data distribution p(x|y = k),
k = 1, . . . , K and applying the Bayes law.

Other methods try to directly estimate a set of discriminant functions


without trying to estimate the data distribution which is considered to be
a more difficult problem.

This approach is supported by the Vapnik principle.

Vapnik principle: When trying to solve a problem, we should not solve a


more difficult problem as an intermediate step.

Jorge S. Marques, IST, 2017 186/279


Example - digit recognition
Digit recognition aims to recognize handwritten digits in images, in an
automatic way. It involves two steps:
I The first step consists of computing a bounding box for each digit
with, e.g., 20 × 20 pixels.
I The second step involves the classification of each 20 × 20 image.

examples from MNIST data set

If the feature vector, x, contains the intensity of 400 pixels, it is very


difficult to estimate the conditional distribution p(x|ωi ) : R400 → R.

Jorge S. Marques, IST, 2017 187/279


Naı̈ve Bayes classifier
When the feature vector x = [x1 , . . . , xp ]^T contains many features, the
estimation of the conditional distribution p(x|y = k) is a difficult
problem.

The Naı̈ve Bayes classifier simplifies the problem by making a drastic
assumption: it assumes that the features are conditionally independent

p(x1 , . . . , xp |y = k) = Π_{i=1}^p p(xi |x1 , . . . , xi−1 , y = k) = Π_{i=1}^p p(xi |y = k)

This means that we only need to estimate the conditional distribution of
each feature.

In the digit recognition problem this means that we have to estimate the
conditional distribution of each pixel, which is a simple task.

The Naı̈ve Bayes classifier is a suboptimal classifier if the independence
assumption is not true, but it often leads to surprisingly good results.
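A sketch of the resulting decision rule (an assumption-laden illustration:
priors[k] is assumed to hold P(y = k) and cond[k][i] to map each value of
feature i to p(x_i | y = k); sums of logs are used to avoid underflow):

import numpy as np

def naive_bayes_predict(x, priors, cond):
    # score each class by log P(y = k) + sum_i log p(x_i | y = k)
    scores = [np.log(priors[k]) +
              sum(np.log(cond[k][i][x[i]]) for i in range(len(x)))
              for k in range(len(priors))]
    return int(np.argmax(scores))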

Jorge S. Marques, IST, 2017 188/279


Exercises

1. Draw a pair of scatter plots for two features (x1 , x2 ), assuming that
they are
I dependent;
I independent.

2. Discuss the problem of e-mail classification (spam/non-spam).

Jorge S. Marques, IST, 2017 189/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 190/279


Linear methods for classification

We denote by linear classifiers those whose decision boundaries are linear
(hyperplanes) or piece-wise linear.

One example is the class of methods based on affine discriminant functions
fi (x) = [1 x^T ]βi , where i stands for the class.

The decision boundary between two classes ωi , ωj is the set

{x ∈ Rp : [1 x^T ](βi − βj ) = 0} ,

which is (a subset of) a hyperplane in the feature space Rp .
Jorge S. Marques, IST, 2017 191/279


Class coding

Classification problems aim to predict a class label y ∈ {ω0 , . . . , ωK −1 }.

Some classifiers represent the class label by numbers and use regression
methods to predict those numbers.

One idea: ω0 → 0
ω1 → 1
ω2 → 2
This does not make much sense because in most problems there is no
natural order among the class labels.

A more interesting approach is the use of binary indicator variables:

y0 y1 y2
ω0 → 1 0 0
ω1 → 0 1 0
ω2 → 0 0 1

Jorge S. Marques, IST, 2017 192/279


One hot encoding

The indicator variable of class ωi is

yi = 1 if class ωi occurs, and 0 otherwise

The representation of the class labels through a set of indicator variables
is known as one hot encoding.

Classification works as follows. In the training phase, a set of predictors
fi (x) is learned to fit the indicator variables.

In the test phase, new feature vectors are classified by computing the
predictors fi (x) and selecting the one with greatest value

ŷ = arg max_i fi (x)

Jorge S. Marques, IST, 2017 193/279


Training indicator variable with constant input

Consider n realizations of an indicator variable y associated to an
arbitrary class ω: y^(1) , . . . , y^(n) (the input x^(k) is assumed constant).

Let us minimize the sum of squared errors

SSE = Σ_{k=1}^n (y^(k) − ŷ )²

Since y^(k) is a binary variable, the SSE can be split into two terms

SSE = n0 (0 − ŷ )² + n1 (1 − ŷ )²

where n0 , n1 are the numbers of 0s and 1s. Setting the derivative to zero,
−2n1 (1 − ŷ ) + 2n0 ŷ = 0, the minimization of SSE leads to

ŷ = n1 / (n0 + n1 )

This means that the minimization of SSE leads to the estimation of the
class probability P(ω). This idea can be extended as we will see.

Jorge S. Marques, IST, 2017 194/279


Linear regression of indicator variables

Consider a binary classification problem with classes ω0 , ω1 and let us


assume that y is the indicator variable of class ω1

We fit a linear model f (x) = [1 x^T ]β to the training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} using least squares.

The function f (x) can be considered as an estimate of the a posteriori
distribution of class ω1 . Since f (x) is linear, it takes values outside the
interval [0, 1].

After training the model, a new observation x can be classified by
comparing f (x) with a threshold 0.5

ŷ = 1 if f (x) > 0.5, and 0 otherwise.

Jorge S. Marques, IST, 2017 195/279


Example - 1D data
This example discusses two binary classification problems with 1D
features for which we know a training set (see figures). The data was fit
by a straight line and we display the decision boundary (cyan).

The first problem is well solved by linear regression of the indicator


variables. The second is not: all the features are classified in the same
class. Why?

Why does the linear regression with indicator variables fail in the second
example?
Jorge S. Marques, IST, 2017 196/279
Example - 2D data

This slide shows two problems with 2D features and a linear model. Only
the first problem can be solved by linear models. Why?

In the second case all the training features are classified in the same class.

Jorge S. Marques, IST, 2017 197/279


Regression with more flexible models
The previous difficulties can be circumvented by using more flexible
models (e.g. 2nd order polynomials in R and R2 ).


Notice, though, that these are not linear models with respect to x; they
are linear in the parameters, which can be estimated by a linear system
of equations.

Jorge S. Marques, IST, 2017 198/279


Drawbacks & extensions
The regressor function f (x) can be interpreted as an estimate of the a
posteriori probabilty P(ω1 |x) but it is not constrained to be in the
interval [0, 1]. Since the model is linear, it will take all real values.

The decision boundary between two classes is hyperplane. Therefore, the


technique can only be used if the data is well separated by a hyperplane.

The model can be easily extended to more flexible classes of functions


e.g., polynomials, radial basis functions, neural networks.

This approach can be easily extended to more than 2 classes by


considering K indicator variables (one per class) and fit a linear model to
predict these labels (one vs. all). This is often called one hot encoding.
Label prediction is performed by choosing the discriminant function with
the highest value
ŷ (x) = arg max fi (x).
i

Jorge S. Marques, IST, 2017 199/279


Logistic regression

Consider a binary classification problem in which y ∈ {0, 1}. The Bayes
classifier is based on the a posteriori distribution of the classes

P(y = 1|x), P(y = 0|x).

The logistic regression proposes a parametric model for the a posteriori
probabilities

P(y = 1|x) = 1 / (1 + e^{−x^T β}) ,    P(y = 0|x) = e^{−x^T β} / (1 + e^{−x^T β}) .

Where x ∈ Rp+1 is the feature vector and β ∈ Rp+1 the vector of
parameters to be estimated. We have included β0 in the vector β and
extended the feature vector x with a 1.

Jorge S. Marques, IST, 2017 200/279


Logistic regression

This model guarantees that

P(y = 1|x), P(y = 0|x) ∈ [0, 1] ,    P(y = 0|x) + P(y = 1|x) = 1.

It can be rewritten as follows

P(y = 1|x) = g (x^T β) ,    P(y = 0|x) = 1 − g (x^T β)

where

g (s) = 1 / (1 + e^{−s})

What is the relationship between the logistic regression and a perceptron


unit?
What is the meaning of the perceptron output ŷ , in this context?

Jorge S. Marques, IST, 2017 201/279


Logistic regression: learning

Given a training set T = {(x^(i) , y^(i) ), i = 1, . . . , n}, the coefficients β
can be estimated by the maximum likelihood method

β̂ = arg max_β ℓ(β) ,

where ℓ(β) is the conditional log-likelihood function

ℓ(β) = log P(y^(1) , . . . , y^(n) |x^(1) , . . . , x^(n) ; β) .

Since the training examples are independent

ℓ(β) = Σ_{i=1}^n log P(y^(i) |x^(i) )
     = Σ_{i=1}^n { y^(i) log[g (x^(i)T β)] + (1 − y^(i) ) log[1 − g (x^(i)T β)] } .

This function cannot be analytically optimized. We have to use
numerical optimization algorithms, e.g., the gradient ascent method.
Jorge S. Marques, IST, 2017 202/279
Logistic regression - gradient ascent

The gradient of the conditional log-likelihood function ℓ(β) can be easily
computed

∇β ℓ(β) = Σ_{i=1}^n [y^(i) − g (x^(i)T β)] x^(i) .

Therefore, the gradient ascent algorithm is given by

β^(t+1) = β^(t) + γ ∇β ℓ(β^(t) ) ,  γ > 0 .
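A compact NumPy sketch of this training loop (the fixed step and iteration
count are illustrative choices):

import numpy as np

def logistic_fit(X, y, gamma=0.1, iters=1000):
    # X: (n, p+1) with a leading column of ones; y in {0, 1}
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = 1.0 / (1.0 + np.exp(-X @ beta))     # g(x^T beta) for all patterns
        beta = beta + gamma * X.T @ (y - g)     # gradient ascent on l(beta)
    return beta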

Is it possible to train the logistic regressor using the SSE criterion?

Jorge S. Marques, IST, 2017 203/279


Log-likelihood gradient

Proof for 1 pattern (dropping the index i)

∇β ℓ(β) = ∇β { y log g (x^T β) + (1 − y ) log[1 − g (x^T β)] }

= y [g′(x^T β)/g (x^T β)] x − (1 − y ) [g′(x^T β)/(1 − g (x^T β))] x

Using g′(s) = g (s)[1 − g (s)],

= y [1 − g (x^T β)] x − (1 − y ) g (x^T β) x

= [y − g (x^T β)] x

Jorge S. Marques, IST, 2017 204/279


Example - logistic regression with 2D data

Consider two classification problems with 2D features described before.


Figures show the decision boundaries obtained by logistic regression.
Only the first problem can be solved by linear models.


In the second case all the training features are classified in the same
class. The model is too rigid.
Jorge S. Marques, IST, 2017 205/279
Logistic regression with more flexible models

The previous difficulties can be circumvented by using more flexible


models (e.g., 2nd order polynomials).


Notice that these models are not linear models with respect to x.

Jorge S. Marques, IST, 2017 206/279


Softmax
Softmax extends logistic regression to classification problems with an
arbitrary number of classes K .
The true class is expressed using indicator variables, a.k.a. one hot
encoding

y = (y0 , . . . , yK −1 ),  yi ∈ {0, 1},  Σ_{i=0}^{K−1} yi = 1

SOFTMAX proposes a model for the a posteriori probabilities of the
classes

ŷi = P(yi = 1|x, β) = e^{si} / Σ_{c=0}^{K−1} e^{sc}

where

si = Σ_j βji xj

This model guarantees that P(yi = 1|x, β) ∈ [0, 1], i = 0, . . . , K − 1 and
Σ_{i=0}^{K−1} P(yi = 1|x, β) = 1.

Jorge S. Marques, IST, 2017 207/279


Softmax: learning

Given a training set T = {(x^(1) , y^(1) ), . . . , (x^(n) , y^(n) )}, the coefficients β
can be estimated by the maximum likelihood method

β̂ = arg max_β ℓ(β) ,

where ℓ(β) is the conditional log-likelihood function

ℓ(β) = log P(y^(1) , . . . , y^(n) |x^(1) , . . . , x^(n) ; β) .

Since the training examples are independent

ℓ(β) = Σ_{m=1}^n log P(y^(m) |x^(m) ; β) = Σ_{m=1}^n Σ_{i=0}^{K−1} y_i^(m) log(ŷ_i^(m) ) .

This function cannot be analytically optimized. We must resort to
numerical optimization algorithms, e.g., the gradient ascent method.
Jorge S. Marques, IST, 2017 208/279


Softmax: gradient

log-likelihood (1 training example): ℓ = Σ_{i=0}^{K−1} yi log(ŷi )

Derivatives:

∂ℓ/∂ŷi = yi / ŷi

∂ŷi/∂sk = ŷi (1 − ŷi ) if i = k ;  −ŷi ŷk if i ≠ k

∂ℓ/∂si = Σ_k (∂ℓ/∂ŷk )(∂ŷk/∂si ) = (yi/ŷi ) ŷi (1 − ŷi ) − Σ_{k≠i} (yk/ŷk ) ŷk ŷi
       = yi − ŷi    (using Σ_k yk = 1)

∂ℓ/∂βij = (∂ℓ/∂si )(∂si/∂βij ) = (yi − ŷi ) xj

Jorge S. Marques, IST, 2017 209/279


Softmax: update

The previous expressions can be extended to multiple training examples.

log-likelihood (multiple examples):

ℓ(β) = Σ_{m=1}^n Σ_{i=0}^{K−1} y_i^(m) log(ŷ_i^(m) )

gradient:

∂ℓ/∂βij = Σ_{m=1}^n (y_i^(m) − ŷ_i^(m) ) x_j^(m)

Therefore, the gradient ascent algorithm is given by

β_ij^(t+1) = β_ij^(t) + γ ∂ℓ/∂βij ,  γ > 0 .
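A vectorized NumPy sketch of this update (names are illustrative; the scores
are shifted by their maximum before exponentiation, a standard numerical
stability trick that does not change ŷ):

import numpy as np

def softmax_fit(X, Y, gamma=0.1, iters=1000):
    # X: (n, p+1) with a leading column of ones; Y: (n, K) one hot encoded labels
    K, d = Y.shape[1], X.shape[1]
    B = np.zeros((K, d))                        # B[i, j] = beta_ij
    for _ in range(iters):
        S = X @ B.T                             # scores s_i for every pattern
        E = np.exp(S - S.max(axis=1, keepdims=True))
        Yhat = E / E.sum(axis=1, keepdims=True)
        B = B + gamma * (Y - Yhat).T @ X        # dl/dbeta_ij = sum_m (y_i - yhat_i) x_j
    return B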

Jorge S. Marques, IST, 2017 210/279


Linear discriminant analysis
Let us consider a binary classification problem. The a posteriori
probabilities can be obtained from the Bayes law

P(y = i|x) = p(x|y = i) P(y = i) / p(x) ∝ p(x|y = i) Pi ,

where the a priori probabilities Pi are approximated by the relative
frequencies of each class in the training set. The main challenge is the
estimation of p(x|y = i).

Linear discriminant analysis assumes that the distribution of the data
associated to class y = i follows a normal distribution N (µi , Σi ) and the
data from all classes share the same covariance matrix Σi = Σ.
Therefore,

p(x|y = i) = C e^{−(1/2)(x−µi )^T Σ^{−1} (x−µi )} ,

where C = 1/((2π)^{p/2} |Σ|^{1/2} ) is a normalization constant.
Jorge S. Marques, IST, 2017 211/279


Linear discriminant analysis

Under these hypotheses, the decision boundary between classes i, j is
given by

p(x|y = i) Pi = p(x|y = j) Pj

C e^{−(1/2)(x−µi )^T Σ^{−1} (x−µi )} Pi = C e^{−(1/2)(x−µj )^T Σ^{−1} (x−µj )} Pj .

The constants C are equal because the covariance matrices are the same.
Taking logs leads to

(µi − µj )^T Σ^{−1} x = (1/2)(µi + µj )^T Σ^{−1} (µi − µj ) + log(Pj /Pi ) .

This is the equation of a hyperplane with normal vector (µi − µj )^T Σ^{−1} .
Jorge S. Marques, IST, 2017 212/279


Linear discriminant analysis
Proof

log[p(x|y = i) Pi ] = log[p(x|y = j) Pj ]

log C − (1/2)(x−µi )^T Σ^{−1} (x−µi ) + log Pi
    = log C − (1/2)(x−µj )^T Σ^{−1} (x−µj ) + log Pj

(x − µi )^T Σ^{−1} (x − µi ) − 2 log Pi = (x − µj )^T Σ^{−1} (x − µj ) − 2 log Pj

−2µi^T Σ^{−1} x + µi^T Σ^{−1} µi = −2µj^T Σ^{−1} x + µj^T Σ^{−1} µj − 2 log(Pj /Pi )

(µi − µj )^T Σ^{−1} x = (1/2) µi^T Σ^{−1} µi − (1/2) µj^T Σ^{−1} µj + log(Pj /Pi )

(µi − µj )^T Σ^{−1} x = (1/2)(µi + µj )^T Σ^{−1} (µi − µj ) + log(Pj /Pi )

Jorge S. Marques, IST, 2017 213/279


Linear discriminant analysis

In practice, the parameters of the Gaussian distributions are learned from
the training data

P̂(ωk ) = nk / n

µ̂k = (1/nk ) Σ_{i: y^(i) = k} x^(i)

Σ̂ = 1/(n − K ) Σ_{k=1}^K [ Σ_{i: y^(i) = k} (x^(i) − µ̂k )(x^(i) − µ̂k )^T ] .

The covariance matrix has p² entries. If p is large (hundreds or
thousands), Σ̂ may be inaccurate and singular; its inverse, which LDA
requires, then does not exist. It is common practice to enforce additional
constraints on Σ, e.g., to assume Σ is a diagonal matrix.
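A NumPy sketch of these estimates and of the resulting discriminants (names
are illustrative; as noted above, a diagonal or otherwise constrained Σ̂ may
be needed when p is large):

import numpy as np

def lda_fit(X, y, K):
    n, p = X.shape
    P = np.array([(y == k).mean() for k in range(K)])           # priors n_k / n
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # class means
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[y == k] - mu[k]
        Sigma += D.T @ D
    return P, mu, Sigma / (n - K)

def lda_predict(x, P, mu, Sigma):
    # linear discriminants: mu_k' S^-1 x - (1/2) mu_k' S^-1 mu_k + log P_k
    Si = np.linalg.inv(Sigma)
    scores = [mu[k] @ Si @ x - 0.5 * mu[k] @ Si @ mu[k] + np.log(P[k])
              for k in range(len(P))]
    return int(np.argmax(scores))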

Jorge S. Marques, IST, 2017 214/279


Example: LDA with 2D data

Consider two classification problems with 2D features described before.


Only the first can be solved by LDA.

In the second case all the training features are classified in the same
class. The model is too rigid.

Jorge S. Marques, IST, 2017 215/279


Table of contents
Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 216/279


Support vector machines

Support vector machines (SVMs) were proposed by Vapnik and


Chervonenkis in 1963 for binary classification problems with linear
decision boundaries. They were extended later for nonlinear decision
boundaries and for regression problems.

Main idea: separate the cloud of data in two regions, using a carefully
chosen hyperplane.

The SVM classifiers are often described in three steps:


I linear classifiers with hard margin
I linear classifiers with soft margin
I non linear classifiers

Suggested report: Fletcher, Support Vector Machines Explained, UCL,


2008.

Jorge S. Marques, IST, 2017 217/279


Hyperplanes

How do we define a hyperplane in Rp ?

x ·w +b =0

where
I x ∈ Rp - point on the hyperplane
I w ∈ Rp - normal vector to the hyperplane
I b ∈ R - offset
I x · w is the inner product between w , x ∈ Rp

distance to the origin: |b| / ‖w‖

note: the parameters w , b are defined up to a scale factor (they require
some kind of normalization).

Jorge S. Marques, IST, 2017 218/279


Linear classifiers

A linear classifier compares each input vector with an hyperplane decision


boundary

x · w + b > 0 ⇒ ŷ = +1
x · w + b < 0 ⇒ ŷ = −1

For the sake of simplicity we assume that y ∈ {−1, +1}.

Therefore,
ŷ = sign(x · w + b)

Main question: how do we learn the hyperplane parameters from the


training data?

Jorge S. Marques, IST, 2017 219/279


Case I: linearly separable data
Training set
T = {(x^(i) , y^(i) ), i = 1, . . . , n} with x^(i) ∈ Rp , y^(i) ∈ {−1, 1}

Let us assume that the training data can be separated without errors by
an hyperplane (linearly separable data). In fact if there is one, there is an
infinite number of separating hyperplanes ...

w ·x +b =0

Problem: which hyperplane should


we choose?

Jorge S. Marques, IST, 2017 220/279


Hard margin

Consider a hyperplane that separates the training data without errors and
is equally distant to the nearest examples of both classes.

The training points closest to the


hyperplane are called support
vectors.

The sum of the two distances from the
support vectors of each class to the
decision hyperplane is the margin.

The hyperplanes parallel to the decision hyperplane, that contain the


support vectors are known as margin hyperplanes.

Jorge S. Marques, IST, 2017 221/279


Margin hyperplanes

Margin hyperplanes: training data on the margin hyperplanes verify

x^(i) · w + b = +1 for y^(i) = +1
x^(i) · w + b = −1 for y^(i) = −1

Margin: 2 / ‖w‖

Constraints: the training data must obey

x^(i) · w + b ≥ +1 for y^(i) = +1
x^(i) · w + b ≤ −1 for y^(i) = −1
⇒ y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i

Jorge S. Marques, IST, 2017 222/279


Maximum margin classifier

The SVM classifier chooses the hyperplane with the maximum margin
(maximum margin classifier).

Difficulty:

The decision hyperplane can be


computed from the support vectors.

However, initially we do not know the support vectors. Their selection


requires the decision hyperplane.

Question: how can we break this tie?

Jorge S. Marques, IST, 2017 223/279


Exercise
Consider the following data sets.

Data set 1:            Data set 2:
x1   x2   y            x1   x2   y
0    3    −1           0    9    −1
0    −3   −1           4    1    −1
4    1    −1           4    4    −1
4    −2   −1           0    0    +1
0    0    −1           0    4    +1
0    −2   +1           1    1    +1
1    1    +1

For each of them,


I plot the data,
I find if it is linearly separable,
I find the support vectors and margin hyperplanes
I find the margin

Jorge S. Marques, IST, 2017 224/279


Exercise (cont)
First data set: it is not linearly separable, i.e., it cannot be separated
by a hyperplane.

Second data set: it is separable by a hyperplane.

Support vectors:
class −1 : (0, 9), (4, 1)
class +1 : (0, 4)

Margin hyperplanes (solving x^(s) · w + b = y^(s) for the support vectors):

[ 0 9 1 ] [ w1 ]   [ −1 ]
[ 4 1 1 ] [ w2 ] = [ −1 ]     ⇒   w = −(1/5) [4, 2]^T ,  b = 13/5 .
[ 0 4 1 ] [ b  ]   [ +1 ]

Margin: 2/‖w‖ = √5

Jorge S. Marques, IST, 2017 225/279


Optimization problem (hard margin)

We break the tie by solving an optimization problem.

We wish to maximize the margin (2/‖w‖) under the constraints described
above. This leads to the following optimization problems

Optimization problem 1: min ‖w‖ , s.t. y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i.

Optimization problem 2: min (1/2)‖w‖² , s.t. y^(i) (x^(i) · w + b) − 1 ≥ 0, ∀i.

This is a quadratic optimization problem with linear constraints.

Jorge S. Marques, IST, 2017 226/279


Lagrangian formulation (primary)

Let us adopt a Lagrangian formulation in order to deal with the


constraints on the training points.

Lagrangian function

LP = (1/2)‖w‖² − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1]
   = (1/2)‖w‖² − Σ_{i=1}^n αi y^(i) (x^(i) · w + b) + Σ_{i=1}^n αi ,

where αi ≥ 0 are Lagrange multipliers.

w , b should be chosen to minimize LP , and αi to maximize it. This is


known as the primary Lagrangian problem.

Jorge S. Marques, IST, 2017 227/279


Lagrangian formulation (dual)

Optimization

∂LP /∂w = 0 ⇒ w = Σ_{i=1}^n αi y^(i) x^(i) ,

The normal vector w is obtained by a linear combination of the training
patterns and only the support vectors contribute.

∂LP /∂b = 0 ⇒ Σ_{i=1}^n αi y^(i) = 0 .

Jorge S. Marques, IST, 2017 228/279


Lagrangian formulation (dual)

Replacing these variables, we obtain the dual formulation,

LD = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi y^(i) (x^(i) · x^(j) ) y^(j) αj ,
     s.t. Σ_{i=1}^n αi y^(i) = 0 ,

where αi ≥ 0 are Lagrange multipliers.

The dual formulation depends only on the inner products between input
vectors x (i) · x (j) . This is very important!

Jorge S. Marques, IST, 2017 229/279


Proof

LP = (1/2)‖w‖² − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1]

LP = (1/2)‖w‖² − Σ_{i=1}^n αi y^(i) x^(i) · w − b Σ_{i=1}^n αi y^(i) + Σ_{i=1}^n αi

Since w = Σ_{i=1}^n αi y^(i) x^(i) and Σ_{i=1}^n αi y^(i) = 0,

LP = Σ_{i=1}^n αi − (1/2)‖w‖²

LP = Σ_{i=1}^n αi − (1/2) ( Σ_{i=1}^n αi y^(i) x^(i) ) · ( Σ_{j=1}^n αj y^(j) x^(j) )

LP = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi y^(i) (x^(i) · x^(j) ) y^(j) αj

Jorge S. Marques, IST, 2017 230/279


Dual Lagrangian problem

max_α Σ_{i=1}^n αi − (1/2) α^T Hα ,  s.t. αi ≥ 0 ∀i ,  Σ_{i=1}^n αi y^(i) = 0 ,

where α = [α1 . . . αn ]^T and Hij = y^(i) (x^(i) · x^(j) ) y^(j) .

This is a convex quadratic programming (QP) problem that can be
solved by standard QP algorithms and provides all the αi .

From the α’s we may obtain:

1 support vectors (S): all x^(s) such that αs > 0;
2 normal vector: w = Σ_{s∈S} αs y^(s) x^(s) ;
3 offset: b = (1/Ns ) Σ_{s∈S} [ y^(s) − Σ_{m∈S} αm y^(m) (x^(m) · x^(s) ) ] ;
4 classification of data: f (x) = sign(x · w + b).

Jorge S. Marques, IST, 2017 231/279


Comments

We note that only support vectors contribute to the estimation of w , b.

Matrix H does not require the training patterns themselves, x (i) , but only
inner products of training vectors x (i) · x (j) .

The SVM algorithm provides not only a decision but also a score
f (x) = x · w + b.

Jorge S. Marques, IST, 2017 232/279


Exercises

1. Prove that if we know αi , i = 1, . . . , n, we can obtain the offset
from the support vectors by

b = (1/Ns ) Σ_{s∈S} [ y^(s) − Σ_{m∈S} αm y^(m) (x^(m) · x^(s) ) ]

2. Which formulation (primary or dual) has more parameters to


optimize?

Jorge S. Marques, IST, 2017 233/279


Case II: data that cannot be separated by an hyperplane

SVMs can be extended to deal with data that is not linearly separable. In
this case it is not possible to classify all the training vectors without
errors, using an hyperplane.

Idea: allow data points on the wrong


side of the margin hyperplane,
provided that they suffer a penalty.
This is known as soft margin.

Jorge S. Marques, IST, 2017 234/279


Soft margin

The idea is to assign a slack (folga) variable ξi to each data point x^(i) ,
defined in such a way that ξi = 0 if no margin violation occurs and ξi > 0
if the ith point is on the wrong side of the margin hyperplane.

Soft margin penalty: C Σ_{i=1}^n ξi

All the training examples on the wrong side of the margin are
considered as support vectors since they influence the decision boundary.

The constraints can be written as follows

x^(i) · w + b ≥ +1 − ξi for y^(i) = +1
x^(i) · w + b ≤ −1 + ξi for y^(i) = −1
⇒ y^(i) (x^(i) · w + b) − 1 + ξi ≥ 0, ∀i

with ξi ≥ 0.

Jorge S. Marques, IST, 2017 235/279


Optimization problem (soft margin)

Optimization problem:

min (1/2)‖w‖² + C Σ_{i=1}^n ξi  s.t. y^(i) (x^(i) · w + b) − 1 + ξi ≥ 0, ∀i.

Lagrangian function

LP = (1/2)‖w‖² + C Σ_{i=1}^n ξi − Σ_{i=1}^n αi [y^(i) (x^(i) · w + b) − 1 + ξi ]
     − Σ_{i=1}^n µi ξi

where αi , µi ≥ 0 are Lagrange multipliers.

w , b and ξi should be chosen to minimize LP , and αi , µi to maximize it.
Jorge S. Marques, IST, 2017 236/279


Dual Lagrangian problem

max_α Σ_{i=1}^n αi − (1/2) α^T Hα ,  s.t. 0 ≤ αi ≤ C ∀i ,  Σ_{i=1}^n αi y^(i) = 0

where α = [α1 . . . αn ]^T and Hij = y^(i) (x^(i) · x^(j) ) y^(j) .

This is a convex quadratic programming (QP) problem that can be
solved by standard QP algorithms and provides all the αi .

The classifier parameters, w , b, are obtained the same way as before.

Jorge S. Marques, IST, 2017 237/279


Example - linearly separable data
This example shows a separable data set classified with hard margin (left)
and soft margin (center, right).

[Figure: hard margin; soft margin (C=10); soft margin (C=0.1).
Support vectors are identified with a circle.]

The soft margin classifier with a large C is equal to the hard margin classifier.
Jorge S. Marques, IST, 2017 238/279
Example - data not linearly separable

This example shows data not linearly separable, classified with soft
margin.

[Figure: soft margin (C=10); soft margin (C=0.1).]

The choice of C controls the margin width.

Jorge S. Marques, IST, 2017 239/279


Hinge loss

The slack variables can be obtained by using the hinge loss

ξi = max( 0, 1 − y^(i) (x^(i) · w + b) )

Therefore, the linear SVM with soft margin minimizes

Σ_{i=1}^n max( 0, 1 − y^(i) (x^(i) · w + b) ) + λ‖w‖²

with λ = 1/(2C ).
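This unconstrained form can be minimized directly, e.g., by subgradient
descent; a minimal sketch (the per-pattern treatment of the regularizer is an
illustrative simplification):

import numpy as np

def linear_svm_sgd(X, y, lam=0.01, eta=0.01, epochs=100, seed=0):
    # subgradient descent on sum_i max(0, 1 - y_i (x_i . w + b)) + lam ||w||^2
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p); b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated: hinge term active
                w -= eta * (2 * lam * w - y[i] * X[i])
                b -= eta * (-y[i])
            else:                              # only the regularizer contributes
                w -= eta * 2 * lam * w
    return w, b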

Jorge S. Marques, IST, 2017 240/279


Case III: non linear SVM

Linear SVMs classify data using an hyperplane trained with hard margin
or soft margin. This is too restrictive, especially when the dimension of
input space is low.

Jorge S. Marques, IST, 2017 241/279


Transformed space

Many problems require a decision boundary with curvature which cannot


be synthesized by a linear SVM.

Linear classifiers work better in higher dimensional spaces. Therefore, one


strategy consists of mapping the data from the original input space into a
high dimension space (feature space)

x̃ = φ(x)

where the data can be separated by an hyperplane. φ is a nonlinear map.

Questions: can SVMs be extended to these high dimension feature


spaces? or will they become unfeasible?

Jorge S. Marques, IST, 2017 242/279


The kernel trick

The linear SVM algorithm (dual formulation) does not require the input
vectors x (i) but only inner products between them x (i) · x (j) .

This means that we do not need to compute the feature vectors


(transformed input vectors) x̃ (i) = φ(x (i) ) but only their inner products
φ(x (i) ) · φ(x (j) )

The good news is that we can compute these inner products using a
kernel function

k(x^(i) , x^(j) ) = φ(x^(i) ) · φ(x^(j) ),

that can be computed in the low dimension input space.

The non linear SVM can be trained and tested using low dimension data,
by replacing the inner products by the kernel.

Jorge S. Marques, IST, 2017 243/279


Typical kernels

The most common choices are:

linear: k(x^(i) , x^(j) ) = x^(i)T x^(j)

rbf: k(x^(i) , x^(j) ) = e^{−(1/2σ²)‖x^(i) − x^(j)‖²}

polynomial: k(x^(i) , x^(j) ) = (x^(i)T x^(j) + a)^b

The linear kernel is the one adopted in linear SVM.

We note that some kernels depend on hyperparameters that have to be


specified or learned during the training phase e.g., typically by ad hoc
procedures or by cross validation.
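For instance, with scikit-learn (assuming that library is available; the data
below is an illustrative toy problem):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)                 # two Gaussian blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# rbf kernel: gamma plays the role of 1/(2 sigma^2); C is the soft margin penalty
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print(len(clf.support_vectors_))               # support vectors found by the QP solver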

Jorge S. Marques, IST, 2017 244/279


Examples - SVM
Two examples (training data) solved by Matlab (function svmtrain) with
rbf kernel. Support vectors are identified with a circle.
[Figure: two scatter plots (classes 0 and 1) with the learned decision
boundaries; support vectors are marked with circles.]

Jorge S. Marques, IST, 2017 245/279


Extension

How can we solve multi-class classification problems with SVMs?

Jorge S. Marques, IST, 2017 246/279


Machine Learning

Linear Regression

Regularization

Optimization

Evaluation & Generalization

Neural networks

Data classification

Linear classifiers

Support vector machines

Decision Trees and Random Forest

Jorge S. Marques, IST, 2017 247/279


Decision trees

Decision trees are popular classifiers since they make it possible to
understand why the input pattern is classified in a specific class.

This is important in several applications (e.g., medical diagnosis).

Decision trees are often used when the features are categorical, although
they have been extended to numerical features as well.

The next slides address:
I how is categorical data classified by a tree?
I how is the tree trained?
I how can trees be extended to numerical features?

Jorge S. Marques, IST, 2017 248/279


Example: Good days to play tennis

Suppose we wish to predict what are the good days to play tennis and we
have the following dataset associated to a player (John).

Day Outlook Humidity Wind Play


1 Sunny High Weak No
2 Sunny High Strong No
3 Overcast High Weak Yes
4 Rain High Weak Yes
5 Rain Normal Weak Yes
6 Rain Normal Strong No
7 Overcast Normal Strong Yes
8 Sunny High Weak No
9 Sunny Normal Weak Yes
10 Rain Normal Weak Yes
11 Sunny Normal Strong Yes
12 Overcast High Strong Yes
13 Overcast Normal Weak Yes
14 Rain High Strong No
adapted from Quinlan, 1986

Days with the same attributes may have different outcomes (noisy labels).
Jorge S. Marques, IST, 2017 249/279
A decision tree
A decision tree that solves the problem is

[Figure: root tests Outlook; Sunny → test Humidity (High → No, Normal → Yes);
Overcast → Yes; Rain → test Wind (Weak → Yes, Strong → No).]

adapted from Quinlan, 1986

This tree contains three types of nodes:
I one root node;
I splitting nodes;
I leaf nodes (associated to labels).
Each splitting node is associated with a question.
Jorge S. Marques, IST, 2017 250/279
What training data is associated to each node?

Each node has a subset of training examples associated to it.

node 3
Day Outlook Humidity Wind Play
3 Overcast High Weak Yes
7 Overcast Normal Strong Yes pure subset!
12 Overcast High Strong Yes
13 Overcast Normal Weak Yes
node 2 node 4
Day Outlook Humidity Wind Play Day Outlook Humidity Wind Play
1 Sunny High Weak No 4 Rain High Weak Yes
2 Sunny High Strong No 5 Rain Normal Weak Yes
8 Sunny High Weak No 6 Rain Normal Strong No
9 Sunny Normal Weak Yes 10 Rain Normal Weak Yes
11 Sunny Normal Strong Yes 14 Rain High Strong No

If the training examples associated to a node have the same label, the
node is called pure. Pure nodes are not split anymore and receive a label.
Jorge S. Marques, IST, 2017 251/279
What training data is associated to each node? (2)

node 5 node 7
Day Outlook Humidity Wind Play Day Outlook Humidity Wind Play
9 Sunny Normal Weak Yes 4 Rain High Weak Yes
11 Sunny Normal Strong Yes 5 Rain Normal Weak Yes
10 Rain Normal Weak Yes
node 6
Day Outlook Humidity Wind Play node 8
1 Sunny High Weak No Day Outlook Humidity Wind Play
2 Sunny High Strong No 6 Rain Normal Strong No
8 Sunny High Weak No 14 Rain High Strong No
both subsets are pure! both subsets are pure!

Jorge S. Marques, IST, 2017 252/279


Classification of new data

Classification of new data is done in the same way. Given a feature vector
x, we travel along the tree based on the feature values until we reach a
leaf (with a label).

There will be classification errors not only in the test set but also in the
training set, if there are examples with the same attributes and different
labels. This is known as noisy labels.

Jorge S. Marques, IST, 2017 253/279


Posterior distribution of classes at each node

Consider a training set T = {(x^(1) , y^(1) ), . . . , (x^(n) , y^(n) )}. A decision
tree associates each training pattern x^(i) to a node m through a sequence
of questions.

We can estimate the a posteriori distribution of the labels associated to
each tree node m, using the training data

P(k|m) = (1/#Tm ) Σ_{x^(i) ∈ Tm} I (y^(i) = k)

where Tm is the set of training patterns associated to node m and I (.) is
the indicator function (the indicator function is 1 if the argument is true
and 0 otherwise).

If node m is a leaf, the most probable label is

k̂(m) = arg max_k P(k|m)

Jorge S. Marques, IST, 2017 254/279


Exercise

Consider the tennis data set and the decision tree shown above. Find the
probability of each label (Yes or No), at each node.

Jorge S. Marques, IST, 2017 255/279


Exercise (cont.)

node m  #Yes  #No  P(Yes|m)  k̂(m)
1       9     5    9/14      Yes
2       2     3    2/5       No
3       4     0    1         Yes
4       3     2    3/5       Yes
5       0     3    0         No
6       2     0    1         Yes
7       0     2    0         No
8       3     0    1         Yes

Jorge S. Marques, IST, 2017 256/279


Node impurity
Ideally, each leaf m should be pure i.e., all the training examples arriving
at node m should have the same label (class). Since this is not always
true we need a measure of impurity.

Several impurity measures have been proposed. They all achieve a


minimum if all the data associated to a leaf comes from a single class.

I Misclassification error:

i(m) = 1 − max_k P(k|m) = 1 − P(k̂(m)|m)

I Entropy:

i(m) = − Σ_{k=1}^K P(k|m) log2 P(k|m)

I Gini index:

i(m) = Σ_{k=1}^K P(k|m) (1 − P(k|m))

Jorge S. Marques, IST, 2017 257/279


Node impurity (binary case)
In binary classification problems (2 labels), the impurity depends on a
single statistic P(k = 1|m). Figure shows the misclassification error
(red), the Gini index (green), and the entropy (blue) as a function of
P(k = 1|m). The vertical scale was modified.

The entropy and Gini index are smoother and they are usually preferred
in model training.

Jorge S. Marques, IST, 2017 258/279


Tree training

Given a data set, we wish to learn a tree T . Each splitting node


corresponds to a question and each leaf corresponds to a label.

Training a tree amounts to minimizing the tree impurity

I (T ) = Σ_{m∈T̃} P(m) i(m)

where P(m) is the fraction of training patterns associated to leaf m and
T̃ is the set of all the leaf nodes.

The tree impurity is an average impurity of the leaf nodes.

Training a tree amounts to finding a tree that minimizes I (T ). We


should generate all the tree configurations, compute the tree impurity for
each configuration, and choose the one with smallest impurity.



Drawbacks

This approach has two drawbacks:

1. The optimal solution cannot be found: exhaustive search of all tree
configurations is not feasible, so greedy approaches are used instead.

2. The criterion I(T) optimizes the performance on the training set, but
this is highly optimistic and leads to overfitting.



Tree growing

To overcome the first difficulty, we start with a single node m (the root)
and choose the best attribute for splitting it. This is done as follows.

For each attribute X_j, we split m and create child nodes s ∈ S, each
of them associated to a different value of the attribute. We compute the
impurity of each child and the impurity drop with respect to the impurity
of node m

$$\Delta I = i(m) - \sum_{s \in S} \frac{p(s)}{p(m)}\, i(s)$$

The attribute that achieves the greatest drop is selected.

The splitting process is repeated for another leaf node until a stop
condition is met, for example, until all the leaves are pure or all the
attributes have been tested.
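A sketch of this computation for a single categorical attribute, with the
entropy as impurity (the helper names and the encoding of the tennis data
are ours):

    import math
    from collections import Counter

    def H(labels):
        """Entropy of a list of labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def impurity_drop(values, labels):
        """Delta I = i(m) - sum_s p(s)/p(m) i(s) for one categorical attribute."""
        n = len(labels)
        drop = H(labels)
        for v in set(values):
            subset = [y for x, y in zip(values, labels) if x == v]
            drop -= (len(subset) / n) * H(subset)
        return drop

    # tennis example at the root, attribute Wind (days 1..14)
    wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
            "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    print(impurity_drop(wind, play))   # ~0.048, a small drop for Wind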



ID3 algorithm

The ID3 algorithm is a basic tree learning method for categorical data.

It has the following features:


I impurity criterion: entropy;
I stop criterion: stop when each leaf is pure or, if not, all the attributes
have been tested along the path from the root to the impure leaf.

When the data is noisy (noisy labels or noisy attributes), the ID3
algorithm may overfit the training data, leading to poor performance on
independent data sets (test sets).

This drawback can be alleviated by early stop or by post-processing (tree
pruning).
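For concreteness, a compact recursive sketch of ID3 for categorical data
(our own minimal implementation, not the original code: entropy criterion,
majority-vote leaves when the attributes run out):

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def id3(rows, labels, attributes):
        """rows: list of dicts {attribute: value}. Returns a nested-dict tree."""
        if len(set(labels)) == 1:              # pure node -> leaf
            return labels[0]
        if not attributes:                     # attributes exhausted -> majority leaf
            return Counter(labels).most_common(1)[0][0]

        def gain(a):                           # impurity drop for attribute a
            g = H(labels)
            for v in set(r[a] for r in rows):
                sub = [y for r, y in zip(rows, labels) if r[a] == v]
                g -= (len(sub) / len(labels)) * H(sub)
            return g

        best = max(attributes, key=gain)       # greedy choice of the question
        rest = [a for a in attributes if a != best]
        tree = {best: {}}
        for v in set(r[best] for r in rows):
            sub_rows = [r for r in rows if r[best] == v]
            sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
            tree[best][v] = id3(sub_rows, sub_labels, rest)
        return tree

Applied to the tennis data, this reproduces the tree obtained above
(Outlook at the root, then Humidity and Wind).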



Exercises

1. Apply the ID3 algorithm to the tennis data set.

2. Consider the following data set (vertebrates). Check which of the
attributes is chosen by the ID3 algorithm for the root node (don't
consider the name and the skin cover).

#   name           body temperature  skin cover  gives birth  aquatic creature  aerial creature  has legs  hibernates  class label
1   human          warm-blooded      hair        Yes          No                No               yes       no          mammal
2   python         cold-blooded      scales      No           No                No               no        yes         non-mam
3   salmon         cold-blooded      scales      No           Yes               No               no        no          non-mam
4   whale          warm-blooded      hair        Yes          Yes               No               no        no          mammal
5   frog           cold-blooded      none        No           Semi              No               yes       yes         non-mam
6   Komodo dragon  cold-blooded      scales      No           No                No               yes       no          non-mam
7   bat            warm-blooded      hair        Yes          No                yes              yes       yes         mammal
8   pigeon         warm-blooded      feathers    No           No                yes              yes       no          non-mam
9   cat            warm-blooded      fur         Yes          No                No               yes       no          mammal
10  leopard        warm-blooded      fur         Yes          No                No               yes       no          mammal
11  turtle         cold-blooded      scales      No           semi              No               yes       no          non-mam
12  penguin        warm-blooded      feathers    No           semi              No               yes       no          non-mam
13  porcupine      warm-blooded      quills      Yes          No                No               yes       yes         mammal
14  eel            cold-blooded      scales      No           Yes               No               yes       no          non-mam
15  salamander     cold-blooded      none        No           semi              No               no        yes         non-mam

adapted from Kumar, Introduction to Data Mining, 2014.



Solution of first exercise
Test root

O H W  P
S H W  N
S H S  N
O H W  Y
R H W  Y
R N W  Y
R N S  N
O N S  Y
S H W  N
S N W  Y
R N W  Y
S N S  Y
O H S  Y
O N W  Y
R H S  N

Outlook   N  Y
R         2  3
S         3  2
O         0  4
i(O) = 5/14 × 0.97 + 5/14 × 0.97 + 4/14 × 0 = 0.69

Humidity  N  Y
H         4  3
N         1  6
i(H) = 7/14 × 0.98 + 7/14 × 0.59 = 0.78

Wind      N  Y
S         3  3
W         2  6
i(W) = 6/14 × 1 + 8/14 × 0.81 = 0.89

Best choice for the root is Outlook, and Outlook=Overcast is a pure node with
label Yes.
Solution (cont.)
Node: Outlook=Sunny

H W  P
H W  N
H S  N
H W  N
N W  Y
N S  Y

Humidity  N  Y
H         3  0
N         0  2
Two pure nodes: i(H) = 0

Node: Outlook=Rain

H W  P
H W  Y
N W  Y
N S  N
N W  Y
H S  N

Humidity  N  Y
H         1  1
N         1  2
i(H) = 2/5 × 1 + 3/5 × 0.91 = 0.95

Wind      N  Y
S         2  0
W         0  3
Two pure nodes: i(W) = 0

Best choice for the node Outlook=Sunny is Humidity, which leads to two pure
nodes. Best choice for Outlook=Rain is Wind, which also leads to two pure nodes.
Solution (cont.)

The decision tree we have obtained: Outlook at the root, a pure leaf (Yes)
for Outlook=Overcast, a Humidity test under Outlook=Sunny, and a Wind test
under Outlook=Rain. [Figure: the final decision tree]



When to stop? early stop

I The impurity of the tree drops or remains constant every time a
node is split. In the limit, we can grow the tree until each leaf is pure
or all the attributes along that path have been used. This approach
leads to overfitting.

I A second approach consists of using a validation technique. The tree
is grown using a subset of the training data (70%) and evaluated using
the remaining patterns (30%) (the validation set).

I Another strategy consists of growing the tree while the impurity
drop is above a threshold, ∆I > β.

I Another approach is based on a regularization criterion

$$I(T) + \alpha \tilde{N}$$

where Ñ is the number of leaf nodes.


Example - exclusive OR

Consider a toy problem:

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0

I would you be able to predict y with a tree classifier?

I compute the impurity drop for the splitting of the root node, using
the entropy criterion. What do you conclude?
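As a quick numerical check for the second question (a sketch with our own
entropy helper):

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    x1 = [0, 0, 1, 1]
    x2 = [0, 1, 0, 1]
    y  = [0, 1, 1, 0]

    for name, x in [("x1", x1), ("x2", x2)]:
        drop = H(y)
        for v in set(x):
            sub = [yi for xi, yi in zip(x, y) if xi == v]
            drop -= (len(sub) / len(y)) * H(sub)
        print(name, drop)   # 0.0 for both attributes

The impurity drop is zero for both splits, although a depth-2 tree predicts
y perfectly: greedy growing has no look-ahead.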



Drawbacks

I Early stop is not a good strategy to train the model because it
suffers from a lack of sufficient look-ahead.

I It is better to grow the tree until the leaves are pure or all attributes
have been used, and then prune the tree.

I Instability: a small change in the training patterns may lead to big
changes of the decision boundaries.



Tree pruning

Tree pruning is a tree simplification that aims to improve the performance
of the tree. The performance is usually measured as the number of
classification errors in the validation set.

Pruning usually involves three steps:

I try several simplifications
I evaluate each of them
I choose the best

The process is repeated until no further improvement can be achieved.



Subtree replacement

Test all the splitting nodes in a bottom-up way. For each tested node,
remove its descendants (subtree) and replace the splitting node by a leaf.
The change is accepted if the modified tree has a better or equal
performance (number of errors in the validation set).
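A sketch of this procedure on the nested-dict trees of the ID3 sketch above
(a leaf is a label; a splitting node is {attribute: {value: subtree}}). The
recursion routes the validation examples down the tree; taking the
replacement leaf as the majority validation label is a simplification of ours:

    from collections import Counter

    def predict(tree, row, default):
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(row.get(attr), default)
        return tree

    def n_errors(tree, rows, labels, default):
        return sum(predict(tree, r, default) != y for r, y in zip(rows, labels))

    def prune(tree, rows, labels, default):
        """Bottom-up subtree replacement, evaluated on validation data."""
        if not isinstance(tree, dict) or not labels:
            return tree
        attr = next(iter(tree))
        for v in tree[attr]:                   # prune the descendants first
            idx = [i for i, r in enumerate(rows) if r.get(attr) == v]
            tree[attr][v] = prune(tree[attr][v],
                                  [rows[i] for i in idx],
                                  [labels[i] for i in idx], default)
        leaf = Counter(labels).most_common(1)[0][0]
        if n_errors(leaf, rows, labels, default) <= n_errors(tree, rows, labels, default):
            return leaf                        # accept: equal or better performance
        return tree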
Error estimation

When there is no validation set (too few training examples), the error in
the validation set is predicted by using a pessimistic estimator given by

$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
1+ N
where
I z - parameter that depends on the confidence degree c (if c = 25%,
z = 0.69)
I f - percentage of error in the training set
I N - number of training examples in the leaf



Can trees be applied with numerical features?

[Figure: two-dimensional data with three classes (colors)]



Yes! using thresholds

Threshold values have to be estimated during the tree growing process,
usually by exhaustive search: all threshold values are considered for each
feature and the best impurity drop is selected.
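A sketch of this search for one numerical feature, scoring each candidate
threshold (the midpoint between consecutive sorted values) by its entropy
drop; the names are ours:

    import math
    from collections import Counter

    def H(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Exhaustive search over the candidate thresholds of one feature."""
        pairs = sorted(zip(values, labels))
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        n = len(ys)
        best_t, best_drop = None, -1.0
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue                       # no threshold between equal values
            t = (xs[i] + xs[i - 1]) / 2
            drop = H(ys) - (i / n) * H(ys[:i]) - ((n - i) / n) * H(ys[i:])
            if drop > best_drop:
                best_t, best_drop = t, drop
        return best_t, best_drop

    print(best_threshold([1.0, 3.0, 2.0, 4.0], ["a", "b", "a", "b"]))  # (2.5, 1.0)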



Bootstrap aggregation (bagging) for regression

Bootstrap aggregation, also called bagging, is an ensemble method that
can be used to improve the performance of regressors and classifiers.

Consider a regression problem with a training set of n independent and
identically distributed (iid) patterns, drawn from a distribution P

$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$

Using the previous methods, we can learn a function f(x) under uncertainty.

One way to improve the estimate f(x) would be to consider multiple
training sets.



Multiple training sets

Consider B training sets generated from the same (ideal) distribution P

$$T^{(1)} = \{(x^{(1,1)}, y^{(1,1)}), \ldots, (x^{(1,n)}, y^{(1,n)})\}$$
$$\vdots$$
$$T^{(B)} = \{(x^{(B,1)}, y^{(B,1)}), \ldots, (x^{(B,n)}, y^{(B,n)})\}$$

This allows us to estimate B regression functions f^(1)(x), ..., f^(B)(x),
each from a different training set. These functions can be combined
(aggregated) by averaging

$$\hat{f}(x) = \frac{1}{B} \sum_{i=1}^{B} f^{(i)}(x)$$

to reduce the uncertainty.



Bootstrap

There is only one difficulty: we do not know the ideal distribution P. All
we know is the first data set T and the empirical distribution computed
from it.

The trick consists of generating the multiple data sets T^(i) using the
Bootstrap method, i.e., by sampling the set T, n times, with replacement.

Of course, we can no longer claim that the T^(i) are statistically
independent, but the technique still improves the estimation of f.
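Putting the pieces together for regression: a sketch with numpy for the
bootstrap sampling and sklearn decision-tree regressors as base models
(the base model and the toy data are our choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(200)

    B = 50
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # sample T with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

    x_new = np.array([[1.0]])
    f_hat = np.mean([m.predict(x_new)[0] for m in models])  # aggregate by averaging
    print(f_hat)   # close to sin(1) ~ 0.84, with reduced variance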



Bagging in classification problems
Let us consider a classification problem with a training set

$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$

where the outcomes y^(i) belong to a finite set of labels {0, ..., K − 1}.

Given a new input pattern x, the trained classifier produces a vector of
probability estimates [P(y = 0|x), ..., P(y = K − 1|x)], subject to
estimation errors.

The Bagging algorithm is as follows:

1. generate B training sets T^(i) from T by bootstrap, i.e., by sampling
T with replacement.
2. train a classifier from each set T^(i) and compute the a posteriori
distributions [P^(i)(y = 0|x), ..., P^(i)(y = K − 1|x)].
3. aggregate all the estimates

$$\hat{P}(y = k \mid x) = \frac{1}{B} \sum_{i=1}^{B} P^{(i)}(y = k \mid x)$$
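A sketch of the three steps, aggregating the predict_proba outputs of
sklearn decision trees on the iris data (both choices are ours):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)

    B = 25
    probas = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))           # 1. bootstrap T^(i)
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])   # 2. train on T^(i)
        probas.append(clf.predict_proba(X[:5]))              #    P^(i)(y = k|x)
    p_hat = np.mean(probas, axis=0)                          # 3. aggregate
    print(p_hat)                  # bagged estimates of P(y = k|x)
    print(p_hat.argmax(axis=1))   # the corresponding class decisions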


Random forest
Random forest is a very simple and yet very powerful classifier. It
achieves state-of-the-art results in many problems.

The algorithm is based on an ensemble of tree classifiers trained with
bagging.

The Random forest algorithm:

1. generate B training sets T^(i) from T by bootstrap, i.e., by sampling
T with replacement.
2. train a tree classifier from each set T^(i) with the following
modification: randomly select a subset of the features at each node, and
only those features are candidates for splitting. This procedure is
known as random subspace. The percentage of feature candidates at
each node is a parameter of the algorithm.
3. compute the a posteriori distributions
[P^(i)(y = 0|x), ..., P^(i)(y = K − 1|x)] for each tree.
4. aggregate all the a posteriori distributions

$$\hat{P}(y = k \mid x) = \frac{1}{B} \sum_{i=1}^{B} P^{(i)}(y = k \mid x)$$
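In practice the algorithm is available off the shelf. A minimal usage sketch
with scikit-learn, where max_features sets the size of the random subspace
tried at each node:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # B = 100 bootstrapped trees; sqrt(p) features are candidates at each split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0).fit(X_tr, y_tr)
    print(forest.score(X_te, y_te))           # accuracy on the test set
    print(forest.predict_proba(X_te[:3]))     # aggregated a posteriori estimates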