Machine Learning Slides
Jorge S. Marques
Linear Regression
Regularization
Optimization
Neural networks
Data classification
Linear classifiers
Examples:
- motion of a rocket → Newton's law: $m\ddot{x}(t) = F(t)$
- electromagnetic waves → Maxwell's equations
... but other problems are more complex and cannot be tackled with
closed form expressions.
How can we use this information to predict the outcome for a new
patient?
Jorge S. Marques, IST, 2017 4/279
What is Machine Learning?
”the field of study that gives computers the ability to learn without being
explicitly programmed.” (Arthur Samuel, 1959)
The economist
Applications
- prediction
- time series analysis
- speech recognition: conversion of the speech signal into text
- machine translation
- detection of failures
- image denoising
- human activity recognition
- medical image analysis, e.g., cancer detection in images
- robot navigation
- self-driving cars
Visual recognition - AlexNet (2012)
AlexNet, 2012 (Krizhevsky, Sutskever, Hinton)
[Figure: data table and scatter plot of the example data]
Given the table with the length and volume of tuna and swordfish (class),
we wish to design a system that predicts the class.
- features
- outcome
- predictor
$$\hat{y} = f(x)$$
$$y \in \Omega, \qquad \Omega = \{\omega_0, \dots, \omega_{K-1}\}$$
This diagram does not cover the choice of features and their
extraction, e.g., which features to extract from the signal in an image
or speech analysis problem. This issue is application dependent and
will not be considered.
Main questions
These questions have multiple answers, which we will discuss throughout
this course.
Data sets are important tools to train and evaluate machine learning
systems. They make it possible to compare different techniques and they
often foster the development of new methods.
There are many sites with data sets. One example is:
https://archive.ics.uci.edu/ml/datasets.html
Wikimedia
[Figure: two scatter plots of the data, feature 1 vs. feature 2 and feature 3 vs. feature 4]
Please note that the scale is not the same on both axes.
The NN method can be extended to take into account not one but k
nearest neighbors of x.
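A minimal sketch of the k nearest neighbor rule, assuming Euclidean distance and a majority vote among the k closest training examples. The function name `knn_predict` and the (length, volume) measurements are hypothetical and only serve as an illustration.

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    # Sort the training pairs (features, label) by squared Euclidean distance to x.
    by_distance = sorted(
        train, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x))
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (length, volume) measurements, labelled by species.
train = [((60, 220), "tuna"), ((70, 250), "tuna"), ((80, 300), "tuna"),
         ((150, 400), "swordfish"), ((170, 520), "swordfish"), ((180, 640), "swordfish")]

print(knn_predict(train, (75, 260), k=3))
```

With k = 1 this reduces to the plain nearest neighbor (NN) method.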
[Figure: data table and scatter plot of the example data]
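We wish to predict the price of a flat in Lisbon, taking its area into
account.
$$\hat{y} = f(x) = \beta_0 + \beta_1 x$$
With two training examples:
$$\hat{y}^{(1)} = \beta_0 + \beta_1 x^{(1)}, \qquad \hat{y}^{(2)} = \beta_0 + \beta_1 x^{(2)}$$
→ unique solution, but bad (noisy).
With three training examples:
$$\hat{y}^{(1)} = \beta_0 + \beta_1 x^{(1)}, \qquad \hat{y}^{(2)} = \beta_0 + \beta_1 x^{(2)}, \qquad \hat{y}^{(3)} = \beta_0 + \beta_1 x^{(3)}$$
→ no solution in general (three equations, two unknowns).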
To solve this problem, we must assume that there is an error between the
output of the model ŷ (i) and the data y (i) .
Model fit is achieved by minimizing the total loss in the training set,
i.e., the sum of squared errors (SSE):
$$\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left(y^{(i)} - \beta_0 - \beta_1 x^{(i)}\right)^2 .$$
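The minimum is achieved at a point $(\hat\beta_0, \hat\beta_1)$ such that the partial
derivatives are zero (the gradient vector is the null vector)
$$\nabla_\beta \mathrm{SSE} = \begin{bmatrix} \dfrac{\partial \mathrm{SSE}}{\partial\beta_0} & \dfrac{\partial \mathrm{SSE}}{\partial\beta_1} \end{bmatrix}^T = 0,$$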
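This leads to
$$\begin{cases} \dfrac{\partial \mathrm{SSE}}{\partial\beta_0} = 0 \\[2mm] \dfrac{\partial \mathrm{SSE}}{\partial\beta_1} = 0 \end{cases} \;\Rightarrow\; \begin{cases} -2\sum_{i=1}^{n} \left(y^{(i)} - \hat\beta_0 - \hat\beta_1 x^{(i)}\right) = 0 \\[2mm] -2\sum_{i=1}^{n} \left(y^{(i)} - \hat\beta_0 - \hat\beta_1 x^{(i)}\right) x^{(i)} = 0. \end{cases}$$
Rearranging the terms,
$$\begin{cases} \sum_{i=1}^{n} \hat\beta_0 + \sum_{i=1}^{n} \hat\beta_1 x^{(i)} = \sum_{i=1}^{n} y^{(i)} \\[2mm] \sum_{i=1}^{n} \hat\beta_0 x^{(i)} + \sum_{i=1}^{n} \hat\beta_1 x^{(i)} x^{(i)} = \sum_{i=1}^{n} y^{(i)} x^{(i)}. \end{cases}$$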
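The two equations above form a 2×2 linear system in the estimates of β0 and β1. A minimal sketch of its solution, assuming Cramer's rule is acceptable for such a small system; the function name `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for scalar x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # System:  n  * b0 + sx  * b1 = sy
    #          sx * b0 + sxx * b1 = sxy
    det = n * sxx - sx * sx          # zero when all x values are equal
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Toy data lying exactly on the line y = 1 + 2x.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```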
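$$x' \leftarrow x - \bar{x}, \qquad y' \leftarrow y - \bar{y},$$
Minimize
$$\mathrm{SSE} = \sum_{i=1}^{n} \left(y'^{(i)} - \beta'_0 - \beta'_1 x'^{(i)}\right)^2$$
$$\frac{\partial \mathrm{SSE}}{\partial\beta'_0} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n} \left(y'^{(i)} - \hat\beta'_0 - \hat\beta'_1 x'^{(i)}\right) = 0$$
$$\sum_{i=1}^{n} y'^{(i)} - n\hat\beta'_0 - \hat\beta'_1 \sum_{i=1}^{n} x'^{(i)} = 0$$
$$0 - n\hat\beta'_0 - 0 = 0$$
$$\hat\beta'_0 = 0$$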
ŷ = β0 + β1 x1 + · · · + βp xp .
Consider a training set $T = \{(x^{(i)}, y^{(i)}),\ i = 1, \dots, n\}$, where $x^{(i)} \in \mathbb{R}^p$
and $y^{(i)} \in \mathbb{R}$.
cost function: $\mathrm{SSE} = \|y - X\beta\|^2$
normal equations: $(X^T X)\hat\beta = X^T y$
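The proof of the normal equations requires two properties of the gradient.
Useful properties:
inner product: $\nabla_x (b^T x) = b$, $b \in \mathbb{R}^p$,
quadratic form: $\nabla_x (x^T M x) = (M + M^T)x$, $M \in \mathbb{R}^{p\times p}$.
Expanding the cost function,
$$\begin{aligned} \mathrm{SSE} &= (y - X\beta)^T (y - X\beta) \\ &= y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta && \text{(distributive property)} \\ &= y^T y - 2y^T X\beta + \beta^T X^T X\beta && \text{(transpose property)}. \end{aligned}$$
Setting the gradient to zero,
$$\nabla_\beta \mathrm{SSE} = -2X^T y + 2X^T X\beta = 0,$$
we conclude
$$(X^T X)\hat\beta = X^T y .$$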
The inverse of the matrix $X^T X$ may not exist, for two main reasons:
- small amount of data, e.g., fewer data points than features;
- redundant (linearly dependent) features, e.g., duplicated features.
Model training
normal equations: $(X^T X)\hat\beta = X^T y$, i.e., $\hat\beta = (X^T X)^{-1} X^T y$ when the inverse exists
Prediction
new data: $f(x_0) = [1 \;\; x_0^T]\,\hat\beta$
The figure shows the least squares fit of a linear model (straight line) to the
flat data.
[Figure: flat data table and scatter plot with the fitted straight line]
$$f(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p .$$
This can be considered a linear model whose features are the powers of
x (scalar).
How should we choose the best order? Is the SSE a good criterion?
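$$w = (X^T X)^{-1} X^T y ,$$
where X is given by
$$X = \begin{bmatrix} G_1(x^{(1)}) & G_2(x^{(1)}) & \dots & G_p(x^{(1)}) \\ \vdots & \vdots & & \vdots \\ G_1(x^{(n)}) & G_2(x^{(n)}) & \dots & G_p(x^{(n)}) \end{bmatrix},$$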
(centroids are equally spaced instead of being computed from the data)
$$y_k = X\beta_k + w_k , \quad k = 1, \dots, K .$$
The SSE for the multiple outputs is the sum of the SSE for each output.
$$\mathrm{SSE} = \sum_{k=1}^{K} \mathrm{SSE}_k(\beta_k).$$
$$(X^T X)\hat\beta_k = X^T y_k .$$
In matrix form,
$$Y = X\beta + W,$$
leads to
$$\hat\beta = (X^T X)^{-1} X^T Y .$$
This is equivalent to independently solving each of the K least squares
problems sharing the same design matrix X.
1. Consider a training set $T = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$. Write the
normal equations for a least squares fit of a second order polynomial
model (x scalar) to the training data.
We removed the mean x̄, ȳ from the data in order to make β0 = 0. The
vector β does not include β0 and X does not include a column of ones.
(X T X )β̂ = X T y .
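Example
Suppose we wish to estimate the model $\hat{y} = x_1\beta_1 + x_2\beta_2$ from a single example.
x1  x2  y
1   1   3
This gives a single equation with two unknowns,
$$\beta_1 + \beta_2 = 3,$$
which has infinitely many solutions.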
where $\|\cdot\|$ denotes the Euclidean norm. The new term $\|\beta\|^2$ penalizes the
use of large coefficients and is called a regularization term. This
criterion aims to represent the data while keeping the coefficients small.
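Ridge regression can be solved by computing the gradient vector and
setting it to zero, leading to
$$\hat\beta_{ridge} = (X^T X + \lambda I)^{-1} X^T y .$$
Cost function
$$\begin{aligned} E_{Ridge} &= \|y - X\beta\|^2 + \lambda\|\beta\|^2 \\ &= (y - X\beta)^T (y - X\beta) + \lambda\beta^T\beta \\ &= y^T y - 2y^T X\beta + \beta^T X^T X\beta + \lambda\beta^T\beta \\ &= y^T y - 2y^T X\beta + \beta^T (X^T X + \lambda I)\beta. \end{aligned}$$
Setting the gradient to zero, $\nabla_\beta E_{Ridge} = -2X^T y + 2(X^T X + \lambda I)\beta = 0$,
we conclude
$$(X^T X + \lambda I)\hat\beta_{ridge} = X^T y .$$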
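The eigenvalues and eigenvectors of the ridge matrix satisfy
$$(X^T X + \lambda I - \lambda_{ridge} I)\, v_{ridge} = 0$$
Comparing with the least squares case, $(X^T X - \lambda_{ls} I)\, v_{ls} = 0$,
$$\lambda_{ridge} = \lambda_{ls} + \lambda, \qquad v_{ridge} = v_{ls}$$
Conclusion:
The eigenvectors are equal and the eigenvalues are shifted by λ. If λ > 0
the eigenvalues of the ridge matrix are positive and the matrix is
nonsingular.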
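As an illustration, consider the single-example problem seen earlier (x = (1, 1), y = 3). There $X^T X = [[1, 1], [1, 1]]$ is singular, while $X^T X + \lambda I$ is invertible for any λ > 0. The sketch below solves the 2×2 ridge system with Cramer's rule; the helper names `solve_2x2` and `ridge_single_example` are hypothetical.

```python
def solve_2x2(a, b, c, d, r1, r2):
    """Solve [[a, b], [c, d]] [u, v]^T = [r1, r2]^T by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix")
    return ((r1 * d - b * r2) / det, (a * r2 - r1 * c) / det)

def ridge_single_example(lam):
    # X = [[1, 1]], y = [3]  =>  X^T X = [[1, 1], [1, 1]],  X^T y = [3, 3]
    return solve_2x2(1 + lam, 1, 1, 1 + lam, 3, 3)

print(ridge_single_example(1.0))   # both coefficients equal 3 / (2 + lam)
```

With λ = 0 the call raises `ValueError`, which reflects the fact that the least squares problem has no unique solution here.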
where the number of non-zero coefficients is often called the "$\ell_0$ norm", $\|\cdot\|_0$,
although it does not satisfy the axioms of a norm.
The lasso estimate $\beta_{lasso}$ is often a sparse vector of coefficients where less
important features receive a zero coefficient.
[Figure: two plots with λ on the horizontal axis]
Non-centered data
How should we proceed if the training data
$T = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$ are not centered?
$$\beta'_{ls} = s^{-1}\beta_{ls} .$$
LS predictor: $\hat{y}' = x'^T\beta'_{ls} = s\,x^T s^{-1}\beta_{ls} = x^T\beta_{ls} = \hat{y}$ is invariant
under scaling.
LS: $\hat\beta_{ls} = (X^T X)^{-1} X^T y = \frac{1}{n} X^T y$
Ridge: $\hat\beta_{ridge} = \frac{n}{n+\lambda}\,\hat\beta_{ls}$
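Lasso: $\min_\beta\ (y^T y - 2y^T X\beta + \beta^2 X^T X) + \lambda|\beta|$
Hypothesis: $\hat\beta > 0$. Using $X^T X = n$ and $X^T y = n\hat\beta_{ls}$,
$$\frac{d(\dots)}{d\beta} = 0 \;\Rightarrow\; -2n\hat\beta_{ls} + 2n\hat\beta_{lasso} + \lambda = 0$$
$$\hat\beta_{lasso} = \hat\beta_{ls} - \frac{\lambda}{2n}$$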
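A small sketch comparing the two shrinkage rules in this scalar setting. It assumes $X^T X = n$ as above and extends the lasso result beyond the hypothesis of a positive estimate with the usual soft-threshold rule (the estimate is set to zero when the LS magnitude is below λ/(2n)), which is an assumption not derived in the text. The function name is hypothetical.

```python
def ls_ridge_lasso(beta_ls, lam, n):
    """Return (ridge, lasso) estimates from the scalar LS estimate."""
    ridge = n / (n + lam) * beta_ls                   # multiplicative shrinkage
    shrink = max(abs(beta_ls) - lam / (2 * n), 0.0)   # subtractive shrinkage
    lasso = shrink if beta_ls >= 0 else -shrink
    return ridge, lasso

print(ls_ridge_lasso(1.0, 2.0, 4))   # large coefficient: both are shrunk
print(ls_ridge_lasso(0.1, 2.0, 4))   # small coefficient: lasso sets it to zero
```

Ridge scales every coefficient by the same factor, whereas lasso subtracts a constant and can produce exact zeros, which is consistent with the sparsity of the lasso estimate.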
$$\mathrm{SSE} = \|y - X\beta\|^2 ,$$
parameters.
where
- $J(\theta^*)$ is called a local minimum,
$$J(\theta) \ge J(\theta^*)$$
where
- $J(\theta^*)$ is called a global minimum,
Notice that if the function is convex (right) it has no more than one
minimum.
where η controls the displacement of the point θ(t) and is known as step
size or learning step.
On the contrary, if η is too large the algorithm may skip a local minimum
or produce an update of θ that increases the objective function J(θ).
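A minimal sketch of the effect of the step size, assuming the usual gradient update θ(t+1) = θ(t) − η ∇J(θ(t)) and the toy cost J(θ) = θ², chosen here only for illustration.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Iterate theta <- theta - eta * grad(theta) a fixed number of times."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

grad_J = lambda theta: 2.0 * theta      # gradient of J(theta) = theta**2

small = gradient_descent(grad_J, 1.0, eta=0.1, steps=100)   # converges towards 0
large = gradient_descent(grad_J, 1.0, eta=1.1, steps=20)    # each update overshoots
print(small, large)
```

For this cost each update multiplies θ by (1 − 2η), so the iteration converges only if |1 − 2η| < 1.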
Sutskever, Martens, Dahl, Hinton, On the importance of initialization and momentum in deep
learning, 2013.
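Newton's method assumes that we know not only the gradient vector
$\nabla_\theta J(\theta)$ but also the matrix of second derivatives (Hessian matrix).
$$\nabla_\theta J = \begin{bmatrix} \dfrac{\partial J}{\partial\theta_1} \\ \dfrac{\partial J}{\partial\theta_2} \\ \vdots \\ \dfrac{\partial J}{\partial\theta_d} \end{bmatrix} \qquad H = \begin{bmatrix} \dfrac{\partial^2 J}{\partial\theta_1^2} & \dfrac{\partial^2 J}{\partial\theta_1\partial\theta_2} & \dots & \dfrac{\partial^2 J}{\partial\theta_1\partial\theta_d} \\ \dfrac{\partial^2 J}{\partial\theta_2\partial\theta_1} & \dfrac{\partial^2 J}{\partial\theta_2^2} & \dots & \dfrac{\partial^2 J}{\partial\theta_2\partial\theta_d} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 J}{\partial\theta_d\partial\theta_1} & \dfrac{\partial^2 J}{\partial\theta_d\partial\theta_2} & \dots & \dfrac{\partial^2 J}{\partial\theta_d^2} \end{bmatrix}$$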
Given a guess $\theta^{(t)}$ we can approximate the cost function J(θ) by the 2nd
order Taylor expansion
$$J(\theta^{(t)} + \Delta) \approx J(\theta^{(t)}) + \nabla_\theta J(\theta^{(t)})^T \Delta + \frac{1}{2}\Delta^T H(\theta^{(t)})\Delta$$
where Δ is a small displacement vector.
∇∆ J(θ(t) + ∆) = 0
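Setting the gradient of the quadratic approximation with respect to Δ to zero gives the linear system H Δ = −∇J, i.e., Δ = −H⁻¹∇J. The sketch below applies one such step to a two-dimensional quadratic cost, chosen here as an assumption for illustration; for a quadratic cost a single Newton step reaches the minimum.

```python
def newton_step(grad, hess, theta):
    """One Newton update in 2D: solve H * delta = -grad by Cramer's rule."""
    g1, g2 = grad(theta)
    (a, b), (c, d) = hess(theta)
    det = a * d - b * c
    d1 = (-g1 * d + b * g2) / det
    d2 = (-a * g2 + c * g1) / det
    return (theta[0] + d1, theta[1] + d2)

# Hypothetical cost J = t1^2 + t1*t2 + 10*t2^2, with its gradient and Hessian.
grad = lambda t: (2 * t[0] + t[1], t[0] + 20 * t[1])
hess = lambda t: ((2.0, 1.0), (1.0, 20.0))

theta = newton_step(grad, hess, (1.5, -2.0))
print(theta)   # the minimum of this cost is at the origin
```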
Predictor (model):
ŷ = f (x, θ).
The parameters of the model, θ, are estimated from a training set
$$T = \left\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\right\}.$$
But, learned systems are not perfect. The output of a learned system is
not always the desired output.
Training set (left) and predicted classes (right) using the k nearest
neighbor method.
Regression:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
Classification:
$$L(y, \hat{y}) = \begin{cases} 0 & y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$$
or
$$L(y = i, \hat{y} = j) = L_{ij}$$
with diagonal terms (no error) equal to zero.
R = E {L(y , ŷ (x))}
This requires the joint distribution of input and output p(x, y ) which is
usually unknown.
This requires the joint distribution of true and predicted class P(y , ŷ )
which is usually unknown.
Since the risk cannot be computed in most problems, we can replace the
expected value by an average of the loss computed with the training data,
$$R_e = \frac{1}{n}\sum_{i=1}^{n} L\left(y^{(i)}, f(x^{(i)})\right) .$$
The empirical risk of the fourth order polynomial is the smallest. But is
this the best model?
Important questions:
- is $R_e$ (computed in the training set) a good estimate of $R'_e$
(computed in an independent set)?
- can $R_e$ or $R'_e$ be used to choose the model hyperparameters (e.g.,
polynomial order)?
Evaluation of polynomial order
Average loss in the training set (blue) and in an independent set (red), as
a function of polynomial degree (hyperparameter).
Conclusions:
- The evaluation in the training set is too optimistic. An independent
data set is mandatory to obtain a reliable evaluation.
- The use of an independent data set allows the choice of model
hyperparameters (polynomial degree).
Evaluation of k Nearest neighbor
Data: K folds $T_k$, k = 1, ..., K.
for k = 1, ..., K do
    f = train(T \ T_k);
    P_k = perform(f, T_k);
end
P = average of P_k over the folds
Algorithm 3: Cross validation without hyperparameters. The final score is
the average over all the folds.
The final score is a combination of the evaluation of K models. The
method does not produce a final classifier/regressor.
Question: if we need one, what should we do?
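A minimal sketch of the cross validation loop, assuming the folds are built by taking every K-th example. The toy model (`train_mean`, which predicts the mean training output) and the score (`mse`) are hypothetical placeholders for train and perform.

```python
def cross_validate(data, K, train, perform):
    """Average the score of K models, each evaluated on its held-out fold."""
    folds = [data[k::K] for k in range(K)]
    scores = []
    for k in range(K):
        # Training set: all the folds except the k-th one.
        training = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        f = train(training)
        scores.append(perform(f, folds[k]))
    return sum(scores) / K

def train_mean(examples):
    mean = sum(y for _, y in examples) / len(examples)
    return lambda x: mean              # constant predictor

def mse(f, examples):
    return sum((y - f(x)) ** 2 for x, y in examples) / len(examples)

ramp = [(float(i), float(i)) for i in range(6)]
print(cross_validate(ramp, 3, train_mean, mse))
```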
MIT
Wikipedia
A neuron receives input signals through its dendrites. These signals are combined
in the soma and, from time to time, an electric impulse is generated that
travels through the axon and influences other cells.
The neuron model proposed by McCulloch & Pitts (1943) has a linear
part, followed by a nonlinearity:
$$s = [1 \;\; x^T]\,w = \tilde{x}^T w$$
Rosenblatt algorithm
1. training set: $T = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$, with
$x^{(k)} \in \mathbb{R}^p$, $y^{(k)} \in \{0, 1\}$;
2. initialization: randomly initialize the weights wi (0), i = 0, . . . , p;
3. new training example: present a new training pattern (x(t), y (t)) to
the model and compute the model output ŷ (t) = g (x̃ T (t)w (t − 1));
4. update: update the weights according to
The training data is the same but the outcome is different in each
experiment (why?)
Pros
It can be proved that the Rosenblatt algorithm solves any binary problem
in a finite number of iterations, provided the training data can be
separated by a hyperplane in feature space.
Cons
It does not provide a hint to deal with data that cannot be separated by
a hyperplane or to deal with regression problems that are not binary.
Most practical problems are noisy and fit into one of these categories.
Therefore, a single unit trained by the Rosenblatt algorithm is seldom
useful in practice.
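A minimal sketch of perceptron training on a linearly separable toy problem (logical AND). Several choices here are assumptions: the update w ← w + η (y − ŷ) x̃ is the standard perceptron rule, the activation is a step function, and the weights start at zero instead of random values so that the run is reproducible.

```python
def step(s):
    return 1 if s > 0 else 0

def predict(w, x):
    xt = [1.0] + list(x)                       # augmented input [1, x]
    return step(sum(wi * xi for wi, xi in zip(w, xt)))

def train_perceptron(data, eta=1.0, max_epochs=100):
    w = [0.0] * (len(data[0][0]) + 1)          # w[0] is the offset
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:
                errors += 1
                xt = [1.0] + list(x)
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xt)]
        if errors == 0:                        # every pattern is classified correctly
            break
    return w

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
print(w)
```

With random initialization or a different presentation order the final weights change, but any solution found separates the two classes.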
For the sake of simplicity, the activation function of each unit is not explicitly
represented (but it exists!). Offsets are not shown.
Weights
$$z_j = g(s_j)$$
$g(\cdot)$ is the activation function and the weight $w_{0j}$ is called the offset.
The units of the last layer are considered visible. They are the output
of the network and we denote their output by $\hat{y}_i$.
The units of the other layers are considered as hidden since we do not
know their desired values in the training phase. They are intermediate
variables used to compute the network output.
Activation functions:
- sigmoid: $g(s) = \dfrac{1}{1 + e^{-s}}$
- arctangent: $g(s) = \arctan s$
- linear unit: $g(s) = s$
- rectified linear unit: $g(s) = \max(0, s)$
w = {wij }
The network is thus a nonlinear map ŷ = f (x, w ) from the input space to
the output space, controlled by a set of weights w .
where $\hat{y}^{(k)}$ is the network output for the input $x^{(k)}$. A typical choice for
the loss function is the quadratic error
$$L(y, \hat{y}) = \|y - \hat{y}\|^2$$
The minimization of C is often achieved by using the gradient algorithm
$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t) , \qquad \Delta w_{ij}(t) = -\eta \left.\frac{\partial C}{\partial w_{ij}}\right|_{w(t)}$$
The gradient vector includes the contribution of all the training patterns.
The weight update using all the training patterns in each iteration is
called the batch mode.
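Batch mode:
$$\Delta w_{ij} = -\eta\frac{\partial C}{\partial w_{ij}} = -\eta\sum_k \frac{\partial L(y^{(k)}, \hat{y}^{(k)})}{\partial w_{ij}} = -\eta\sum_k \frac{\partial L^{(k)}}{\partial w_{ij}} .$$
On-line mode (one update per training pattern):
$$\Delta w_{ij} = -\eta\frac{\partial L^{(k)}}{\partial w_{ij}} .$$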
To train the weights wij we need the gradient of the loss function L.
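$$\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} \qquad\qquad \frac{dz}{dx} = \frac{\partial z}{\partial w}\frac{dw}{dx} + \frac{\partial z}{\partial v}\frac{dv}{dx}$$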
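Forward network:
$$s = w_0 + \sum_{i=1}^{p} w_i x_i , \qquad \hat{y} = z = g(s)$$
Gradient:
$$\frac{\partial L}{\partial w_p} = \frac{dL}{ds}\frac{\partial s}{\partial w_p} = \frac{dL}{ds}\, x_p$$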
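Using the chain rule, the derivative of L with respect to a weight $w_{pq}$ can
be computed as
$$\frac{\partial L}{\partial w_{pq}} = \frac{\partial L}{\partial s_q}\frac{\partial s_q}{\partial w_{pq}} = z_p\,\delta_q ,$$
Therefore,
$$\frac{\partial L}{\partial w_{pq}} = z_p\,\delta_q , \qquad \delta_q = \frac{\partial L}{\partial s_q} , \quad q \in \text{layer higher than 1}$$
$$z_j = g(s_j)$$
Output layer:
$$\delta_q = \frac{\partial L}{\partial s_q} = \frac{\partial L}{\partial z_q}\frac{\partial z_q}{\partial s_q} = g'(s_q)\frac{\partial L}{\partial z_q}$$
Hidden layers:
$$\delta_q = \frac{\partial L}{\partial s_q} = \sum_{j \in \text{next layer}} \frac{\partial L}{\partial s_j}\frac{\partial s_j}{\partial z_q}\frac{\partial z_q}{\partial s_q} = g'(s_q)\sum_{j \in \text{next layer}} w_{qj}\,\delta_j$$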
- momentum term;
- adaptive weights.
These techniques modify the weight update rule and were discussed
before in the optimization lesson. Next we summarize the steps involved
in the gradient algorithm with momentum term (batch and on-line).
Set t = 1 and $\Delta w_{ij}(0) = 0$. Repeat step 1 until the stopping criterion is met.
1. For k = 1, ..., n, perform steps 1.1 through 1.6
1.1 propagate forward: apply the training pattern $x^{(k)}$ to the perceptron
and compute the variables $z_i$ and outputs $\hat{y}^{(k)}$
1.2 compute the cost derivatives: $\dfrac{\partial L^{(k)}}{\partial \hat{y}_j^{(k)}}$
1.3 propagate backwards: apply $\dfrac{\partial L^{(k)}}{\partial \hat{y}_j^{(k)}}$ to the inputs of the backpropagation
network and compute its internal variables $\delta_j = \dfrac{\partial L^{(k)}}{\partial s_j}$
1.4 compute the gradient components: $\dfrac{\partial L^{(k)}}{\partial w_{ij}} = z_i\,\delta_j$
1.5 apply momentum: set $\Delta w_{ij}(t) = -\eta\, z_i\,\delta_j + \alpha\,\Delta w_{ij}(t-1)$
1.6 update the weights: set $w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)$
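A minimal sketch of the forward and backward passes for one training pattern, assuming a network with two inputs, two sigmoid hidden units, one linear output and the quadratic loss. The weight values are hypothetical, and `delta` denotes the derivative of the loss with respect to the input s of a unit. The analytic gradient is compared with a finite difference as a sanity check.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, W1, W2):
    z0 = [1.0] + list(x)                                   # inputs plus offset unit
    z1 = [1.0] + [sigmoid(sum(w * z for w, z in zip(wq, z0))) for wq in W1]
    y_hat = sum(w * z for w, z in zip(W2, z1))             # linear output unit
    return z0, z1, y_hat

def loss(x, y, W1, W2):
    return (y - forward(x, W1, W2)[2]) ** 2

def backprop(x, y, W1, W2):
    z0, z1, y_hat = forward(x, W1, W2)
    delta_out = -2.0 * (y - y_hat)                         # g'(s) = 1 for a linear unit
    grad_W2 = [z * delta_out for z in z1]                  # dL/dw_pq = z_p * delta_q
    # Hidden units: delta_q = g'(s_q) * w_q,out * delta_out, with g' = z (1 - z).
    delta_hid = [z1[q + 1] * (1.0 - z1[q + 1]) * W2[q + 1] * delta_out
                 for q in range(len(W1))]
    grad_W1 = [[z * d for z in z0] for d in delta_hid]
    return grad_W1, grad_W2

W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.5]]   # one row [offset, w1, w2] per hidden unit
W2 = [0.2, -0.3, 0.6]                        # [offset, w1, w2] of the output unit
x, y = (1.0, 2.0), 1.0
grad_W1, grad_W2 = backprop(x, y, W1, W2)

eps = 1e-6                                   # finite-difference check on one weight
W1_shift = [row[:] for row in W1]
W1_shift[0][1] += eps
numeric = (loss(x, y, W1_shift, W2) - loss(x, y, W1, W2)) / eps
print(grad_W1[0][1], numeric)
```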
- Consider an MLP with one layer and linear units. Prove that the
input-output map performed by the perceptron is a linear (affine)
transformation
$$f(x) = Ax + b$$
- Prove this statement for the case of an MLP with two layers and
linear units.
Imagine that you want to distinguish images of horses and cats. How
would you proceed?
Finding a set of rules on low level image features (e.g., color, corners)
seems to be unfeasible!
History
Since then, all the winners of the ImageNet challenge have been convolutional
neural networks.
End-to-end architecture
AlexNet learns to directly compute the image label (class) for the input
image. Of course, this requires a large training set (more than 1 million
images).
The classic blocks (feature extraction and classification) are both learned
from the training data without the use of handcrafted features.
Each kernel has a localized support in the first two (spatial) coordinates
and it is full range in the third (depth) coordinate.
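3D input: $z^{\ell-1}_{ijk}$, where $\ell - 1$ is the number of the input layer
3D kernel: $h^{\ell}_{ijk}$
2D output:
$$s^{\ell}_{ij} = \sum_p \sum_q \sum_r h^{\ell}_{pqr}\, z^{\ell-1}_{i+p,\,j+q,\,0+r}$$
$$z^{\ell}_{ij} = g(s^{\ell}_{ij})$$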
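3D input: $z^{\ell-1}_{ijk}$, where $\ell - 1$ is the number of the input layer
3D output:
$$z^{\ell}_{ijk} = \max_{p,q \in \{0, \dots, \Delta-1\}} \left\{ z^{\ell-1}_{\Delta i+p,\,\Delta j+q,\,k} \right\}$$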
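A minimal sketch of the two operations with nested lists. It assumes that the convolution output is computed only where the kernel fits entirely inside the input, uses max(0, s) as the activation by default, and applies the pooling to a single 2D slice (one value of k). The function names are hypothetical.

```python
def conv_layer(z, h, g=lambda s: max(0.0, s)):
    """z[i][j][k]: 3D input; h[p][q][r]: 3D kernel spanning the full depth."""
    P, Q, R = len(h), len(h[0]), len(h[0][0])
    rows, cols = len(z) - P + 1, len(z[0]) - Q + 1
    return [[g(sum(h[p][q][r] * z[i + p][j + q][r]
                   for p in range(P) for q in range(Q) for r in range(R)))
             for j in range(cols)] for i in range(rows)]

def max_pool(z, delta):
    """Max over non-overlapping delta x delta blocks of a 2D slice."""
    return [[max(z[delta * i + p][delta * j + q]
                 for p in range(delta) for q in range(delta))
             for j in range(len(z[0]) // delta)]
            for i in range(len(z) // delta)]

# 3 x 3 input with depth 1 holding the values 1..9, and a 2 x 2 x 1 kernel of ones.
z = [[[float(3 * i + j + 1)] for j in range(3)] for i in range(3)]
h = [[[1.0], [1.0]], [[1.0], [1.0]]]
s = conv_layer(z, h)
print(s, max_pool(s, 2))
```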
AlexNet has 60 million weights, almost all of them associated with the
last three layers (fully connected layers).
- deeper network
- deeper layers
- kernels: smaller spatial dimensions (3 × 3)
- inception module
- 1 × 1 convolution
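$$f_i : \mathbb{R}^p \to \mathbb{R} , \quad i = 0, \dots, K-1$$
$$f_i(x) \ge f_j(x) , \quad \forall j \ne i.$$
$$R_j = \left\{ x \in \mathbb{R}^d : f(x) = \omega_j \right\}.$$
$$P_{ij} = \Pr\{y = i, \hat{y} = j\}$$
Properties:
$$P_{ij} \in [0, 1], \ \forall i, j \qquad\qquad \sum_{i=0}^{K-1}\sum_{j=0}^{K-1} P_{ij} = 1.$$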
Proof
Confusion matrix
$$P_{00} = \Pr(y=0, \hat{y}=0) = \Pr(\hat{y}=0 \mid y=0)\Pr(y=0) = P_0\int_{R_0} p(x \mid y=0)\,dx = P_0\int_0^T 1\,dx = P_0 T ,$$
$$P_{01} = P_0 - P_{00} = P_0(1 - T),$$
$$P_{10} = \Pr(y=1, \hat{y}=0) = \Pr(\hat{y}=0 \mid y=1)\Pr(y=1) = P_1\int_{R_0} p(x \mid y=1)\,dx = P_1\int_0^T 2x\,dx = P_1 T^2 ,$$
$$P_{11} = P_1 - P_{10} = P_1(1 - T^2).$$
Confusion matrix
$$P = \begin{bmatrix} P_0 T & P_0(1 - T) \\ P_1 T^2 & P_1(1 - T^2) \end{bmatrix}$$
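A small numerical sketch of this example. It assumes, as in the integrals above, x in [0, 1] with p(x|y=0) = 1 and p(x|y=1) = 2x, and a classifier that decides class 0 when x < T. The probability of error is taken as the sum of the off-diagonal terms. The function names are hypothetical.

```python
def confusion_matrix(T, P0, P1):
    """Joint probabilities P[i][j] = Pr(y = i, y_hat = j) for threshold T."""
    return [[P0 * T, P0 * (1 - T)],
            [P1 * T ** 2, P1 * (1 - T ** 2)]]

def error_probability(T, P0, P1):
    P = confusion_matrix(T, P0, P1)
    return P[0][1] + P[1][0]          # off-diagonal terms are the errors

P = confusion_matrix(0.5, 0.5, 0.5)
print(P, error_probability(0.5, 0.5, 0.5))
```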
Probability of error
Examples:
binary loss: $L(y, \hat{y}) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{otherwise} \end{cases}$.
R = E {L(y , ŷ (x))} .
This is known as the Bayes classifier and chooses the class with greatest
a posteriori probability (the most probable class, given the observations).
The Bayes classifier with binary loss is optimal in the sense that it
minimizes the probability of decision error.
and choose the class with smallest cost, i.e., the feature vector x should
be classified as follows
This classifier is optimal in the sense that it minimizes the risk for a
general loss function; it is also known as the Bayes classifier.
$$P(y = i \mid x) = \frac{p(x \mid y = i)\,P(y = i)}{p(x)} ,$$
where
- $p(x \mid y = i)$: distribution of the feature vector x associated to class i;
- $P(y = i)$: a priori distribution of the classes (before knowing the
observations);
- $p(x)$: normalization term that does not influence the decision,
$$p(x) = \sum_{y \in \Omega} p(x \mid y)\,P(y).$$
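A minimal sketch of the posterior computation for a single observation, assuming that the class-conditional likelihoods p(x|y) have already been evaluated at x. The numbers are hypothetical.

```python
def posterior(likelihoods, priors):
    """Return P(y = i | x) for every class i from p(x | y = i) and P(y = i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    px = sum(joint)                    # normalization term p(x)
    return [j / px for j in joint]

post = posterior([0.2, 0.6], [0.5, 0.5])
decision = max(range(len(post)), key=lambda i: post[i])   # most probable class
print(post, decision)
```

Since p(x) is common to all the classes, the same decision is obtained by comparing the unnormalized products p(x|y=i)P(y=i).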
In many practical problems, all we know is a training set
$T = \{(x^{(i)}, y^{(i)}),\ i = 1, \dots, n\}$ with n realizations of the pair (X, Y).
In the digit recognition problem this means that we have to estimate the
conditional distribution of each pixel which is a simple task.
1. Draw a pair of scatter plots for two features $(x_1, x_2)$, assuming that
they are
- dependent;
- independent.
$$\left\{ x \in \mathbb{R}^p : [1 \;\; x^T](\beta_i - \beta_j) = 0 \right\},$$
Some classifiers represent the class label by numbers and use regression
methods to predict those numbers.
One idea: ω0 → 0
ω1 → 1
ω2 → 2
This does not make much sense because in most problems there is no
natural order among the class labels.
y0 y1 y2
ω0 → 1 0 0
ω1 → 0 1 0
ω2 → 0 0 1
In the test phase, new feature vectors are classified by computing the
predictors $f_i(x)$ and selecting the one with the greatest value.
SSE = n0 (0 − ŷ )2 + n1 (1 − ŷ )2
Why does the linear regression with indicator variables fail in the second
example?
Example - 2D data
This slide shows two problems with 2D features and a linear model. Only
the first problem can be solved by linear models. Why?
In the second case all the training features are classified in the same class.
Notice, though, that these are not linear models with respect to x, but
they are linear in the parameters, which can be estimated by solving a
linear system of equations.
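where
$$g(s) = \frac{1}{1 + e^{-s}}$$
$$\ell(\beta) = \sum_{i=1}^{n} \left\{ y^{(i)}\log\left[g(x^{(i)T}\beta)\right] + (1 - y^{(i)})\log\left[1 - g(x^{(i)T}\beta)\right] \right\}.$$
$$\begin{aligned} \nabla_\beta \ell(\beta) &= \nabla_\beta \left\{ y\log g(x^T\beta) + (1 - y)\log\left[1 - g(x^T\beta)\right] \right\} \\ &= y\,\frac{g'(x^T\beta)}{g(x^T\beta)}\,x + (1 - y)\,\frac{-g'(x^T\beta)}{1 - g(x^T\beta)}\,x \\ &= y\left[1 - g(x^T\beta)\right]x - (1 - y)\,g(x^T\beta)\,x \\ &= \left[y - g(x^T\beta)\right]x \end{aligned}$$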
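A minimal sketch of gradient ascent on the log-likelihood, using the gradient [y − g(xᵀβ)] x summed over the training examples. The toy data, the step size `gamma` and the number of steps are assumptions; each feature vector starts with a constant 1 so that the first coefficient acts as the offset.

```python
import math

def g(s):
    return 1.0 / (1.0 + math.exp(-s))

def log_likelihood(beta, data):
    total = 0.0
    for x, y in data:
        p = g(sum(b * xi for b, xi in zip(beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(beta, data):
    grad = [0.0] * len(beta)
    for x, y in data:
        err = y - g(sum(b * xi for b, xi in zip(beta, x)))   # y - g(x^T beta)
        grad = [gi + err * xi for gi, xi in zip(grad, x)]
    return grad

def fit(data, gamma=0.1, steps=200):
    beta = [0.0] * len(data[0][0])
    for _ in range(steps):
        beta = [b + gamma * gi for b, gi in zip(beta, gradient(beta, data))]
    return beta

# Hypothetical 1D problem; each input is (1, x).
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]
beta = fit(data)
print(beta, log_likelihood(beta, data))
```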
In the second case all the training features are classified in the same
class. The model is too rigid.
Logistic regression with more flexible models
Notice that these models are not linear models with respect to x.
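log-likelihood (1 training example): $\ell = \sum_{i=0}^{K-1} y_i\log(\hat{y}_i)$
Derivatives:
$$\frac{\partial\ell}{\partial\hat{y}_i} = \frac{y_i}{\hat{y}_i}$$
$$\frac{\partial\hat{y}_i}{\partial s_k} = \begin{cases} \hat{y}_i(1 - \hat{y}_i) & i = k \\ -\hat{y}_i\hat{y}_k & i \ne k \end{cases}$$
$$\frac{\partial\ell}{\partial s_i} = \sum_{k=0}^{K-1} \frac{\partial\ell}{\partial\hat{y}_k}\frac{\partial\hat{y}_k}{\partial s_i} = \frac{\partial\ell}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial s_i} + \sum_{k \ne i} \frac{\partial\ell}{\partial\hat{y}_k}\frac{\partial\hat{y}_k}{\partial s_i}$$
$$= \frac{y_i}{\hat{y}_i}\,\hat{y}_i(1 - \hat{y}_i) - \sum_{k \ne i} \frac{y_k}{\hat{y}_k}\,(\hat{y}_k\hat{y}_i) = y_i - \hat{y}_i$$
$$\frac{\partial\ell}{\partial\beta_{ij}} = \frac{\partial\ell}{\partial s_i}\frac{\partial s_i}{\partial\beta_{ij}} = (y_i - \hat{y}_i)\,x_j$$
gradient:
$$\frac{\partial\ell}{\partial\beta_{ij}} = \sum_{m=1}^{n} \left(y_i^{(m)} - \hat{y}_i^{(m)}\right) x_j^{(m)}$$
$$\beta_{ij}^{(t+1)} = \beta_{ij}^{(t)} + \gamma\,\frac{\partial\ell}{\partial\beta_{ij}} , \quad \gamma > 0 .$$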
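A numerical check of the result that the derivative of the log-likelihood with respect to s_i equals y_i − ŷ_i. It assumes that the outputs are obtained from the scores by the softmax function, which is the map whose derivatives match the ones listed above; the scores and the indicator vector y are hypothetical.

```python
import math

def softmax(s):
    m = max(s)                                  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]

def log_lik(s, y):
    return sum(yi * math.log(pi) for yi, pi in zip(y, softmax(s)))

s = [0.5, -1.0, 2.0]
y = [0, 1, 0]                                   # indicator of the true class

analytic = [yi - pi for yi, pi in zip(y, softmax(s))]

eps = 1e-6
numeric = []
for i in range(len(s)):
    shifted = s[:]
    shifted[i] += eps
    numeric.append((log_lik(shifted, y) - log_lik(s, y)) / eps)

print(analytic)
print(numeric)
```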
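$$P(y = i \mid x) = \frac{p(x \mid y = i)\,P(y = i)}{p(x)} \propto p(x \mid y = i)\,P_i ,$$
$$\log C - \frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i) + \log P_i = \log C - \frac{1}{2}(x - \mu_j)^T\Sigma^{-1}(x - \mu_j) + \log P_j$$
$$-2\mu_i^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}\mu_i = -2\mu_j^T\Sigma^{-1}x + \mu_j^T\Sigma^{-1}\mu_j - 2\log\frac{P_j}{P_i}$$
$$(\mu_i - \mu_j)^T\Sigma^{-1}x = \frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\frac{P_j}{P_i}$$
$$(\mu_i - \mu_j)^T\Sigma^{-1}x = \frac{1}{2}(\mu_i + \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j) + \log\frac{P_j}{P_i}$$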
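$$\hat{P}(\omega_k) = \frac{n_k}{n}$$
$$\hat\mu_k = \frac{1}{n_k}\sum_{i:\,y^{(i)} = k} x^{(i)}$$
$$\hat\Sigma = \frac{1}{n - K}\sum_{k=1}^{K}\left[\sum_{i:\,y^{(i)} = k} (x^{(i)} - \hat\mu_k)(x^{(i)} - \hat\mu_k)^T\right].$$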
In the second case all the training features are classified in the same
class. The model is too rigid.
Main idea: separate the cloud of data in two regions, using a carefully
chosen hyperplane.
$$x \cdot w + b = 0$$
where
- $x \in \mathbb{R}^p$: point on the hyperplane
- $w \in \mathbb{R}^p$: normal vector to the hyperplane
- $b \in \mathbb{R}$: offset
- $x \cdot w$ is the inner product between $w, x \in \mathbb{R}^p$
distance to the origin: $\dfrac{|b|}{\|w\|}$
x · w + b > 0 ⇒ ŷ = +1
x · w + b < 0 ⇒ ŷ = −1
Therefore,
ŷ = sign(x · w + b)
Let us assume that the training data can be separated without errors by
a hyperplane (linearly separable data). In fact, if there is one, there is an
infinite number of separating hyperplanes ...
w ·x +b =0
Consider a hyperplane that separates the training data without errors and
is equally distant to the nearest examples of both classes.
Margin: $\dfrac{2}{\|w\|}$
The SVM classifier chooses the hyperplane with the maximum margin
(maximum margin classifier).
Difficulty:
x1 x2 y
x1 x2 y
0 3 −1
0 9 −1
0 −3 −1
4 1 −1
4 1 −1
4 4 −1
4 −2 −1
0 0 −1
0 0 +1
0 4 +1
0 −2 +1
1 1 +1
1 1 +1
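Support vectors:
class -1 : (0, 9), (4, 1)
class +1 : (0, 4)
Margin hyperplanes:
$$\begin{bmatrix} 0 & 9 & 1 \\ 4 & 1 & 1 \\ 0 & 4 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix} \qquad w = -\frac{1}{5}\begin{bmatrix} 4 \\ 2 \end{bmatrix} , \quad b = \frac{13}{5} .$$
Margin: $\dfrac{2}{\|w\|} = \sqrt{5}$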
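The solution can be checked numerically. The sketch below evaluates x · w + b at the three support vectors with exact fractions and computes the margin 2/‖w‖; the variable names are hypothetical.

```python
import math
from fractions import Fraction

w = (Fraction(-4, 5), Fraction(-2, 5))
b = Fraction(13, 5)

def score(x):
    """Signed value x . w + b for a 2D point x."""
    return w[0] * x[0] + w[1] * x[1] + b

# Support vectors and the value they must take on the margin hyperplanes.
support = {(0, 9): -1, (4, 1): -1, (0, 4): +1}
for x, y in support.items():
    print(x, score(x), y)

margin = 2 / math.sqrt(float(w[0] ** 2 + w[1] ** 2))
print(margin)
```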
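Lagrangian function
$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i\left[y^{(i)}(x^{(i)} \cdot w + b) - 1\right] ,$$
$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i\,y^{(i)}(x^{(i)} \cdot w + b) + \sum_{i=1}^{n} \alpha_i ,$$
Optimization
$$\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i\,y^{(i)}x^{(i)} ,$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i\,y^{(i)} = 0 .$$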
The dual formulation depends only on the inner products between input
vectors x (i) · x (j) . This is very important!
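$$L_P = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\left(\sum_{i=1}^{n} \alpha_i\,y^{(i)}x^{(i)}\right) \cdot \left(\sum_{j=1}^{n} \alpha_j\,y^{(j)}x^{(j)}\right)$$
$$L_P = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\,y^{(i)}(x^{(i)} \cdot x^{(j)})\,y^{(j)}\alpha_j$$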
Matrix H does not require the training patterns themselves, x (i) , but only
inner products of training vectors x (i) · x (j) .
The SVM algorithm provides not only a decision but also a score
f (x) = x · w + b.
SVMs can be extended to deal with data that is not linearly separable. In
this case it is not possible to classify all the training vectors without
errors using a hyperplane.
The idea is to assign a slack variable $\xi_i$ to each data point $x^{(i)}$,
defined in such a way that $\xi_i = 0$ if no margin violation occurs and $\xi_i > 0$
if the ith point is on the wrong side of the margin hyperplane.
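Soft margin penalty: $C\sum_{i=1}^{n} \xi_i$
with $\xi_i \ge 0$.
Optimization problem:
$$\min\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y^{(i)}(x^{(i)} \cdot w + b) - 1 + \xi_i \ge 0, \ \forall i.$$
Lagrangian function
$$L_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i\left[y^{(i)}(x^{(i)} \cdot w + b) - 1 + \xi_i\right] - \sum_{i=1}^{n} \mu_i\xi_i$$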
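$$\max_\alpha\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\alpha^T H\alpha , \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \ \forall i , \quad \sum_{i=1}^{n} \alpha_i\,y^{(i)} = 0$$
where $\alpha = [\alpha_1 \dots \alpha_n]^T$ and $H_{ij} = y^{(i)}(x^{(i)} \cdot x^{(j)})\,y^{(j)}$.
A soft margin classifier with large C is equal to the hard margin classifier.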
Example - data not linearly separable
This example shows data not linearly separable, classified with soft
margin.
with λ = 1/(2C ).
Linear SVMs classify data using a hyperplane trained with hard margin
or soft margin. This is too restrictive, especially when the dimension of
the input space is low.
x̃ = φ(x)
The linear SVM algorithm (dual formulation) does not require the input
vectors x (i) but only inner products between them x (i) · x (j) .
The good news is that we can compute these inner products using a
kernel function
$$k(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)}),$$
that can be computed in the low-dimensional input space.
The nonlinear SVM can be trained and tested using low-dimensional data,
by replacing the inner products by the kernel.
rbf: $k(x^{(i)}, x^{(j)}) = e^{-\frac{1}{2\sigma^2}\|x^{(i)} - x^{(j)}\|^2}$
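A small sketch of the kernel idea. The quadratic kernel (x · z)² and its explicit feature map `phi` are an illustrative assumption (they are not taken from the text); they show that a kernel evaluated in the 2D input space equals an inner product in a 3D feature space. The rbf kernel follows the expression above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map of the 2D quadratic kernel (illustrative assumption)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def k_quadratic(x, z):
    return dot(x, z) ** 2              # computed in the input space

def k_rbf(x, z, sigma=1.0):
    dist2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-dist2 / (2 * sigma ** 2))

x, z = (1.0, 2.0), (3.0, -1.0)
print(k_quadratic(x, z), dot(phi(x), phi(z)))   # the two values agree
print(k_rbf(x, z))
```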
Decision trees are popular classifiers because they make it possible to
understand why the input pattern is classified in a specific class.
Decision trees are often used when the features are categorical, although
they have been extended to numerical features as well.
Suppose we wish to predict what are the good days to play tennis and we
have the following dataset associated to a player (John).
Days with the same attributes may have different outcomes (noisy labels).
A decision tree
A decision tree that solves the problem is
node 3 (pure subset!)
Day  Outlook   Humidity  Wind    Play
3    Overcast  High      Weak    Yes
7    Overcast  Normal    Strong  Yes
12   Overcast  High      Strong  Yes
13   Overcast  Normal    Weak    Yes

node 2
Day  Outlook  Humidity  Wind    Play
1    Sunny    High      Weak    No
2    Sunny    High      Strong  No
8    Sunny    High      Weak    No
9    Sunny    Normal    Weak    Yes
11   Sunny    Normal    Strong  Yes

node 4
Day  Outlook  Humidity  Wind    Play
4    Rain     High      Weak    Yes
5    Rain     Normal    Weak    Yes
6    Rain     Normal    Strong  No
10   Rain     Normal    Weak    Yes
14   Rain     High      Strong  No
If the training examples associated to a node have the same label, the
node is called pure. Pure nodes are not split anymore and receive a label.
What training data is associated to each node? (2)
node 5
Day  Outlook  Humidity  Wind    Play
9    Sunny    Normal    Weak    Yes
11   Sunny    Normal    Strong  Yes

node 6
Day  Outlook  Humidity  Wind    Play
1    Sunny    High      Weak    No
2    Sunny    High      Strong  No
8    Sunny    High      Weak    No
both subsets (nodes 5 and 6) are pure!

node 7
Day  Outlook  Humidity  Wind    Play
4    Rain     High      Weak    Yes
5    Rain     Normal    Weak    Yes
10   Rain     Normal    Weak    Yes

node 8
Day  Outlook  Humidity  Wind    Play
6    Rain     Normal    Strong  No
14   Rain     High      Strong  No
both subsets (nodes 7 and 8) are pure!
Classification of new data is done in the same way. Given a feature vector
x, we travel along the tree based on the feature values until we reach a
leaf (with a label).
There will be classification errors not only in the test set but also in the
training set, if there are examples with the same attributes and different
labels. This is known as noisy labels.
Consider the tennis data set and the decision tree shown above. Find the
probability of each label (Yes or No), at each node.
- Misclassification error:
- Entropy:
$$i(m) = -\sum_{k=1}^{K} P(k \mid m)\log_2 P(k \mid m)$$
- Gini index:
$$i(m) = \sum_{k=1}^{K} P(k \mid m)\,(1 - P(k \mid m))$$
The entropy and Gini index are smoother and they are usually preferred
in model training.
To overcome the first difficulty, we start with a single node m (root) and
choose the best attribute for splitting the node. This is done as follows.
The splitting process is repeated for another leaf node until a stop
condition is met, for example, until all the leaves are pure or all the
attributes have been tested.
The ID3 algorithm is a basic tree learning method for categorical data.
When the data is noisy (noisy label or noisy attributes) the ID3 algorithm
may overfit the training data leading to poor performance in independent
data sets (test sets).
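Training data (O = Outlook, H = Humidity, W = Wind, P = Play):
O  H  W  P
S  H  W  N
S  H  S  N
O  H  W  Y
R  H  W  Y
R  N  W  Y
R  N  S  N
O  N  S  Y
S  H  W  N
S  N  W  Y
R  N  W  Y
S  N  S  Y
O  H  S  Y
O  N  W  Y
R  H  S  N

Outlook   N  Y
R         2  3
S         3  2
O         0  4
$$i(O) = \frac{5}{14} \times 0.97 + \frac{5}{14} \times 0.97 + \frac{4}{14} \times 0 = 0.69$$

Humidity  N  Y
H         4  3
N         1  6
$$i(H) = \frac{7}{14} \times 0.98 + \frac{7}{14} \times 0.59 = 0.78$$

Wind      N  Y
S         3  3
W         2  6
$$i(W) = \frac{6}{14} \times 1 + \frac{8}{14} \times 0.81 = 0.89$$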
Best choice for the root is Outlook and Outlook=Overcast is a pure node with
label Yes.
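The three impurities can be recomputed from the (No, Yes) counts of each table. This is a minimal sketch using the entropy criterion; the function names are hypothetical.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def split_impurity(children):
    """Weighted average of the entropies of the child nodes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# (No, Yes) counts for each value of each attribute, taken from the tables.
splits = {
    "Outlook":  [(2, 3), (3, 2), (0, 4)],
    "Humidity": [(4, 3), (1, 6)],
    "Wind":     [(3, 3), (2, 6)],
}
for name, children in splits.items():
    print(name, round(split_impurity(children), 3))

best = min(splits, key=lambda a: split_impurity(splits[a]))
print(best)
```

The attribute with the smallest average impurity after the split is selected.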
Solution (cont.)
Node: Outlook=Sunny
H  W  P
H  W  N
H  S  N
H  W  N
N  W  Y
N  S  Y

Humidity  N  Y
H         3  0
N         0  2
Two pure nodes: i(H) = 0

Node: Outlook=Rain
H  W  P
H  W  Y
N  W  Y
N  S  N
N  W  Y
H  S  N

Humidity  N  Y
H         1  1
N         1  2
$$i(H) = \frac{2}{5} \times 1 + \frac{3}{5} \times 0.91 = 0.95$$

Wind      N  Y
S         2  0
W         0  3
I (T ) + αÑ
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
- Compute the impurity drop for the splitting of the root node, using
the entropy criterion. What do you conclude?
- It is better to grow the tree until the leaves are pure or all attributes
have been used, and then prune the tree.
Test all the splitting nodes in a bottom-up way. For each tested node,
remove its descendants (subtree) and replace the splitting node by a leaf.
The change is accepted if the modified tree has a better or equal
performance (number of errors in the validation set).
Error estimation
When there is no validation set (too few training examples) the error in
the validation set is predicted by using a pessimistic estimator given by
$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}} ,$$
where
- z: parameter that depends on the confidence degree c (if c = 25%,
z = 0.69)
- f: fraction of errors in the training set
- N: number of training examples in the leaf
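A direct transcription of the estimator as a function, assuming f is given as a fraction in [0, 1]. The values of f and N in the example are hypothetical.

```python
import math

def pessimistic_error(f, N, z=0.69):
    """Pessimistic error estimate for a leaf with training error f and N examples."""
    num = f + z * z / (2 * N) + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

e_small = pessimistic_error(0.2, 10)      # few examples: large correction
e_large = pessimistic_error(0.2, 1000)    # many examples: close to f
print(e_small, e_large)
```

The estimate is larger than the training error f and approaches it as the number of examples in the leaf grows.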
There is only one difficulty. We do not know the ideal distribution P. All
we know is the first data set T and the empirical distribution computed
from it.
The trick consists of generating the multiple data sets $T^{(i)}$ using the
bootstrap method, i.e., by sampling the set T n times with
replacement.
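A minimal sketch of bootstrap sampling, assuming each new data set has the same size n as T. The function name `bootstrap_sets`, the number of sets B and the fixed seed are assumptions made for reproducibility.

```python
import random

def bootstrap_sets(T, B, seed=0):
    """Generate B data sets, each with n examples drawn from T with replacement."""
    rng = random.Random(seed)
    n = len(T)
    return [[T[rng.randrange(n)] for _ in range(n)] for _ in range(B)]

T = list(range(10))                # placeholder training set
sets = bootstrap_sets(T, B=5)
print(sets[0])
```

Because the sampling is done with replacement, some examples of T appear several times in a bootstrap set while others are left out.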
where the outcomes y (i) belong to a finite set of labels {0, . . . , K − 1}.