Introduction To Machine Learning Lecture 3: Linear Classification Methods
• Use N discriminant functions, $y_i, y_j, y_k, \ldots$ and pick the max.
– This is guaranteed to give consistent and convex decision regions if each $y$ is linear:
$$y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A) \ \text{and} \ y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$$
implies (for $0 < \alpha < 1$) that
$$y_k\big(\alpha \mathbf{x}_A + (1-\alpha)\,\mathbf{x}_B\big) > y_j\big(\alpha \mathbf{x}_A + (1-\alpha)\,\mathbf{x}_B\big)$$
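As a concrete illustration, here is a minimal NumPy sketch of the pick-the-max rule over N linear discriminants. This is not lecture code; the weights, biases, and data are made-up example values.

```python
import numpy as np

# Sketch: evaluate N linear discriminants y_k(x) = w_k . x + b_k, pick the max.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # one weight vector per class (3 classes, 2 features)
b = rng.normal(size=3)        # one bias per class

def classify(X):
    """Return the index of the largest discriminant for each row of X."""
    scores = X @ W.T + b      # y_k(x) for every class k
    return np.argmax(scores, axis=1)

X = rng.normal(size=(5, 2))
print(classify(X))            # class indices in {0, 1, 2}
```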
Using “least squares” for classification
[Figure: least squares regression applied to classification.]
• Fisher’s objective function is:
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\text{between-class separation}}{\text{within-class scatter}}$$
More math of Fisher’s linear discriminants
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T \mathbf{S}_B\, \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W\, \mathbf{w}}$$
$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$
$$\mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$
Optimal solution: $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$
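A minimal sketch of this solution (assumed example code, not from the lecture; the toy data are made up):

```python
import numpy as np

# Fisher's linear discriminant: w is proportional to S_W^{-1} (m2 - m1).
def fisher_direction(X1, X2):
    """X1, X2: (n_i, d) arrays of samples from class 1 and class 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of centered samples.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(Sw, m2 - m1)   # solve S_W w = (m2 - m1)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], size=(100, 2))
X2 = rng.normal(loc=[2, 1], size=(100, 2))
w = fisher_direction(X1, X2)
print(w / np.linalg.norm(w))   # unit vector along the discriminant direction
```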
Perceptrons
• “Perceptrons” describes a whole family of learning
machines, but the standard type consisted of a layer of
fixed non-linear basis functions followed by a simple
linear discriminant function.
– They were introduced in the late 1950s and they had a simple online learning procedure.
– Grand claims were made about their abilities. This led
to lots of controversy.
– Researchers in symbolic AI emphasized their limitations (as part of an ideological campaign against real numbers, probabilities, and learning).
• Support Vector Machines are just perceptrons with a
clever way of choosing the non-adaptive, non-linear
basis functions and a better learning procedure.
– They have all the same limitations as perceptrons in
what types of function they can learn.
• But people seem to have forgotten this.
The perceptron convergence procedure
• Add an extra component with value 1 to each feature
vector. The “bias” weight on this component is minus the
threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that every training case will keep getting picked.
– If the output is correct, leave its weights alone.
– If the output is 0 but should be 1, add the feature vector to the weight vector.
– If the output is 1 but should be 0, subtract the feature vector from the weight vector.
• This is guaranteed to find a set of weights that gets the right answer on the whole training set, if any such set exists.
• There is no need to choose a learning rate.
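Below is a minimal sketch of this procedure, assuming a simple cyclic pass over the training cases; the function name and the toy OR data are my own, not from the lecture.

```python
import numpy as np

# The perceptron convergence procedure: append a constant-1 component so the
# bias replaces the threshold, then add/subtract misclassified feature vectors.
def train_perceptron(X, t, epochs=100):
    """X: (n, d) features; t: (n,) targets in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # extra component with value 1
    w = np.zeros(Xb.shape[1])                   # bias is the last weight
    for _ in range(epochs):                     # every case keeps getting picked
        errors = 0
        for x, target in zip(Xb, t):
            y = 1 if w @ x >= 0 else 0
            if y == 0 and target == 1:
                w += x                          # add the feature vector
            elif y == 1 and target == 0:
                w -= x                          # subtract the feature vector
            errors += int(y != target)
        if errors == 0:                         # all training cases correct
            break
    return w

# Example: a linearly separable problem (logical OR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
print(train_perceptron(X, t))
```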
A natural way to try to prove convergence
• The obvious approach is to write down an error function
and try to show that each step of the learning procedure
reduces the error.
– For stochastic online learning we would like to show
that each step reduces the expected error, where the
expectation is across the choice of training cases.
– It cannot be a squared error because the size of the
update does not depend on the size of the mistake.
• The textbook tries to use the sum of the distances on the
wrong side of the decision surface as an error measure.
– Its conclusion is that the perceptron convergence
procedure is not guaranteed to reduce the total error
at each step.
• This is true for that error function even if there is a set of
weights that gets the right answer for every training case.
Weight and data space
• Imagine a space in which each axis corresponds to a feature value or to the weight on that feature.
– A point in this space is a weight vector. Feature vectors are shown in blue, translated away from the origin to reduce clutter.
• Each training case defines a plane (through the origin, perpendicular to that case’s feature vector).
– On one side of the plane the output is wrong.
• To get all training cases right we need to find a point on the right side of all the planes.
– This feasible region (if it exists) is a cone with its tip at the origin.
[Figure: weight space, with the origin at the tip of a cone of good weights and bad weights outside it; two feature vectors are shown, one with correct answer = 0 and one with correct answer = 1.]
A better way to prove the convergence
(using the convexity of the solutions in weight-space)
• The obvious type of error function measures the
discrepancy between the targets and the model’s
outputs.
• A different type of cost function is the squared distance between the current weight vector and a feasible weight vector.
– Using this cost function we can show that every
step of the procedure reduces the error.
• Provided a set of feasible weights exists.
• Using this type of cost function, the procedure can
easily be generalized to more than two classes
using the MAX decision rule.
Why the learning procedure works
• Consider the squared • So consider “generously satisfactory”
distance between any weight vectors that lie within the
satisfactory weight vector feasible region by a margin at least as
and the current weight great as the largest update.
vector. – Every time the perceptron makes a
– Every time the mistake, the squared distance to all
perceptron makes a of these weight vectors is always
mistake, the learning decreased by at least the squared
algorithm reduces the length of the smallest update vector.
squared distance
between the current
weight vector and any
satisfactory weight
vector (unless it crosses
the decision plane).
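To make the key step concrete (a sketch in my own notation, not the lecture’s): suppose case $\mathbf{x}$ with target 1 is misclassified, so $\mathbf{w}^T\mathbf{x} \le 0$ and the update is $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$, and assume the generous margin means ${\mathbf{w}^*}^T\mathbf{x} \ge \|\mathbf{x}\|^2$ for every generously satisfactory $\mathbf{w}^*$. Then
$$\|\mathbf{w}^* - (\mathbf{w} + \mathbf{x})\|^2 = \|\mathbf{w}^* - \mathbf{w}\|^2 - 2\,(\mathbf{w}^* - \mathbf{w})^T\mathbf{x} + \|\mathbf{x}\|^2 \le \|\mathbf{w}^* - \mathbf{w}\|^2 - \|\mathbf{x}\|^2,$$
since $(\mathbf{w}^* - \mathbf{w})^T\mathbf{x} = {\mathbf{w}^*}^T\mathbf{x} - \mathbf{w}^T\mathbf{x} \ge \|\mathbf{x}\|^2$. The case of a mistake on a target-0 example is symmetric.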
What perceptrons cannot learn
• The adaptive part of a perceptron cannot even tell if two single-bit features have the same value!
Same: $(1,1) \to 1$; $(0,0) \to 1$
Different: $(1,0) \to 0$; $(0,1) \to 0$
• The four feature-output pairs give four inequalities that are impossible to satisfy:
$$w_1 + w_2 \ge \theta, \quad 0 \ge \theta, \quad w_1 < \theta, \quad w_2 < \theta$$
(Adding the last two gives $w_1 + w_2 < 2\theta \le \theta$, contradicting the first.)
[Figure: data space with the four corners $(0,0), (0,1), (1,0), (1,1)$; the positive and negative cases cannot be separated by a plane.]
What can perceptrons do?
$$z = \mathbf{w}^T \mathbf{x} + w_0$$
$$E_n = -\big[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big]$$
Error derivative on training case $n$:
$$\frac{\partial E_n}{\partial y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}$$
Using the chain rule to get the error derivatives
$$z_n = \mathbf{w}^T \mathbf{x}_n + w_0, \qquad \frac{\partial z_n}{\partial \mathbf{w}} = \mathbf{x}_n$$
$$\frac{\partial E_n}{\partial y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}, \qquad \frac{d y_n}{d z_n} = y_n (1 - y_n)$$
Combining the three factors, the $y_n(1 - y_n)$ terms cancel:
$$\frac{\partial E_n}{\partial \mathbf{w}} = \frac{\partial E_n}{\partial y_n}\,\frac{d y_n}{d z_n}\,\frac{\partial z_n}{\partial \mathbf{w}} = (y_n - t_n)\,\mathbf{x}_n$$
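A minimal sketch of gradient descent with this derivative (assumed example; the function name, learning rate, and toy data are mine, not the lecture’s):

```python
import numpy as np

# For a logistic unit with cross-entropy error, dE/dw = (y - t) x.
def logistic_grad_step(w, w0, X, t, lr=0.1):
    """One batch gradient step. X: (n, d); t: (n,) targets in {0, 1}."""
    y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # logistic output y_n
    grad_w = (y - t) @ X / len(X)             # mean of (y_n - t_n) x_n
    grad_w0 = np.mean(y - t)                  # bias gradient
    return w - lr * grad_w, w0 - lr * grad_w0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)     # a linearly separable toy target
w, w0 = np.zeros(2), 0.0
for _ in range(500):
    w, w0 = logistic_grad_step(w, w0, X, t)
print(w, w0)
```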
[Figure: a group of output units $y_1, y_2, y_3$ driven by logits $z_1, z_2, z_3$, each with a target value.]
$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$
The natural cost function is the negative log probability of the right answer:
$$E = -\sum_j t_j \ln y_j$$
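A small numerical sketch of the softmax output and this cost (the logits and one-hot target below are made up):

```python
import numpy as np

# Softmax output y and the cost E = -sum_j t_j ln y_j from the slide above.
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])       # made-up logits z_1, z_2, z_3
t = np.array([0.0, 1.0, 0.0])        # one-hot target value
y = softmax(z)
E = -np.sum(t * np.log(y))           # negative log prob of the right answer
print(y, E)
# For softmax with this cost, dE/dz_i simplifies to y_i - t_i:
print(y - t)
```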
$$y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_0}} = \frac{1}{1 + e^{-(z_1 - z_0)}}$$
where
$$z = \ln \frac{p(C_1)\, p(\mathbf{x} \mid C_1)}{p(C_0)\, p(\mathbf{x} \mid C_0)} = \ln \frac{p(C_1 \mid \mathbf{x})}{1 - p(C_1 \mid \mathbf{x})}$$
$$p(\mathbf{x} \mid C_k) = a \exp\!\Big(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\Big)$$
where $a$ is a normalizing constant.
• For two classes, C1 and C0, the posterior is a logistic:
$$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$$
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$$
$$w_0 = -\tfrac{1}{2}\,\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \tfrac{1}{2}\,\boldsymbol{\mu}_0^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0 + \ln \frac{p(C_1)}{p(C_0)}$$
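A minimal sketch that computes $\mathbf{w}$ and $w_0$ from these formulas and evaluates the logistic posterior; the class means, shared covariance, and priors below are assumed example values:

```python
import numpy as np

# Recover the logistic posterior's parameters from two Gaussian classes
# that share a covariance matrix.
def gaussian_posterior_params(mu1, mu0, Sigma, p1=0.5):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu0 @ Sigma_inv @ mu0
          + np.log(p1 / (1 - p1)))
    return w, w0

mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = gaussian_posterior_params(mu1, mu0, Sigma)
x = np.array([0.5, 0.2])
p = 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # p(C1 | x), a logistic in x
print(w, w0, p)
```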
A picture of the two Gaussian models and
the resulting posterior for the red class
A way of thinking about the role of the inverse covariance matrix
• If the Gaussian is spherical we don’t need to worry about the covariance matrix.
• So we could start by transforming the data space to make the Gaussian spherical.
– This is called “whitening” the data.
– It pre-multiplies by the matrix square root of the inverse covariance matrix.
In the transformed space, $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ gives the same value for $\mathbf{w}^T \mathbf{x}$ as $\mathbf{w}_{\text{aff}}^T \mathbf{x}_{\text{aff}}$, where:
$$\mathbf{w}_{\text{aff}} = \boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\mu}_0 \quad \text{and} \quad \mathbf{x}_{\text{aff}} = \boldsymbol{\Sigma}^{-\frac{1}{2}} \mathbf{x}$$
[Figure: the new, spherical Gaussian after whitening.]
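A minimal sketch of whitening (my own example, not lecture code): compute $\boldsymbol{\Sigma}^{-1/2}$ from an eigendecomposition and verify that $\mathbf{w}^T \mathbf{x} = \mathbf{w}_{\text{aff}}^T \mathbf{x}_{\text{aff}}$:

```python
import numpy as np

# Whitening: pre-multiply by the matrix square root of the inverse covariance.
def whitening_matrix(Sigma):
    vals, vecs = np.linalg.eigh(Sigma)            # Sigma = V diag(vals) V^T
    return vecs @ np.diag(vals ** -0.5) @ vecs.T  # Sigma^{-1/2}

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
W = whitening_matrix(Sigma)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=5000)
X_aff = X @ W.T                                   # x_aff = Sigma^{-1/2} x
print(np.cov(X_aff.T))                            # approximately the identity

# Check the slide's claim: w^T x equals w_aff^T x_aff.
mu1, mu0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = np.linalg.inv(Sigma) @ (mu1 - mu0)
w_aff = W @ mu1 - W @ mu0
print(np.allclose(X @ w, X_aff @ w_aff))          # True
```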