Introduction To Machine Learning Lecture 3: Linear Classification Methods


CSC2515 Fall 2007/8

Introduction to Machine Learning

Lecture 3: Linear Classification Methods

All lecture slides will be available as .ppt, .ps, & .htm at


www.cs.toronto.edu/~hinton

Many of the figures are provided by Chris Bishop from his textbook: "Pattern Recognition and Machine Learning"
What is “linear” classification?
• Classification is intrinsically non-linear
– It puts non-identical things in the same class, so a
difference in the input vector sometimes causes zero
change in the answer (what does this show?)
• “Linear classification” means that the part that adapts is
linear
– The adaptive part is followed by a fixed non-linearity.
– It may also be preceded by a fixed non-linearity (e.g.
nonlinear basis functions).
y(x) = w^T x + w_0,    Decision = f(y(x))

(w^T x + w_0 is the adaptive linear function; f is the fixed non-linear function)
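This adaptive-linear-plus-fixed-nonlinearity structure can be sketched in a few lines. The weights and inputs below are made-up illustrative values; the fixed non-linearity here is a hard threshold at zero.

```python
# Minimal sketch of "linear classification": an adaptive linear function
# y(x) = w^T x + w0 followed by a fixed non-linearity f (a hard threshold).
# The weight values are made up for illustration.

def linear_score(w, w0, x):
    """The adaptive linear part: y(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def decide(w, w0, x):
    """The fixed non-linearity f: threshold the score at zero."""
    return 1 if linear_score(w, w0, x) >= 0 else 0

w, w0 = [2.0, -1.0], -0.5
print(decide(w, w0, [1.0, 0.5]))   # score = 2.0 - 0.5 - 0.5 = 1.0 -> class 1
print(decide(w, w0, [0.0, 1.0]))   # score = -1.0 - 0.5 = -1.5 -> class 0
```

Only `linear_score` would be learned; `decide` stays fixed during training.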
Representing the target values for
classification
• If there are only two classes, we typically use a single real
valued output that has target values of 1 for the “positive”
class and 0 (or sometimes -1) for the other class
– For probabilistic class labels the target value can then
be the probability of the positive class and the output of
the model can also represent the probability the model
gives to the positive class.
• If there are N classes we often use a vector of N target
values containing a single 1 for the correct class and zeros
elsewhere.
– For probabilistic labels we can then use a vector of
class probabilities as the target vector.
Three approaches to classification
• Use discriminant functions directly without probabilities:
– Convert the input vector into one or more real values
so that a simple operation (like thresholding) can be
applied to get the class.
• The real values should be chosen to maximize the useable
information about the class label that is in the real value.
• Infer conditional class probabilities: p(class = C_k | x)
– Compute the conditional probability of each class.
• Then make a decision that minimizes some loss function
• Compare the probability of the input under separate,
class-specific, generative models.
– E.g. fit a multivariate Gaussian to the input vectors of
each class and see which Gaussian makes a test
data vector most probable. (Is this the best bet?)
The planar decision surface in data-space for the simple linear discriminant function:

w^T x + w_0 = 0
Reminder: Three different spaces that are
easy to confuse
• Weight-space
– Each axis corresponds to a weight
– A point is a weight vector
– Dimensionality = #weights +1 extra dimension for the loss
• Data-space
– Each axis corresponds to an input value
– A point is a data vector. A decision surface is a plane.
– Dimensionality = dimensionality of a data vector
• “Case-space” (used in Bishop figure 3.2)
– Each axis corresponds to a training case
– A point assigns a scalar value to every training case
• So it can represent the 1-D targets or it can represent the
value of one input component over all the training data.
– Dimensionality = #training cases
Discriminant functions for N>2 classes

• One possibility is to use N two-way discriminant functions.
– Each function discriminates one class from
the rest.
• Another possibility is to use N(N-1)/2 two-way
discriminant functions
– Each function discriminates between two
particular classes.
• Both these methods have problems
Problems with multi-class discriminant functions

Left figure: more than one good answer. Right figure: two-way preferences need not be transitive!
A simple solution

• Use N discriminant functions, y_i, y_j, y_k, ..., and pick the max.
– This is guaranteed to give consistent and convex decision regions if y is linear:

y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B)
implies (for 0 < α < 1) that
y_k(α x_A + (1 - α) x_B) > y_j(α x_A + (1 - α) x_B)
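The "pick the max" rule is a one-liner over N linear scores. The weight values below are made up for illustration; because each y_k is linear, the region where class k wins is an intersection of half-spaces and hence convex.

```python
# Sketch of the MAX decision rule with N linear discriminants
# y_k(x) = w_k^T x + w0_k. Weights and biases are made-up values.

def argmax_class(weights, biases, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return scores.index(max(scores))

weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.5]
print(argmax_class(weights, biases, [2.0, 1.0]))  # class 0 wins with score 2.0
```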
Using “least squares” for classification

• This is not the right thing to do and it doesn’t work as well as better methods, but it is easy:
– It reduces classification to least squares regression.
– We already know how to do regression. We can just
solve for the optimal weights with some matrix
algebra (see lecture 2).
• We use targets that are equal to the conditional
probability of the class given the input.
– When there are more than two classes, we treat each
class as a separate problem (we cannot get away with this
if we use the “max” decision function).
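Reducing classification to regression looks like this in practice: append a bias column and solve the least squares problem against 0/1 targets. The tiny dataset below is made up for illustration.

```python
import numpy as np

# Sketch of the "least squares for classification" approach: append a bias
# feature of 1, solve least squares for weights mapping inputs to 0/1
# targets, then threshold the real-valued output. Dataset is made up.

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])            # 1 = positive class

Xb = np.hstack([X, np.ones((len(X), 1))])     # add bias column
w, *_ = np.linalg.lstsq(Xb, t, rcond=None)    # least squares solution

pred = (Xb @ w >= 0.5).astype(int)            # threshold the real output
print(pred.tolist())                          # [0, 0, 0, 1]
```

This works here, but as the next slides show, outliers that are "too correct" can drag the boundary in the wrong direction.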
Problems with using least squares for classification

[Figure: decision boundaries found by least squares regression vs. logistic regression.]

If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".
Another example where least squares
regression gives poor decision surfaces
Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the
data down to 1-D.
– So choose the projection that gives the best
separation of the classes. What do we mean by “best
separation”?
• An obvious direction to choose is the direction of the line
joining the class means.
– But if the main direction of variance in each class is
not orthogonal to this line, this will not give good
separation (see the next figure).
• Fisher’s method chooses the direction that maximizes
the ratio of between class variance to within class
variance.
– This is the direction in which the projected points
contain the most information about class membership
(under Gaussian assumptions)
A picture showing the advantage of Fisher’s
linear discriminant.

When projected onto the Fisher chooses a direction that


line joining the class means, makes the projected classes much
the classes are not well tighter, even though their projected
separated. means are less far apart.
Math of Fisher’s linear discriminants

• What linear transformation is best for discrimination?

y = w^T x

• The projection onto the vector separating the class means seems sensible:

w ∝ m_2 - m_1

• But we also want small variance within each class:

s_1^2 = Σ_{n∈C_1} (y_n - m_1)^2
s_2^2 = Σ_{n∈C_2} (y_n - m_2)^2

• Fisher’s objective function is:

J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2)    (between-class / within-class)
More math of Fisher’s linear discriminants

J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)

S_B = (m_2 - m_1)(m_2 - m_1)^T

S_W = Σ_{n∈C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n∈C_2} (x_n - m_2)(x_n - m_2)^T

Optimal solution: w ∝ S_W^{-1} (m_2 - m_1)
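The optimal direction is a single linear solve once the within-class scatter is accumulated. The two small point clouds below are made up for illustration.

```python
import numpy as np

# Sketch of Fisher's linear discriminant: w ∝ S_W^{-1} (m_2 - m_1), where
# S_W is the within-class scatter matrix. The data is made up.

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 1
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = sum(np.outer(x - m1, x - m1) for x in X1) + \
      sum(np.outer(x - m2, x - m2) for x in X2)

w = np.linalg.solve(S_W, m2 - m1)   # direction maximizing J(w)
w /= np.linalg.norm(w)

# Projections of the two classes should be well separated along w:
print(X1 @ w, X2 @ w)
```

Note that only the direction of w matters; its length is arbitrary, so normalizing it is a convention.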
Perceptrons
• “Perceptrons” describes a whole family of learning
machines, but the standard type consisted of a layer of
fixed non-linear basis functions followed by a simple
linear discriminant function.
– They were introduced in the late 1950’s and they had
a simple online learning procedure.
– Grand claims were made about their abilities. This led
to lots of controversy.
– Researchers in symbolic AI emphasized their
limitations (as part of an ideological campaign against
real numbers, probabilities, and learning)
• Support Vector Machines are just perceptrons with a
clever way of choosing the non-adaptive, non-linear
basis functions and a better learning procedure.
– They have all the same limitations as perceptrons in
what types of function they can learn.
• But people seem to have forgotten this.
The perceptron convergence procedure
• Add an extra component with value 1 to each feature
vector. The “bias” weight on this component is minus the
threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that
every training case will keep getting picked
– If the output is correct, leave its weights alone.
– If the output is 0 but should be 1, add the feature
vector to the weight vector.
– If the output is 1 but should be 0, subtract the feature
vector from the weight vector
• This is guaranteed to find a set of weights that gets the
right answer on the whole training set if any such set exists
• There is no need to choose a learning rate.
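The procedure above fits in a dozen lines. The AND-like toy data and the number of sweeps are made up for illustration; convergence is guaranteed because the data is linearly separable.

```python
# Sketch of the perceptron convergence procedure: append a constant-1
# component (so the bias weight replaces the threshold), then repeatedly
# sweep the training cases, adding a misclassified feature vector when the
# output should have been 1 and subtracting it when it should have been 0.

data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]

w = [0.0, 0.0, 0.0]                      # last component is the bias weight
for _ in range(25):                      # a policy that keeps revisiting every case
    for x, target in data:
        phi = x + [1.0]                  # append the extra 1 component
        out = 1 if sum(wi * pi for wi, pi in zip(w, phi)) >= 0 else 0
        if out == 0 and target == 1:     # should fire: add the feature vector
            w = [wi + pi for wi, pi in zip(w, phi)]
        elif out == 1 and target == 0:   # should not fire: subtract it
            w = [wi - pi for wi, pi in zip(w, phi)]

preds = [1 if sum(wi * pi for wi, pi in zip(w, x + [1.0])) >= 0 else 0
         for x, _ in data]
print(preds)  # converges to the AND function: [0, 0, 0, 1]
```

Note there is no learning rate anywhere: the update is always the raw feature vector.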
A natural way to try to prove convergence
• The obvious approach is to write down an error function
and try to show that each step of the learning procedure
reduces the error.
– For stochastic online learning we would like to show
that each step reduces the expected error, where the
expectation is across the choice of training cases.
– It cannot be a squared error because the size of the
update does not depend on the size of the mistake.
• The textbook tries to use the sum of the distances on the
wrong side of the decision surface as an error measure.
– Its conclusion is that the perceptron convergence
procedure is not guaranteed to reduce the total error
at each step.
• This is true for that error function even if there is a set of
weights that gets the right answer for every training case.
Weight and data space

• Imagine a space in which each axis corresponds to a feature value or to the weight on that feature.
– A point in this space is a weight vector. Feature vectors are shown in blue, translated away from the origin to reduce clutter.
• Each training case defines a plane.
– On one side of the plane the output is wrong.
• To get all training cases right we need to find a point on the right side of all the planes.
– This feasible region (if it exists) is a cone with its tip at the origin.

[Figure: a feature vector with correct answer = 0 and one with correct answer = 1, the planes they define through the origin, and the regions of good and bad weights.]
A better way to prove the convergence
(using the convexity of the solutions in weight-space)
• The obvious type of error function measures the
discrepancy between the targets and the model’s
outputs.
• A different type of cost function is to use the
squared distance between the current weights
and a feasible set of weights.
– Using this cost function we can show that every
step of the procedure reduces the error.
• Provided a set of feasible weights exists.
• Using this type of cost function, the procedure can
easily be generalized to more than two classes
using the MAX decision rule.
Why the learning procedure works

• Consider the squared distance between any satisfactory weight vector and the current weight vector.
– Every time the perceptron makes a mistake, the learning algorithm reduces the squared distance between the current weight vector and any satisfactory weight vector (unless it crosses the decision plane).
• So consider "generously satisfactory" weight vectors that lie within the feasible region by a margin at least as great as the largest update.
– Every time the perceptron makes a mistake, the squared distance to all of these weight vectors is decreased by at least the squared length of the smallest update vector.
What perceptrons cannot learn

• The adaptive part of a perceptron cannot even tell if two single-bit features have the same value!

Same: (1,1) → 1; (0,0) → 1
Different: (1,0) → 0; (0,1) → 0

• The four feature-output pairs give four inequalities that are impossible to satisfy:

w_1 + w_2 ≥ θ,   0 ≥ θ
w_1 < θ,   w_2 < θ

[Figure: data space with the four corners (0,0), (0,1), (1,0), (1,1). The positive and negative cases cannot be separated by a plane.]
What can perceptrons do?

• They can only solve tasks if the hand-coded features convert the original task into a linearly separable one. How difficult is this?
• The N-bit parity task :
– Requires N features of the form: Are at least
m bits on?
– Each feature must look at all the components
of the input.
• The 2-D connectedness task
– requires an exponential number of features!
The N-bit even parity task

• There is a simple solution that requires N hidden units.
– Each hidden unit computes whether more than M of the inputs are on.
– This is a linearly separable problem.
• There are many variants of this solution.
– It can be learned.
– It generalizes well if N^2 ≪ 2^N.

[Figure: a network for N = 4. The input 1 0 1 0 feeds four hidden threshold units ">0", ">1", ">2", ">3"; the output unit has a bias of +1 and weights -2, +2, -2, +2 on the hidden units.]
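The construction in the figure can be checked directly. The code below follows the N = 4 wiring (hidden units ">0" through ">N-1", output weights alternating -2, +2 plus a bias of +1), generalized to any N:

```python
# Sketch of the N-bit even parity network: N hidden threshold units compute
# "are more than m bits on?" for m = 0..N-1; the output unit combines them
# with alternating weights -2, +2 and a bias of +1, firing exactly when an
# even number of input bits are on.

def parity_net(bits):
    n = len(bits)
    s = sum(bits)
    hidden = [1 if s > m else 0 for m in range(n)]          # units ">0", ">1", ...
    weights = [-2 if m % 2 == 0 else 2 for m in range(n)]   # -2, +2, -2, +2, ...
    out = sum(w * h for w, h in zip(weights, hidden)) + 1   # +1 is the output bias
    return 1 if out > 0 else 0

print([parity_net([int(b) for b in f"{i:03b}"]) for i in range(8)])
# even parity of all 3-bit strings: [1, 0, 0, 1, 0, 1, 1, 0]
```

With s bits on, exactly the first s hidden units fire, so the output sums an alternating prefix of the weights: 0 when s is even, -2 when s is odd, and the +1 bias tips the threshold accordingly.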
Why connectedness is hard to compute

• Even for simple line drawings, there are exponentially many cases.
• Removing one segment can break
connectedness
– But this depends on the precise
arrangement of the other pieces.
– Unlike parity, there are no simple
summaries of the other pieces that tell
us what will happen.
• Connectedness is easy to compute with an
iterative algorithm.
– Start anywhere in the ink
– Propagate a marker
– See if all the ink gets marked.
Distinguishing T from C in any orientation and position

• What kind of features are required to distinguish two different patterns of 5 pixels independent of position and orientation?
– Do we need to replicate T and C templates across all positions and orientations?
– Looking at pairs of pixels will not work.
– Looking at triples will work if we assume that each input image only contains one object.

[Figure: two 3-pixel feature detectors, replicated in all positions. If any of them equals its threshold of 2, it’s a C. If not, it’s a T.]
Logistic regression (jump to page 205)

• When there are only two classes we can model the conditional probability of the positive class as

p(C_1 | x) = σ(w^T x + w_0),  where σ(z) = 1 / (1 + exp(-z))

• If we use the right error function, something nice happens: the gradient of the logistic and the gradient of the error function cancel each other:

E(w) = -ln p(t | w),    ∇E(w) = Σ_{n=1}^{N} (y_n - t_n) x_n
The logistic function

z = w^T x + w_0,    y = σ(z) = 1 / (1 + e^{-z})

∂z/∂w_i = x_i,    ∂z/∂x_i = w_i

dy/dz = y(1 - y)

• The output is a smooth function of the inputs and the weights.

[Figure: the logistic function rising smoothly from 0 to 1, with y = 0.5 at z = 0.]

It’s odd to express the derivative in terms of y.
The natural error function for the logistic

• To fit a logistic model using maximum likelihood, we need to minimize the negative log probability of the correct answer summed over the training set:

E = -Σ_{n=1}^{N} ln p(t_n | y_n)
  = -Σ_{n=1}^{N} [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]

(the first term matters if t_n = 1, the second if t_n = 0)

• The error derivative on training case n:

∂E_n/∂y_n = -t_n/y_n + (1 - t_n)/(1 - y_n) = (y_n - t_n) / (y_n (1 - y_n))
Using the chain rule to get the error derivatives

z_n = w^T x_n + w_0,    ∂z_n/∂w = x_n

∂E_n/∂y_n = (y_n - t_n) / (y_n (1 - y_n)),    dy_n/dz_n = y_n (1 - y_n)

∂E_n/∂w = (∂E_n/∂y_n) (dy_n/dz_n) (∂z_n/∂w) = (y_n - t_n) x_n
The cross-entropy or "softmax" error function for multi-class classification

The output units use a non-local non-linearity:

y_i = e^{z_i} / Σ_j e^{z_j},    ∂y_i/∂z_i = y_i (1 - y_i)

The natural cost function is the negative log probability of the right answer:

E = -Σ_j t_j ln y_j

The steepness of E exactly balances the flatness of the softmax:

∂E/∂z_i = Σ_j (∂E/∂y_j)(∂y_j/∂z_i) = y_i - t_i

[Figure: output units y_1, y_2, y_3 computed from z_1, z_2, z_3; t_j is the target value.]
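The identity ∂E/∂z_i = y_i - t_i can be checked numerically with a central difference. The logits and target below are made-up illustrative values.

```python
import math

# Numerical check of the softmax/cross-entropy identity: with
# E = -sum_j t_j ln y_j and y = softmax(z), dE/dz_i = y_i - t_i.

def softmax(z):
    m = max(z)                       # subtract max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def cross_entropy(z, t):
    y = softmax(z)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y))

z = [1.0, -0.5, 2.0]
t = [0.0, 0.0, 1.0]

y = softmax(z)
eps = 1e-6
for i in range(3):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    numeric = (cross_entropy(zp, t) - cross_entropy(zm, t)) / (2 * eps)
    print(round(numeric, 4), round(y[i] - t[i], 4))  # the two columns agree
```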
A special case of softmax for two classes

y_1 = e^{z_1} / (e^{z_1} + e^{z_0}) = 1 / (1 + e^{-(z_1 - z_0)})

• So the logistic is just a special case that avoids using redundant parameters:
– Adding the same constant to both z_1 and z_0 has no effect.
– The over-parameterization of the softmax is because the probabilities must add to 1.
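Both claims above are one-line numerical checks; the logit values and the shared constant below are made up for illustration.

```python
import math

# Numerical check that the two-class softmax collapses to the logistic:
# e^{z1} / (e^{z1} + e^{z0}) == 1 / (1 + e^{-(z1 - z0)}), and adding the
# same constant c to both logits changes nothing.

def two_class_softmax(z1, z0):
    return math.exp(z1) / (math.exp(z1) + math.exp(z0))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z1, z0, c = 1.3, -0.4, 5.0
print(abs(two_class_softmax(z1, z0) - logistic(z1 - z0)) < 1e-12)          # True
print(abs(two_class_softmax(z1 + c, z0 + c) - logistic(z1 - z0)) < 1e-12)  # True
```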
Probabilistic Generative Models for Discrimination

• Use a separate generative model of the input vectors for each class, and see which model makes a test input vector most probable.
• The posterior probability of class 1 is given by:

p(C_1 | x) = p(C_1) p(x | C_1) / [ p(C_1) p(x | C_1) + p(C_0) p(x | C_0) ] = 1 / (1 + e^{-z})

where z = ln [ p(C_1) p(x | C_1) / (p(C_0) p(x | C_0)) ] = ln [ p(C_1 | x) / (1 - p(C_1 | x)) ]

• z is called the logit and is given by the log odds.
A simple example for continuous inputs

• Assume that the input vectors for each class are from a Gaussian distribution, and all classes have the same covariance matrix:

p(x | C_k) = a exp( -½ (x - μ_k)^T Σ^{-1} (x - μ_k) )

(a is the normalizing constant; Σ^{-1} is the inverse covariance matrix)

• For two classes, C_1 and C_0, the posterior is a logistic:

p(C_1 | x) = σ(w^T x + w_0)
w = Σ^{-1} (μ_1 - μ_0)
w_0 = -½ μ_1^T Σ^{-1} μ_1 + ½ μ_0^T Σ^{-1} μ_0 + ln [ p(C_1) / p(C_0) ]
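These closed-form weights can be computed directly once the two means, the shared covariance, and the priors are known. The parameter values below are made up for illustration.

```python
import numpy as np

# Sketch of the shared-covariance Gaussian result: the posterior is a
# logistic with w = Sigma^{-1} (mu1 - mu0) and the bias w0 given above.
# Means, covariance, and priors are made-up values.

mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
p1, p0 = 0.5, 0.5                              # equal class priors

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu0)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(p1 / p0)

def posterior_c1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Halfway between the means, with equal priors, the posterior is 0.5:
print(posterior_c1((mu0 + mu1) / 2))
```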
A picture of the two Gaussian models and
the resulting posterior for the red class
A way of thinking about the role of the inverse covariance matrix

• If the Gaussian is spherical we don’t need to worry about the covariance matrix.
• So we could start by transforming the data space to make the Gaussian spherical
– This is called "whitening" the data.
– It pre-multiplies by the matrix square root of the inverse covariance matrix.
• In the transformed space, the weight vector is just the difference between the transformed means:

w = Σ^{-1} (μ_1 - μ_0)  gives the same value for w^T x as

w_aff = Σ^{-1/2} μ_1 - Σ^{-1/2} μ_0  and  x_aff = Σ^{-1/2} x  give for w_aff^T x_aff
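This equivalence is easy to verify numerically; the means, covariance, and test point below are made up for illustration, and Σ^{-1/2} is built from the eigendecomposition of the symmetric covariance.

```python
import numpy as np

# Numerical check of the whitening argument: with x_aff = Sigma^{-1/2} x
# and w_aff the difference of the whitened means, w_aff^T x_aff equals
# w^T x for w = Sigma^{-1} (mu1 - mu0).

mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, 3.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

# Sigma^{-1/2} via the eigendecomposition of the symmetric covariance:
vals, vecs = np.linalg.eigh(Sigma)
Sinv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

w = np.linalg.inv(Sigma) @ (mu1 - mu0)
w_aff = Sinv_half @ mu1 - Sinv_half @ mu0      # difference of whitened means

x = np.array([0.7, -1.2])
x_aff = Sinv_half @ x
print(np.isclose(w @ x, w_aff @ x_aff))        # True
```

The algebra behind the check: w_aff^T x_aff = (μ_1 - μ_0)^T Σ^{-1/2} Σ^{-1/2} x = (μ_1 - μ_0)^T Σ^{-1} x = w^T x.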
The posterior when the covariance matrices are different for different classes

The decision surface is planar when the covariance matrices are the same and quadratic when they are not.
Two ways to train a set of class-specific generative models

• Generative approach: train each model separately to fit the input vectors of that class.
– Different models can be trained on different cores.
– It is easy to add a new class without retraining all the other classes.
– These are significant advantages when the models are harder to train than the simple linear models considered here.
• Discriminative approach: train all of the parameters of both models to maximize the probability of getting the labels right.
An example where the two types of training behave very differently

[Figure: two Gaussians and the decision boundary. What happens to the decision boundary if we add a new red point far to the right?]

For generative fitting, the red mean moves rightwards but the decision boundary moves leftwards! If you really believe it’s Gaussian data, this is sensible.
