
Linear Models for Classification

Sumeet Agarwal, EEL709

(Most figures from Bishop, PRML)


Approaches to classification
Discriminant function: directly assigns each data point x to a particular class Ci
Model the conditional class distribution p(Ci|x): allows separation of inference and decision
Generative approach: model the class likelihoods, p(x|Ci), and priors, p(Ci); use Bayes' theorem to get the posteriors:

p(Ci|x) ∝ p(x|Ci)p(Ci)
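A minimal sketch of this generative recipe, assuming two made-up Gaussian class-conditionals and made-up priors purely for illustration:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and priors (illustrative values only)
likelihoods = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),
               multivariate_normal(mean=[2, 2], cov=np.eye(2))]
priors = np.array([0.7, 0.3])

def posterior(x):
    # p(Ci|x) is proportional to p(x|Ci) * p(Ci); normalise over the classes
    unnorm = np.array([lik.pdf(x) for lik in likelihoods]) * priors
    return unnorm / unnorm.sum()

print(posterior([1.0, 1.0]))   # posterior probabilities p(C1|x), p(C2|x)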
Linear discriminant functions
y(x) = wTx + w0
Multiple Classes
Problem of ambiguous regions
Multiple Classes
Consider a single K-class discriminant, with K linear functions:
yk(x) = wkTx + wk0
And assign x to class Ck if yk(x) > yj(x) for all j ≠ k
Implies singly connected and convex decision regions:
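A small sketch of this decision rule; the weight vectors and biases below are arbitrary illustrative values:

import numpy as np

# K-class linear discriminant: yk(x) = wkT x + wk0
# W holds one weight vector per row; w0 holds the biases (illustrative values)
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 0.3]])
w0 = np.array([0.1, -0.2, 0.0])

def classify(x):
    y = W @ x + w0            # yk(x) for each class k
    return int(np.argmax(y))  # assign to the class with the largest yk(x)

print(classify(np.array([0.5, 1.5])))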
Least squares for classification
Too sensitive to outliers:
Least squares for classification
Problematic due to the evidently non-Gaussian distribution of target values:
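For context, a minimal sketch of the least-squares classifier being criticised here, using one-of-K target coding and the pseudo-inverse solution; the synthetic data is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D (synthetic data for illustration)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
T = np.zeros((100, 2)); T[:50, 0] = 1; T[50:, 1] = 1   # one-hot targets

Xb = np.hstack([np.ones((100, 1)), X])                  # prepend a bias term
W = np.linalg.pinv(Xb) @ T                              # least-squares solution
pred = np.argmax(Xb @ W, axis=1)                        # assign by largest output
print((pred == np.argmax(T, axis=1)).mean())            # training accuracy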
Fisher's linear discriminant
A linear classification model is like a 1-D projection of the data: y = wTx.
Thus we need to find a decision threshold along this 1-D projection (line). The simplest measure is the separation of the projected class means: m2 - m1 = wT(m2 - m1). If the classes have nondiagonal covariances, then a better idea is to use the Fisher criterion:

J(w) = (m2 - m1)^2 / (s1^2 + s2^2)

where s1^2 denotes the variance of class 1 in the 1-D projection.

Maximising J(w) attempts to give a large separation between the projected class means, but also a small variance within each class.
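A sketch of computing the direction that maximises J(w), using the standard closed form w ∝ SW^(-1)(m2 - m1), with SW the within-class scatter matrix; the data here is synthetic:

import numpy as np

rng = np.random.default_rng(1)
# Two synthetic classes with a shared, nondiagonal covariance (for illustration)
cov = np.array([[2.0, 1.5], [1.5, 2.0]])
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([2, 2], cov, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix SW
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# Fisher direction: w proportional to SW^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

print("projected means:", m1 @ w, m2 @ w)
print("projected variances:", (X1 @ w).var(), (X2 @ w).var())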
Fisher's linear discriminant

(Figure: projection onto the line joining the class means vs. projection onto the Fisher discriminant direction)


The Perceptron

(Figure: the perceptron. Basis functions φ1(x), ..., φ4(x) are weighted by w1, ..., w4 and summed; the sum wTφ(x) is passed through a step activation function f(·), which outputs -1 for wTφ(x) < 0 and +1 otherwise.)
A non-linear transformation in the form of a step function is applied to the weighted sum of the input features. This is inspired by the way neurons appear to function, mimicking the action potential.
The perceptron criterion
We'd like a weight vector w such that wTφ(xi) > 0 for xi ∈ C1 (say, ti = 1) and wTφ(xi) < 0 for xi ∈ C2 (ti = -1)
Thus, we want wTφ(xi)ti > 0 ∀i; those data points for which this is not true will be misclassified
The perceptron criterion tries to minimise the 'magnitude' of misclassification, i.e., it tries to minimise -wTφ(xi)ti for all misclassified points (the set of which is denoted by M):

EP(w) = -Σi∈M wTφ(xi)ti

Why not just count the number of misclassified points? Because this is a piecewise constant function of w, and thus the gradient is zero almost everywhere, making optimisation hard
Learning by gradient descent
w(τ+1) = w(τ) - η∇EP(w)
       = w(τ) + ηφ(xi)ti
(if xi is misclassified)
We can show that after this update, the error due to xi will be reduced:

-w(τ+1)Tφ(xi)ti = -w(τ)Tφ(xi)ti - (φ(xi)ti)Tφ(xi)ti
                < -w(τ)Tφ(xi)ti

(having set η = 1, which can be done without loss of generality)
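A compact sketch of this update rule (stochastic updates on misclassified points, with η = 1); here φ(x) is simply the raw input with a bias term, and the linearly separable data is synthetic:

import numpy as np

rng = np.random.default_rng(2)
# Linearly separable synthetic data with labels t in {-1, +1}
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])   # phi(x) = (1, x1, x2): bias + raw inputs

w = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for phi_i, t_i in zip(Phi, t):
        if w @ phi_i * t_i <= 0:          # misclassified point
            w += phi_i * t_i              # w <- w + eta * phi(xi) ti, with eta = 1
            mistakes += 1
    if mistakes == 0:                     # converged: every point classified correctly
        break

print(w, "epochs used:", epoch + 1)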
Perceptron convergence

The perceptron convergence theorem guarantees an exact solution in a finite number of steps for linearly separable data; but there is no convergence for nonseparable data
Gaussian Discriminant Analysis
Generative approach, with the class-conditional densities (likelihoods) modelled as Gaussians

For the case of two classes, we have:

p(C1|x) = σ(a), where a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

(Figure: the logistic sigmoid, σ(a) = 1/(1 + exp(-a)))
Gaussian Discriminant Analysis
In the Gaussian case, we get p(C1|x) = σ(wTx + w0), a logistic sigmoid of a linear function of x

The assumption of equal covariance matrices leads to linear decision boundaries
Gaussian Discriminant Analysis

Allowing for unequal covariance matrices for different classes leads to quadratic decision boundaries
Parameter estimation for GDA
Likelihood:
(assuming equal covariance matrices)

Maximum Likelihood Estimators
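The resulting maximum likelihood estimators are simple: each class prior is its sample fraction, the class means are the sample means, and the shared covariance is the weighted average of the per-class covariances. A sketch of these standard formulas on synthetic data:

import numpy as np

rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], cov, 60),
               rng.multivariate_normal([2, 1], cov, 40)])
t = np.hstack([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

N = len(t)
pi = np.array([(t == k).mean() for k in (0, 1)])           # priors p(Ck)
mu = np.array([X[t == k].mean(axis=0) for k in (0, 1)])    # class means
# Shared covariance: weighted average of the per-class sample covariances
Sigma = sum(((X[t == k] - mu[k]).T @ (X[t == k] - mu[k])) for k in (0, 1)) / N

print(pi, mu, Sigma, sep="\n")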


Logistic Regression
An example of a probabilistic discriminative model
Rather than learning P(x|Ci) and P(Ci), it attempts to directly learn P(Ci|x)
Advantages: fewer parameters, better if the assumptions in the class-conditional density formulation are inaccurate
We have seen how the class posterior for a two-class setting can be written as a logistic sigmoid acting on a linear function of the feature vector φ:

p(C1|φ) = σ(wTφ)

This model is called logistic regression, even though it is a model for classification, not regression!
Parameter learning
If we let yn = p(C1|φn) = σ(wTφn), with targets tn ∈ {0, 1}, then the likelihood function is

p(t|w) = Πn yn^tn (1 - yn)^(1-tn)

and we can define a corresponding error, known as the cross-entropy:

E(w) = -ln p(t|w) = -Σn [tn ln yn + (1 - tn) ln(1 - yn)]
Parameter learning
The derivative of the sigmoid function is given by:

dσ/da = σ(1 - σ)

Using this, we can obtain the gradient of the error function with respect to w:

∇E(w) = Σn (yn - tn)φn

Thus the contribution to the gradient from point n is given by the 'error' between the model prediction and the actual class label, (yn - tn), times the basis function vector for that point, φn
Could use this for sequential learning by gradient descent, exactly as for least-squares linear regression
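A short sketch of batch gradient descent on the cross-entropy, using the gradient Σn (yn - tn)φn derived above; the learning rate, iteration count, and synthetic data are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])       # targets in {0, 1}
Phi = np.hstack([np.ones((100, 1)), X])          # phi(x) = (1, x1, x2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
eta = 0.1                                        # learning rate (assumed)
for _ in range(1000):
    y = sigmoid(Phi @ w)                         # yn = sigma(wT phin)
    grad = Phi.T @ (y - t)                       # gradient: sum_n (yn - tn) phin
    w -= eta * grad

print(w, "training accuracy:", ((sigmoid(Phi @ w) > 0.5) == t).mean())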
Nonlinear basis functions
