
Linear Models for Classification

Sumeet Agarwal, EEL709

(Most figures from Bishop, PRML)


Approaches to classification
Discriminant function: directly assigns each data point x to a particular class Ci
Model the conditional class distribution p(Ci|x): allows separation of inference and decision
Generative approach: model the class likelihoods, p(x|Ci), and priors, p(Ci); use Bayes' theorem to get the posteriors:

p(Ci|x) ∝ p(x|Ci)p(Ci)
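A minimal sketch of this generative recipe, assuming two made-up Gaussian class-conditionals and made-up priors purely for illustration:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and priors (illustrative values only)
likelihoods = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),
               multivariate_normal(mean=[2, 2], cov=np.eye(2))]
priors = np.array([0.7, 0.3])

def posterior(x):
    # p(Ci|x) is proportional to p(x|Ci) * p(Ci); normalise over the classes
    unnorm = np.array([lik.pdf(x) for lik in likelihoods]) * priors
    return unnorm / unnorm.sum()

print(posterior([1.0, 1.0]))   # posterior probabilities p(C1|x), p(C2|x)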
Linear discriminant functions
y(x) = wTx + w0
Multiple Classes
Problem of ambiguous regions
Multiple Classes
Consider a single K-class discriminant, with K linear functions:
yk(x) = wkTx + wk0
And assign x to class Ck if yk(x) > yj(x) for all j ≠ k
Implies singly connected and convex decision regions:
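A small sketch of this decision rule; the weight vectors and biases below are arbitrary illustrative values:

import numpy as np

# K-class linear discriminant: yk(x) = wkT x + wk0
# W holds one weight vector per row; w0 holds the biases (illustrative values)
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 0.3]])
w0 = np.array([0.1, -0.2, 0.0])

def classify(x):
    y = W @ x + w0            # yk(x) for each class k
    return int(np.argmax(y))  # assign to the class with the largest yk(x)

print(classify(np.array([0.5, 1.5])))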
Least squares for classification
Too sensitive to outliers:
Least squares for classification
Problematic due to the evidently non-Gaussian distribution of target values:
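For context, a minimal sketch of the least-squares classifier being criticised here, using one-of-K target coding and the pseudo-inverse solution; the synthetic data is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D (synthetic data for illustration)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
T = np.zeros((100, 2)); T[:50, 0] = 1; T[50:, 1] = 1   # one-hot targets

Xb = np.hstack([np.ones((100, 1)), X])                  # prepend a bias term
W = np.linalg.pinv(Xb) @ T                              # least-squares solution
pred = np.argmax(Xb @ W, axis=1)                        # assign by largest output
print((pred == np.argmax(T, axis=1)).mean())            # training accuracy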
Fisher's linear discriminant
A linear classification model is like a 1-D projection of the data: y = wTx.
Thus we need to find a decision threshold along this 1-D projection (line). The simplest measure is the separation of the projected class means: m2 - m1 = wT(m2 - m1). If the classes have nondiagonal covariances, then a better idea is to use the Fisher criterion:

J(w) = (m2 - m1)^2 / (s1^2 + s2^2)

where s1^2 denotes the variance of class 1 in the 1-D projection.

Maximising J(w) attempts to give a large separation between the projected class means, but also a small variance within each class.
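A sketch of computing the direction that maximises J(w), using the standard closed form w ∝ SW^(-1)(m2 - m1), with SW the within-class scatter matrix; the data here is synthetic:

import numpy as np

rng = np.random.default_rng(1)
# Two synthetic classes with a shared, nondiagonal covariance (for illustration)
cov = np.array([[2.0, 1.5], [1.5, 2.0]])
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([2, 2], cov, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix SW
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# Fisher direction: w proportional to SW^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

print("projected means:", m1 @ w, m2 @ w)
print("projected variances:", (X1 @ w).var(), (X2 @ w).var())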
Fisher's linear discriminant

(Figure: projection onto the line joining the class means vs. projection onto the Fisher discriminant direction)


The Perceptron

(Figure: the perceptron. Basis functions φ1(x), ..., φ4(x) are weighted by w1, ..., w4 and summed; the sum wTφ(x) is passed through a step activation function f(·), which outputs -1 for wTφ(x) < 0 and +1 otherwise.)
A non-linear transformation in the form of a step function is applied to the weighted sum of the input features. This is inspired by the way neurons appear to function, mimicking the action potential.
The perceptron criterion
We'd like a weight vector w such that wTφ(xi) > 0 for xi ∈ C1 (say, ti = 1) and wTφ(xi) < 0 for xi ∈ C2 (ti = -1)
Thus, we want wTφ(xi)ti > 0 ∀i; those data points for which this is not true will be misclassified
The perceptron criterion tries to minimise the 'magnitude' of misclassification, i.e., it tries to minimise -wTφ(xi)ti for all misclassified points (the set of which is denoted by M):

EP(w) = -Σi∈M wTφ(xi)ti

Why not just count the number of misclassified points? Because this is a piecewise constant function of w, and thus the gradient is zero almost everywhere, making optimisation hard
Learning by gradient descent
w(τ+1) = w(τ) - η∇EP(w)
       = w(τ) + ηφ(xi)ti
(if xi is misclassified)
We can show that after this update, the error due to xi will be reduced:

-w(τ+1)Tφ(xi)ti = -w(τ)Tφ(xi)ti - (φ(xi)ti)Tφ(xi)ti
                < -w(τ)Tφ(xi)ti

(having set η = 1, which can be done without loss of generality)
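A compact sketch of this update rule (stochastic updates on misclassified points, with η = 1); here φ(x) is simply the raw input with a bias term, and the linearly separable data is synthetic:

import numpy as np

rng = np.random.default_rng(2)
# Linearly separable synthetic data with labels t in {-1, +1}
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])   # phi(x) = (1, x1, x2): bias + raw inputs

w = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for phi_i, t_i in zip(Phi, t):
        if w @ phi_i * t_i <= 0:          # misclassified point
            w += phi_i * t_i              # w <- w + eta * phi(xi) ti, with eta = 1
            mistakes += 1
    if mistakes == 0:                     # converged: every point classified correctly
        break

print(w, "epochs used:", epoch + 1)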
Perceptron convergence

The perceptron convergence theorem guarantees an exact solution in a finite number of steps for linearly separable data; but there is no convergence for nonseparable data
Gaussian Discriminant Analysis
Generative approach, with the class-conditional densities (likelihoods) modelled as Gaussians

For the case of two classes, we have:

p(C1|x) = σ(a), where a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

(Figure: the logistic sigmoid, σ(a) = 1/(1 + exp(-a)))
Gaussian Discriminant Analysis
In the Gaussian case, we get p(C1|x) = σ(wTx + w0), a logistic sigmoid of a linear function of x

The assumption of equal covariance matrices leads to linear decision boundaries
Gaussian Discriminant Analysis

Allowing for unequal covariance matrices for different classes leads to quadratic decision boundaries
Parameter estimation for GDA
Likelihood:
(assuming equal covariance matrices)

Maximum Likelihood Estimators
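The resulting maximum likelihood estimators are simple: each class prior is its sample fraction, the class means are the sample means, and the shared covariance is the weighted average of the per-class covariances. A sketch of these standard formulas on synthetic data:

import numpy as np

rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], cov, 60),
               rng.multivariate_normal([2, 1], cov, 40)])
t = np.hstack([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

N = len(t)
pi = np.array([(t == k).mean() for k in (0, 1)])           # priors p(Ck)
mu = np.array([X[t == k].mean(axis=0) for k in (0, 1)])    # class means
# Shared covariance: weighted average of the per-class sample covariances
Sigma = sum(((X[t == k] - mu[k]).T @ (X[t == k] - mu[k])) for k in (0, 1)) / N

print(pi, mu, Sigma, sep="\n")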


Logistic Regression
An example of a probabilistic discriminative model
Rather than learning P(x|Ci) and P(Ci), it attempts to directly learn P(Ci|x)
Advantages: fewer parameters, better if the assumptions in the class-conditional density formulation are inaccurate
We have seen how the class posterior for a two-class setting can be written as a logistic sigmoid acting on a linear function of the feature vector φ:

p(C1|φ) = σ(wTφ)

This model is called logistic regression, even though it is a model for classification, not regression!
Parameter learning
If we let yn = p(C1|φn) = σ(wTφn), with targets tn ∈ {0, 1}, then the likelihood function is

p(t|w) = Πn yn^tn (1 - yn)^(1-tn)

and we can define a corresponding error, known as the cross-entropy:

E(w) = -ln p(t|w) = -Σn [tn ln yn + (1 - tn) ln(1 - yn)]
Parameter learning
The derivative of the sigmoid function is given by:

dσ/da = σ(1 - σ)

Using this, we can obtain the gradient of the error function with respect to w:

∇E(w) = Σn (yn - tn)φn

Thus the contribution to the gradient from point n is given by the 'error' between the model prediction and the actual class label, (yn - tn), times the basis function vector for that point, φn
Could use this for sequential learning by gradient descent, exactly as for least-squares linear regression
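A short sketch of batch gradient descent on the cross-entropy, using the gradient Σn (yn - tn)φn derived above; the learning rate, iteration count, and synthetic data are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])       # targets in {0, 1}
Phi = np.hstack([np.ones((100, 1)), X])          # phi(x) = (1, x1, x2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
eta = 0.1                                        # learning rate (assumed)
for _ in range(1000):
    y = sigmoid(Phi @ w)                         # yn = sigma(wT phin)
    grad = Phi.T @ (y - t)                       # gradient: sum_n (yn - tn) phin
    w -= eta * grad

print(w, "training accuracy:", ((sigmoid(Phi @ w) > 0.5) == t).mean())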
Nonlinear basis functions
