Linear Discriminant Analysis: Intelligent Data Analysis and Probabilistic Inference
Lecture 15: Linear Discriminant Analysis
Recommended reading: Bishop, Chapter 4.1; Hastie et al., Chapter 4.3
[Figure: class samples and their projection onto a one-dimensional subspace; horizontal axis $x_1$. Adapted from PRML (Bishop, 2006).]
Orthogonal Projections (Repetition)
Classification as Projection

[Figure: projecting inputs onto a weight vector $w$; the bias/threshold is $w_0$.]
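As a minimal illustration (not from the slides; the function name is my own), classifying by projection amounts to computing the one-dimensional coordinate $y(x) = w^\top x + w_0$ and thresholding it at zero, assuming the standard linear discriminant form:

```python
import numpy as np

def classify_by_projection(x, w, w0):
    """Project x onto the weight vector w, shift by the bias w0,
    and assign class C1 if the 1-D coordinate is positive."""
    y = w @ x + w0
    return 1 if y > 0 else 2  # 1 = class C1, 2 = class C2
```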
- Look at the log-probability ratio: the decision boundary (choose $C_1$ or $C_2$) is where $\log \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = 0$.
- Assume Gaussian likelihoods $p(x \mid C_i) = \mathcal{N}(x \mid m_i, \Sigma)$ with the same covariance matrix $\Sigma$ for both classes.
- By Bayes' rule,
$$\log \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{p(C_1)}{p(C_2)},$$
where the decision boundary (for $C_1$ or $C_2$) is at 0.
- Inserting the Gaussian likelihoods and setting $\log \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = 0$:
$$\Leftrightarrow\quad \log \frac{p(C_1)}{p(C_2)} - \frac{1}{2}\left(m_1^\top \Sigma^{-1} m_1 - m_2^\top \Sigma^{-1} m_2\right) + (m_1 - m_2)^\top \Sigma^{-1} x = 0$$
$$\Leftrightarrow\quad (m_1 - m_2)^\top \Sigma^{-1} x = \frac{1}{2}\left(m_1^\top \Sigma^{-1} m_1 - m_2^\top \Sigma^{-1} m_2\right) - \log \frac{p(C_1)}{p(C_2)}$$
The boundary is linear in $x$, with normal vector $\Sigma^{-1}(m_1 - m_2)$.
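To make the closed form concrete, here is a small NumPy sketch (illustrative; the function name and the assumption that $m_1$, $m_2$, $\Sigma$, and the priors are already estimated are mine):

```python
import numpy as np

def lda_decide(x, m1, m2, Sigma, p1=0.5, p2=0.5):
    """Evaluate the decision rule derived above: assign C1 iff
    (m1 - m2)^T Sigma^{-1} x exceeds the constant right-hand side."""
    Sigma_inv = np.linalg.inv(Sigma)
    lhs = (m1 - m2) @ Sigma_inv @ x
    rhs = 0.5 * (m1 @ Sigma_inv @ m1 - m2 @ Sigma_inv @ m2) - np.log(p1 / p2)
    return 1 if lhs > rhs else 2  # log-ratio > 0 means C1 is more probable
```

Note that only the projection $(m_1 - m_2)^\top \Sigma^{-1} x$ depends on $x$; everything on the right-hand side is a precomputable threshold.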
- Estimate the class means from the training data:
$$m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n\,, \qquad m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n$$
- Measure class separation as the distance of the projected class means:
$$\tilde{m}_2 - \tilde{m}_1 = w^\top (m_2 - m_1)\,, \qquad \tilde{m}_k = w^\top m_k$$
- Measure the within-class scatter of the samples around their class means:
$$S_W = \sum_{k} \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^\top$$
- In the two-class case, the between-class scatter is $S_B = (m_2 - m_1)(m_2 - m_1)^\top$.
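A sketch of how both scatter matrices could be estimated from a data matrix `X` (shape $N \times D$) and integer labels `y` (illustrative; it uses the general multiclass form $S_B = \sum_k N_k (m_k - m)(m_k - m)^\top$, which reduces, up to scale, to the two-class expression above):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class scatter S_W and between-class scatter S_B."""
    m = X.mean(axis=0)                     # overall mean
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(y):
        X_k = X[y == k]                    # samples of class k
        m_k = X_k.mean(axis=0)
        centered = X_k - m_k
        S_W += centered.T @ centered       # sum_n (x_n - m_k)(x_n - m_k)^T
        d = (m_k - m)[:, None]
        S_B += len(X_k) * (d @ d.T)        # N_k (m_k - m)(m_k - m)^T
    return S_W, S_B
```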
Objective

Find $w^\ast$ that maximizes
$$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$$
We find $w$ by setting $dJ/dw = 0$:
$$\frac{dJ}{dw} = 0 \;\Leftrightarrow\; \left(w^\top S_W w\right) S_B w - \left(w^\top S_B w\right) S_W w = 0 \;\Leftrightarrow\; S_B w - J\, S_W w = 0 \;\Leftrightarrow\; S_W^{-1} S_B w - J w = 0$$
Hence $w$ must be an eigenvector of $S_W^{-1} S_B$ with eigenvalue $J(w)$, and the maximizer $w^\ast$ is the eigenvector with the largest eigenvalue.
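Numerically, $w^\ast$ is just the leading eigenvector of $S_W^{-1} S_B$; a short sketch (illustrative, reusing `scatter_matrices` from above):

```python
import numpy as np

def fisher_direction(S_W, S_B):
    """Return the w* maximizing J(w): the eigenvector of
    S_W^{-1} S_B with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    w = eigvecs[:, np.argmax(eigvals.real)].real
    return w / np.linalg.norm(w)  # the scale of w does not change J(w)
```

For two classes, $S_B w$ is always parallel to $m_2 - m_1$, so this reduces to the familiar closed form $w^\ast \propto S_W^{-1}(m_2 - m_1)$.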
In practice, LDA for $k$ classes proceeds as follows:
1. Mean normalization
2. Compute mean vectors $m_i \in \mathbb{R}^D$ for all $k$ classes
3. Compute the scatter matrices $S_W$ and $S_B$
4. Compute the eigenvectors and eigenvalues of $S_W^{-1} S_B$
5. Collect the eigenvectors with the largest eigenvalues (at most $k - 1$) as the columns of a projection matrix $W$
6. Project samples onto the new subspace using $W$ and compute the new coordinates as $Y = XW$ (see the sketch after this list)
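Putting the six steps together, a compact end-to-end sketch (illustrative; it reuses `scatter_matrices` from above, and all names are my own):

```python
import numpy as np

def lda_fit_transform(X, y, n_components):
    """Steps 1-6: mean-normalize, build scatter matrices,
    solve the eigenproblem, and project Y = X W."""
    X = X - X.mean(axis=0)                                       # 1. mean normalization
    S_W, S_B = scatter_matrices(X, y)                            # 2.+3. means and scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)   # 4. eigenproblem
    order = np.argsort(eigvals.real)[::-1]                       # 5. sort by eigenvalue
    W = eigvecs[:, order[:n_components]].real                    #    keep top components
    return X @ W                                                 # 6. Y = X W
```

scikit-learn offers the same projection off the shelf via `sklearn.discriminant_analysis.LinearDiscriminantAnalysis(n_components=...).fit_transform(X, y)`.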
- LDA's most discriminant features are the means of the data distributions.
- LDA will fail when the discriminatory information is not in the mean but in the variance of the data (illustrated below).
- If the data distributions are strongly non-Gaussian, the LDA projections will not preserve the complex structure of the data that may be required for classification.
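The second limitation is easy to demonstrate on synthetic data where both classes share a mean and differ only in spread (illustrative sketch, reusing `scatter_matrices` from above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes with (nearly) identical means but very different variance:
X1 = rng.normal(loc=0.0, scale=0.5, size=(200, 2))   # tight class
X2 = rng.normal(loc=0.0, scale=3.0, size=(200, 2))   # spread-out class
X = np.vstack([X1, X2])
y = np.array([0] * 200 + [1] * 200)

S_W, S_B = scatter_matrices(X, y)
print(S_B)  # approximately the zero matrix: the class means coincide,
            # so J(w) is near 0 for every w and LDA finds no useful direction
```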
References I

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.