Pattern Recognition Presentation
Classification
• Machine Perception
• An Example
• Pattern Recognition Systems
• The Design Cycle
• Learning and Adaptation
• Conclusion
Machine Perception
An Example
(Figure: the two species to be sorted, sea bass versus salmon.)
• Problem Analysis
• Set up a camera and take some sample images to extract
features
• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc…
• This is the set of all suggested features to explore for use in our
classifier!
• Preprocessing
• Classification
• Select the length of the fish as a possible feature for
discrimination
(Figure: the training samples plotted in the lightness–width feature space; a small threshold-classifier sketch follows below.)
Issue of generalization!
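To make the feature idea concrete, here is a minimal sketch of a single-feature threshold classifier in Python; the lightness values and the midpoint threshold are hypothetical, chosen only for illustration, not taken from the slides.

import numpy as np

# Hypothetical lightness measurements for labelled training fish (arbitrary units)
salmon_lightness  = np.array([2.1, 3.0, 2.8, 3.5])
seabass_lightness = np.array([5.2, 6.1, 4.9, 5.8])

# A crude decision threshold: midway between the two class means
threshold = (salmon_lightness.mean() + seabass_lightness.mean()) / 2

def classify(lightness):
    # Decide "salmon" below the threshold, "sea bass" above it
    return "salmon" if lightness < threshold else "sea bass"

print(threshold, classify(2.9), classify(5.6))

Whether such a threshold still works on fish that were not in the training samples is exactly the generalization issue noted above.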
Pattern Recognition Systems
• Sensing
• Use of a transducer (camera or microphone)
• The performance of a PR system depends on the bandwidth, resolution,
sensitivity, and distortion of the transducer
• Feature extraction
• Discriminative features
• Invariant features with respect to translation, rotation and
scale.
• Classification
• Use a feature vector provided by a feature extractor to
assign the object to a category
• Post Processing
• Exploit context: input-dependent information other than the target
pattern itself, used to improve performance (a pipeline sketch of these stages follows below)
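Read as a pipeline, the four stages above could be sketched as follows; every function body here is a placeholder assumption (hypothetical values and a made-up linear rule), not a description of a real system.

import numpy as np

def sense():
    # Sensing: stand-in for a camera/transducer reading (hypothetical values)
    return np.array([4.7, 1.2, 0.8])          # raw measurement summary

def extract_features(raw):
    # Feature extraction: keep the discriminative, roughly invariant quantities
    length, lightness, width = raw
    return np.array([lightness, width])

def classify(features, boundary=np.array([1.0, -1.0]), bias=0.0):
    # Classification: a simple linear decision rule on the feature vector
    return "salmon" if features @ boundary + bias < 0 else "sea bass"

def post_process(label):
    # Post-processing: context (e.g., season, upstream decisions) could
    # overturn a borderline decision; here it simply passes the label through
    return label

print(post_process(classify(extract_features(sense()))))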
The Design Cycle
• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity
• Data Collection
• How do we know when we have collected an adequately
large and representative set of examples for training and
testing the system?
• Feature Choice
• Depends on the characteristics of the problem domain. Features should
be simple to extract, invariant to irrelevant transformations, and
insensitive to noise.
• Model Choice
• Unsatisfied with the performance of our fish classifier and
want to jump to another class of model
• Training
• Use data to determine the classifier. Many different
procedures for training classifiers and choosing models
• Evaluation
• Measure the error rate (or performance) and switch from
one set of features to another
• Computational Complexity
• What is the trade-off between computational ease and
performance?
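A minimal sketch of the training and evaluation steps of the cycle, assuming hypothetical 1-D lightness features and a simple nearest-class-mean classifier (an illustrative choice, not a method prescribed by the slides):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D lightness features for two classes
train_x = np.concatenate([rng.normal(3, 0.5, 50), rng.normal(6, 0.5, 50)])
train_y = np.array([0]*50 + [1]*50)          # 0 = salmon, 1 = sea bass
test_x  = np.concatenate([rng.normal(3, 0.5, 20), rng.normal(6, 0.5, 20)])
test_y  = np.array([0]*20 + [1]*20)

# Training: estimate one mean per class from the labelled data
means = np.array([train_x[train_y == c].mean() for c in (0, 1)])

# Evaluation: assign each test point to the nearest class mean, measure the error rate
pred = np.argmin(np.abs(test_x[:, None] - means[None, :]), axis=1)
error_rate = np.mean(pred != test_y)
print(f"error rate: {error_rate:.2%}")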
Learning and Adaptation
• Supervised learning
• A teacher provides a category label or cost for each
pattern in the training set
• Unsupervised learning
• The system forms clusters or “natural groupings” of the
input patterns
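The distinction can be shown on the same kind of hypothetical data: the supervised estimate uses the labels, while a tiny 2-means clustering loop forms "natural groupings" from the inputs alone. This is only an illustrative sketch.

import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(3, 0.5, 30), rng.normal(6, 0.5, 30)])
labels = np.array([0]*30 + [1]*30)           # available only in the supervised case

# Supervised: class means come directly from the labels
supervised_means = [x[labels == c].mean() for c in (0, 1)]

# Unsupervised: a small 2-means clustering finds groupings without labels
centers = np.array([x.min(), x.max()])
for _ in range(10):
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[assign == k].mean() for k in (0, 1)])

print(supervised_means, centers)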
Conclusion
• Introduction
• Bayesian Decision Theory–Continuous Features
Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)
Therefore:
whenever we observe a particular x, the probability of
error is :
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1
Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)
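A small worked illustration (the class-conditional values below are assumptions, not from the slides): Bayes' rule gives the posteriors, the decision takes the larger one, and P(error | x) is the smaller one.

# Assumed class-conditional densities evaluated at a particular x
p_x_given_w1, p_x_given_w2 = 0.30, 0.10
P_w1, P_w2 = 0.5, 0.5                        # uniform priors, as in the slides

evidence = p_x_given_w1 * P_w1 + p_x_given_w2 * P_w2
post1 = p_x_given_w1 * P_w1 / evidence       # P(w1 | x)
post2 = p_x_given_w2 * P_w2 / evidence       # P(w2 | x)

decision = "w1" if post1 > post2 else "w2"
p_error = min(post1, post2)                  # P(error | x) under the Bayes rule
print(decision, p_error)                     # -> w1, 0.25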
Conditional risk
R(αi | x) = ∑ λ(αi | ωj) P(ωj | x)   (sum over j = 1, …, c),   for i = 1, …, a
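A sketch of evaluating this sum for each action, given an assumed loss matrix and assumed posteriors (values chosen only for illustration):

import numpy as np

loss = np.array([[0.0, 1.0],        # lambda(alpha_1 | w_1), lambda(alpha_1 | w_2)
                 [2.0, 0.0]])       # lambda(alpha_2 | w_1), lambda(alpha_2 | w_2)
posteriors = np.array([0.7, 0.3])   # P(w_1 | x), P(w_2 | x), assumed values

# R(alpha_i | x) = sum_j lambda(alpha_i | w_j) * P(w_j | x)
risks = loss @ posteriors
best_action = np.argmin(risks)      # Bayes decision: take the minimum-risk action
print(risks, best_action)           # -> [0.3 1.4], action alpha_1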
• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Likelihood ratio: decide ω1 if
P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
P(ω1) = 2/3
P(ω2) = 1/3
λ = [1 2; 3 4]   (λ11 = 1, λ12 = 2, λ21 = 3, λ22 = 4)
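As a quick check of these numbers (not part of the original slide text), the threshold from the likelihood-ratio rule can be computed directly:

# lambda_ij = loss for deciding w_i when the true state of nature is w_j
l11, l12, l21, l22 = 1.0, 2.0, 3.0, 4.0
P_w1, P_w2 = 2/3, 1/3

theta = (l12 - l22) / (l21 - l11) * (P_w2 / P_w1)
print(theta)   # -> -0.5: since a likelihood ratio is never negative,
               #    these losses and priors always favor deciding w1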
Pattern Classification
• Minimum-Error-Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density
Minimum-Error-Rate Classification
R(αi | x) = ∑ P(ωj | x)   (sum over j ≠ i)   = 1 − P(ωi | x)
Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]; then decide ω1 if:
P(x | ω1) / P(x | ω2) > θλ
• If λ is the zero-one loss function, which means
λ = [0 1; 1 0],
then θλ = P(ω2) / P(ω1) = θa;
if λ = [0 2; 1 0], then θλ = 2 P(ω2) / P(ω1) = θb
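For example, with the priors used earlier (P(ω1) = 2/3, P(ω2) = 1/3), these thresholds work out to θa = 1/2 and θb = 1.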
g(x) = P(ω1 | x) − P(ω2 | x)
or, equivalently (same sign, hence the same decision rule):
g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
P(x) = [1 / (√(2π) σ)] exp[ −½ ((x − μ) / σ)² ],
where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
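A direct transcription of this density into code (a sketch; the test values are arbitrary):

import math

def normal_pdf(x, mu, sigma):
    # P(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard normal at its mean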
• Multivariate density
• Multivariate normal density in d dimensions is:
P(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −½ (x − μ)ᵗ Σ⁻¹ (x − μ) ]
where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
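The multivariate form transcribes the same way; this sketch evaluates it with numpy on an arbitrary 2-D example:

import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    # P(x) = exp(-0.5*(x-mu)^T Sigma^{-1} (x-mu)) / ((2*pi)^(d/2) * |Sigma|^(1/2))
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

x  = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
print(multivariate_normal_pdf(x, mu, Sigma))   # ~0.1592 = 1/(2*pi)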
The decision boundary satisfies gi(x) = gj(x).
If P(ωi) = P(ωj), then x0 = ½ (μi + μj); in general:
x0 = ½ (μi + μj) − [ ln(P(ωi) / P(ωj)) / ((μi − μj)ᵗ Σ⁻¹ (μi − μj)) ] · (μi − μj)
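A sketch that evaluates x0 for two classes sharing one covariance matrix; the means, covariance, and priors below are assumptions chosen only for illustration:

import numpy as np

mu_i, mu_j = np.array([2.0, 0.0]), np.array([0.0, 0.0])
Sigma = np.eye(2)                      # shared covariance (assumed)
P_i, P_j = 0.6, 0.4                    # assumed priors

diff = mu_i - mu_j
shift = np.log(P_i / P_j) / (diff @ np.linalg.inv(Sigma) @ diff)
x0 = 0.5 * (mu_i + mu_j) - shift * diff
print(x0)   # ~[0.80, 0.00]: the boundary point shifts toward the less probable class' mean

With equal priors the boundary would sit at the midpoint [1, 0]; the unequal priors push it toward μj, giving more of the feature space to the more probable class ωi.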
• Case Σi = arbitrary
• The covariance matrices are different for each category
gi(x) = xᵗ Wi x + wiᵗ x + wi0
where:
Wi = −½ Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −½ μiᵗ Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi)
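A sketch of this quadratic discriminant for the arbitrary-covariance case, comparing two classes with assumed parameters:

import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    # g_i(x) = x^T W_i x + w_i^T x + w_i0, with W_i, w_i, w_i0 as defined above
    Sigma_inv = np.linalg.inv(Sigma)
    W  = -0.5 * Sigma_inv
    w  = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 1.0])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([2.0, 2.0]), 2 * np.eye(2), 0.5)
print("decide w1" if g1 > g2 else "decide w2")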
For d independent binary features xi, with pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2), the discriminant is linear, g(x) = ∑ wi xi + w0,
where:
wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],   i = 1, …, d
and:
w0 = ∑ ln [ (1 − pi) / (1 − qi) ]   (sum over i = 1, …, d)   + ln [ P(ω1) / P(ω2) ]
decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
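A sketch of this linear discriminant for independent binary features; the per-feature probabilities pi, qi and the priors are assumed values chosen for illustration:

import numpy as np

# Assumed per-feature probabilities: p_i = P(x_i = 1 | w1), q_i = P(x_i = 1 | w2)
p = np.array([0.8, 0.7, 0.6])
q = np.array([0.3, 0.4, 0.5])
P_w1, P_w2 = 0.5, 0.5

w  = np.log(p * (1 - q) / (q * (1 - p)))                    # feature weights
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P_w1 / P_w2)

x = np.array([1, 0, 1])                                     # a binary feature vector
g = w @ x + w0
print("decide w1" if g > 0 else "decide w2")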