
Pattern Recognition Presentation


Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 1: Introduction to Pattern
Recognition (Sections 1.1-1.6)

• Machine Perception
• An Example
• Pattern Recognition Systems
• The Design Cycle
• Learning and Adaptation
• Conclusion

Machine Perception

• Build a machine that can recognize patterns:


• Speech recognition
• Fingerprint identification
• OCR (Optical Character Recognition)
• DNA sequence identification


An Example

• “Sorting incoming fish on a conveyor according to species using optical sensing”

[Figure: the two species to be sorted are sea bass and salmon]


• Problem Analysis
• Set up a camera and take some sample images to extract
features

• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc…
• This is the set of all suggested features to explore for use in our
classifier!


• Preprocessing

• Use a segmentation operation to isolate individual fish from one another and from the background

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features

• The features are passed to a classifier


• Classification
• Select the length of the fish as a possible feature for
discrimination


The length is a poor feature alone!

Select the lightness as a possible feature.


• Threshold decision boundary and cost relationship


• Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)

Task of decision theory


• Adopt the lightness and add the width of the fish


Fish: xT = [x1, x2]  (x1 = lightness, x2 = width)


• We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such “noisy features”

• Ideally, the best decision boundary should be the one which provides optimal performance, such as in the following figure:


• However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input

Issue of generalization!


Pattern Recognition Systems

• Sensing
• Use of a transducer (camera or microphone)
• The PR system depends on the bandwidth, resolution, sensitivity and distortion of the transducer

• Segmentation and grouping


• Patterns should be well separated and should not overlap


• Feature extraction
• Discriminative features
• Invariant features with respect to translation, rotation and
scale.

• Classification
• Use a feature vector provided by a feature extractor to
assign the object to a category

• Post Processing
• Exploit context (input-dependent information other than from the target pattern itself) to improve performance


The Design Cycle

• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity


• Data Collection
• How do we know when we have collected an adequately
large and representative set of examples for training and
testing the system?


• Feature Choice
• Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.


• Model Choice
• When we are unsatisfied with the performance of our fish classifier, we may want to jump to another class of model


• Training
• Use data to determine the classifier. Many different
procedures for training classifiers and choosing models


• Evaluation
• Measure the error rate (or performance) and switch from one set of features to another


• Computational Complexity
• What is the trade-off between computational ease and
performance?

• (How does an algorithm scale as a function of the number of features, patterns or categories?)


Learning and Adaptation

• Supervised learning
• A teacher provides a category label or cost for each
pattern in the training set

• Unsupervised learning
• The system forms clusters or “natural groupings” of the
input patterns


Conclusion

• The reader may seem overwhelmed by the number, complexity and magnitude of the sub-problems of pattern recognition

• Many of these sub-problems can indeed be solved


• Many fascinating unsolved problems still remain



Chapter 2 (Part 1):
Bayesian Decision Theory
(Sections 2.1-2.2)

• Introduction
• Bayesian Decision Theory–Continuous Features

Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)

• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)


• Decision rule with only the prior information


• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Use of the class-conditional information

• P(x | ω1) and P(x | ω2) describe the difference in lightness between the populations of sea bass and salmon


• Posterior, likelihood, evidence


• P(ωj | x) = P(x | ωj) P(ωj) / P(x)
• Where, in the case of two categories:

$$P(x) = \sum_{j=1}^{2} P(x \mid \omega_j)\, P(\omega_j)$$

• Posterior = (Likelihood × Prior) / Evidence
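As a small illustration (not from the slides; the likelihood and prior values are made up), a Python sketch of this rule:

```python
# Minimal sketch: posteriors from likelihoods and priors via Bayes' rule.
# The likelihood values and priors below are illustrative, not from the slides.

def posteriors(likelihoods, priors):
    """Return P(w_j | x) given P(x | w_j) and P(w_j) for each class j."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # P(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Example: P(x | w1) = 0.6, P(x | w2) = 0.2, with uniform priors.
print(posteriors([0.6, 0.2], [0.5, 0.5]))  # -> [0.75, 0.25]
```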



• Decision given the posterior probabilities

x is an observation for which:

if P(ω1 | x) > P(ω2 | x), the true state of nature = ω1
if P(ω1 | x) < P(ω2 | x), the true state of nature = ω2

Therefore:
whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

• Minimizing the probability of error


• Decide ω1 if P(ω1 | x) > P(ω2 | x);
otherwise decide ω2

Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)

Bayesian Decision Theory –
Continuous Features

• Generalization of the preceding ideas


• Use of more than one feature
• Use of more than two states of nature
• Allowing actions, and not only deciding on the state of nature
• Introducing a loss function which is more general than the probability of error


• Allowing actions other than classification primarily allows the possibility of rejection

• Refusing to make a decision in close or bad cases!


• The loss function states how costly each action
taken is


Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)

Let {α1, α2, …, αa} be the set of possible actions

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

Overall risk
R = Sum of all R(αi | x) for i = 1,…,a

Conditional risk:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x), \qquad i = 1, \ldots, a$$

Minimizing R ⇔ minimizing R(αi | x) for i = 1, …, a

Select the action αi for which R(αi | x) is minimum

The overall risk R is then minimized, and R in this case is called the Bayes risk = the best performance that can be achieved!
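A minimal Python sketch of this selection rule, using an assumed loss matrix and assumed posteriors:

```python
# Sketch: pick the action minimizing the conditional risk R(alpha_i | x).
# Loss matrix and posteriors are illustrative values, not from the slides.

def conditional_risks(loss, posteriors):
    """loss[i][j] = lambda(alpha_i | w_j); returns R(alpha_i | x) for each i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

loss = [[0.0, 1.0],   # action a1: decide w1
        [2.0, 0.0]]   # action a2: decide w2 (deciding w2 when truth is w1 costs more)
post = [0.3, 0.7]     # P(w1 | x), P(w2 | x)

risks = conditional_risks(loss, post)
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> take action", best + 1)   # R(a1|x)=0.7, R(a2|x)=0.6 -> action 2
```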


• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11P(ω1 | x) + λ12P(ω2 | x)


R(α2 | x) = λ21P(ω1 | x) + λ22P(ω2 | x)


Our rule is the following:


if R(α1 | x) < R(α2 | x)
action α1: “decide ω1” is taken

This results in the equivalent rule:

decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise



Likelihood ratio:

The preceding rule is equivalent to the following rule:

$$\text{if } \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

Then take action α1 (decide ω1)


Otherwise take action α2 (decide ω2)
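A short Python sketch of this likelihood-ratio test; the losses, priors and likelihood values are illustrative assumptions:

```python
# Sketch: two-category decision via the likelihood-ratio test.
# All numeric values are illustrative assumptions.

def decide(px_w1, px_w2, prior1, prior2, lam):
    """lam[i][j] = lambda(alpha_i | w_j); returns 1 or 2 (the chosen class)."""
    threshold = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * (prior2 / prior1)
    return 1 if px_w1 / px_w2 > threshold else 2

lam = [[0.0, 1.0],
       [1.0, 0.0]]                      # zero-one loss
print(decide(0.5, 0.2, 0.4, 0.6, lam))  # ratio 2.5 > threshold 1.5 -> decide w1
```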


Optimal decision property

“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”

Exercise

Select the optimal decision where:


Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)

P(ω1) = 2/3
P(ω2) = 1/3

$$\lambda = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
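One way to work the exercise numerically is sketched below (not an official solution). It assumes the second parameter of N(·, ·) is the variance and uses scipy.stats.norm, which takes a standard deviation:

```python
# Sketch for the exercise: minimum-risk decisions on a grid of x values.
# Assumes N(mean, variance) notation; scipy's norm takes the standard deviation.
import numpy as np
from scipy.stats import norm

priors = np.array([2/3, 1/3])
lam = np.array([[1.0, 2.0],
                [3.0, 4.0]])            # lam[i, j] = loss of action a_{i+1} given w_{j+1}

def risks(x):
    like = np.array([norm.pdf(x, loc=2.0, scale=np.sqrt(0.5)),
                     norm.pdf(x, loc=1.5, scale=np.sqrt(0.2))])
    post = like * priors
    post /= post.sum()                  # P(w_j | x)
    return lam @ post                   # R(a_i | x) for i = 1, 2

for x in np.linspace(0.0, 4.0, 9):
    r = risks(x)
    print(f"x = {x:.1f}  R(a1|x) = {r[0]:.2f}  R(a2|x) = {r[1]:.2f}  -> a{r.argmin() + 1}")
```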


Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 2.3-2.5)

• Minimum-Error-Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density

Minimum-Error-Rate Classification

• Actions are decisions on classes


If action αi is taken and the true state of nature is ωj then:
the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error, which is the error rate


• Introduction of the zero-one loss function:


$$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

Therefore, the conditional risk is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

“The risk corresponding to this loss function is the average probability of error”

• Minimizing the risk requires maximizing P(ωi | x)

(since R(αi | x) = 1 − P(ωi | x))

• For minimum error rate:

• Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i


• Regions of decision and zero-one loss function, therefore:

$$\text{Let } \theta_\lambda = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}, \text{ then decide } \omega_1 \text{ if: } \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \theta_\lambda$$

• If λ is the zero-one loss function, which means:

$$\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{then } \theta_\lambda = \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$

$$\text{if } \lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix} \quad \text{then } \theta_\lambda = \frac{2\,P(\omega_2)}{P(\omega_1)} = \theta_b$$
Classifiers, Discriminant Functions
and Decision Surfaces

• The multi-category case


• Set of discriminant functions gi(x), i = 1,…, c
• The classifier assigns a feature vector x to class ωi
if:
gi(x) > gj(x) ∀j ≠ i

• Let gi(x) = - R(αi | x)
(max. discriminant corresponds to min. risk!)

• For the minimum error rate, we take


gi(x) = P(ωi | x)

(max. discrimination corresponds to max. posterior!)


gi(x) ≡ P(x | ωi) P(ωi)

gi(x) = ln P(x | ωi) + ln P(ωi)


(ln: natural logarithm!)
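A minimal sketch of this discriminant-function view, assuming made-up Gaussian class-conditionals and priors:

```python
# Sketch: classify by the largest discriminant g_i(x) = ln p(x|w_i) + ln P(w_i).
# Gaussian class-conditionals with made-up parameters are used for illustration.
import math

classes = [
    {"mean": 2.0, "var": 0.5, "prior": 0.6},   # w1 (illustrative)
    {"mean": 4.0, "var": 1.0, "prior": 0.4},   # w2 (illustrative)
]

def g(x, c):
    log_like = -0.5 * math.log(2 * math.pi * c["var"]) - (x - c["mean"])**2 / (2 * c["var"])
    return log_like + math.log(c["prior"])

x = 3.1
scores = [g(x, c) for c in classes]
print("decide w%d" % (scores.index(max(scores)) + 1))
```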


• Feature space divided into c decision regions


if gi(x) > gj(x) ∀j ≠ i then x is in Ri
(Ri means assign x to ωi)

• The two-category case


• A classifier is a “dichotomizer” that has two discriminant
functions g1 and g2

Let g(x) ≡ g1(x) – g2(x)

Decide ω1 if g(x) > 0; otherwise decide ω2


• The computation of g(x)

$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$

or, equivalently in terms of the decisions it produces,

$$g(x) = \ln \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

The Normal Density
• Univariate density
• Density which is analytically tractable
• Continuous density
• Many processes are asymptotically Gaussian
• Handwritten characters and speech sounds can be viewed as an ideal or prototype pattern corrupted by a random process (central limit theorem)

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right],$$

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
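A direct Python transcription of this density (a sketch; the parameter values in the example call are illustrative):

```python
# Sketch: univariate normal density p(x) for given mean mu and variance sigma^2.
import math

def normal_pdf(x, mu, sigma):
    """sigma is the standard deviation; sigma**2 is the variance."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989, the standard normal at its mean
```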


• Multivariate density
• Multivariate normal density in d dimensions is:
$$P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]$$

where:
x = (x1, x2, …, xd)t (t stands for the transpose)
μ = (μ1, μ2, …, μd)t is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
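The same density in d dimensions as a short numpy sketch; the mean vector and covariance matrix below are assumed for illustration:

```python
# Sketch: multivariate normal density in d dimensions.
import numpy as np

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])           # illustrative 2x2 covariance
print(mvn_pdf(np.array([0.5, -0.5]), mu, sigma))
```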

Chapter 2 (Part 3):
Bayesian Decision Theory
(Sections 2.6, 2.9)

• Discriminant Functions for the Normal Density


• Bayes Decision Theory – Discrete Features
Discriminant Functions for the
Normal Density
• We saw that the minimum error-rate
classification can be achieved by the
discriminant function

gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of multivariate normal


$$g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$


• Case Σi = σ²I (I stands for the identity matrix)

$$g_i(x) = w_i^t x + w_{i0} \quad \text{(linear discriminant function)}$$

where:

$$w_i = \frac{\mu_i}{\sigma^2}; \qquad w_{i0} = -\frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln P(\omega_i)$$
σ 2σ 2

( ω i 0 is called the threshold for the ith category! )


• A classifier that uses linear discriminant functions is called a “linear machine”

• The decision surfaces for a linear machine are pieces of hyperplanes defined by:

gi(x) = gj(x)


• The hyperplane separating Ri and Rj


$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\mu_i - \mu_j)$$

is always orthogonal to the line linking the means!

$$\text{if } P(\omega_i) = P(\omega_j) \text{ then } x_0 = \frac{1}{2}(\mu_i + \mu_j)$$
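A small sketch of this boundary point (illustrative means, σ² and priors); with equal priors the log term vanishes and x0 is simply the midpoint of the means:

```python
# Sketch: the point x0 through which the separating hyperplane passes (Sigma_i = sigma^2 I).
import numpy as np

def x0(mu_i, mu_j, sigma2, p_i, p_j):
    diff = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / p_j) * diff

mu_i, mu_j = np.array([1.0, 1.0]), np.array([3.0, 2.0])   # illustrative means
print(x0(mu_i, mu_j, sigma2=1.0, p_i=0.5, p_j=0.5))       # equal priors -> midpoint [2.0, 1.5]
print(x0(mu_i, mu_j, sigma2=1.0, p_i=0.7, p_j=0.3))       # shifted away from the more probable class
```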


• Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)

• Hyperplane separating Ri and Rj:

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[ P(\omega_i) / P(\omega_j) \right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)} \cdot (\mu_i - \mu_j)$$

(the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
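A sketch of the same boundary point for a shared covariance Σ (all values assumed); the only change from the previous case is the (μi − μj)ᵗ Σ⁻¹ (μi − μj) term in the denominator:

```python
# Sketch: x0 for the case Sigma_i = Sigma (shared, arbitrary covariance).
import numpy as np

def x0_shared(mu_i, mu_j, sigma, p_i, p_j):
    diff = mu_i - mu_j
    denom = diff @ np.linalg.inv(sigma) @ diff
    return 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / denom * diff

mu_i, mu_j = np.array([1.0, 1.0]), np.array([3.0, 2.0])    # illustrative
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                              # illustrative shared covariance
print(x0_shared(mu_i, mu_j, sigma, p_i=0.6, p_j=0.4))
```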


• Case Σi = arbitrary
• The covariance matrices are different for each category

$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$

where:

$$W_i = -\frac{1}{2} \Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2} \mu_i^t \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$

(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids)
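A sketch of this general case with assumed parameters: each class gets its own Wi, wi and wi0, and gi(x) is quadratic in x:

```python
# Sketch: quadratic discriminants g_i(x) = x^t W_i x + w_i^t x + w_i0 for arbitrary Sigma_i.
import numpy as np

def quadratic_params(mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = -0.5 * mu @ sigma_inv @ mu - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior)
    return W, w, w0

classes = [  # (mean, covariance, prior) -- illustrative values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]

x = np.array([1.0, 1.5])
scores = [x @ W @ x + w @ x + w0
          for W, w, w0 in (quadratic_params(*c) for c in classes)]
print("decide w%d" % (int(np.argmax(scores)) + 1))
```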

Bayes Decision Theory – Discrete Features

• Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm

• Case of independent binary features in the 2-category problem
Let x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with
probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)


• The discriminant function in this case is:


$$g(x) = \sum_{i=1}^{d} w_i x_i + w_0$$

where:

$$w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \quad i = 1, \ldots, d$$

and:

$$w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
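A sketch of this discriminant for independent binary features; the pi, qi and priors below are made-up numbers:

```python
# Sketch: two-category discriminant for d independent binary features.
import math

p = [0.8, 0.6, 0.3]        # p_i = P(x_i = 1 | w1), illustrative
q = [0.4, 0.5, 0.7]        # q_i = P(x_i = 1 | w2), illustrative
prior1, prior2 = 0.5, 0.5

w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(prior1 / prior2)

def g(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

x = [1, 0, 1]              # an observed binary feature vector
print("decide w1" if g(x) > 0 else "decide w2")
```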