CpE646 8v3 PDF

CpE 646 Pattern Recognition and Classification

Prof. Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology

Non-Parametric Classification
Chapter 4 (Sections 4.4-4.6): kn-Nearest-Neighbor Estimation, The Nearest-Neighbor Rule

kn-Nearest Neighbor Estimation
A solution to the problem of the unknown best window function. To estimate $p(x)$ from $n$ training samples or prototypes:
Let the cell volume be a function of the number of training data.
Center a cell about $x$ and let it grow until it captures $k_n$ samples ($k_n = f(n)$).
The included samples are called the $k_n$ nearest neighbors of $x$.
The density is given as
$$p_n(x) = \frac{k_n / n}{V_n} \qquad (30)$$
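As a rough illustration of equation (30), here is a minimal one-dimensional sketch. It assumes the cell is the interval centered at $x$ whose half-width is the distance to the $k$-th nearest sample, so that $V_n$ is twice that distance. The function name `knn_density_1d` and the sample values are hypothetical and not part of the course material.

```python
def knn_density_1d(x, samples, k):
    """k-nearest-neighbor density estimate p_n(x) = (k / n) / V_n in one dimension."""
    n = len(samples)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # The distance from x to its k-th nearest sample sets the cell half-width.
    radius = sorted(abs(x - s) for s in samples)[k - 1]
    volume = 2.0 * radius  # length of the interval [x - radius, x + radius]
    if volume == 0.0:
        return float("inf")  # x coincides with at least k samples
    return (k / n) / volume


samples = [0.0, 1.0, 2.0, 4.0]  # illustrative data (assumed)
estimate = knn_density_1d(0.5, samples, k=2)
print(estimate)
```

Because the cell always grows until it holds $k$ samples, the returned value is never zero, whatever $x$ is queried.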
Two possibilities can occur:
If the density is high near $x$, the cell will be small, which provides good resolution.
If the density is low around $x$, the cell will grow large and stop only when it reaches regions of higher density.
It can be proven that
$$\lim_{n \to \infty} k_n = \infty \quad \text{and} \quad \lim_{n \to \infty} k_n / n = 0$$
are necessary and sufficient for $p_n(x)$ to converge to $p(x)$ at all points where $p(x)$ is continuous.
If $k_n = \sqrt{n}$ and $p_n(x)$ is a good estimate of $p(x)$, i.e. $p_1(x) \approx p_n(x) \approx p(x)$, then from (30) we have
$$V_n \approx \frac{1}{\sqrt{n}\, p(x)}, \quad \text{and then} \quad V_n = \frac{V_1}{\sqrt{n}}$$
This becomes similar to the Parzen-window approach, except that $V_1$ is determined by the data rather than chosen arbitrarily.
Peaks are at the middle of regions with k prototypes

$k_n$ nearest-neighbor illustration
For $k_n = \sqrt{n}$, when $n = 1$, $k_1 = 1$, and the estimate is
$$p_n(x) = \frac{1}{2\,|x - x_1|}$$
where $x_1$ is the first training sample and $|\cdot|$ is a distance measure; $2|x - x_1|$ is the volume. This is not a good estimate (Figure 4.12).
As $n$ increases, the estimate gets better.
This method will not generate zero $p(x)$ for any $x$. (If a fixed window, e.g. a Parzen window, is used and no sample falls inside it, the density estimate for that window will be zero. This will not happen here.)
We can obtain a family of estimates by making $k_n = k_1 \sqrt{n}$ and adjusting the value of $k_1$.
As with the Parzen-window method, the choice of $k_1$ is case dependent.
Usually $k_1$ is selected so that, when the estimated density is used to classify new test samples from the same density, it yields the lowest error rate.
Estimation of a posteriori probabilities. We can estimate $P(\omega_i \mid x)$ from a set of $n$ labeled samples using the window methods.
We place a cell of volume $V$ around $x$ and capture $k$ samples.
If $k_i$ samples among these $k$ samples turn out to be labeled $\omega_i$, then the joint probability $p(x, \omega_i)$ can be estimated as
$$p_n(x, \omega_i) = \frac{k_i / n}{V}$$
Then a reasonable estimate of the a posteriori probability is
$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$
$k_i / k$ is the fraction of the samples within the cell that are labeled $\omega_i$, i.e. of the $k$ samples in the cell, $k_i$ are labeled $\omega_i$.
For minimum error rate, the most frequently represented category within the cell is selected for this cell, and any test sample that lies in this cell is labeled as this category.
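The fraction $k_i / k$ can be computed directly from the labels of the $k$ prototypes nearest to $x$. The sketch below assumes Euclidean distance and a cell defined by the $k$ nearest prototypes; the function name `knn_posteriors` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_posteriors(x, prototypes, labels, k):
    """Estimate P_n(w_i | x) as k_i / k over the k prototypes nearest to x."""
    # Indices of the prototypes, ordered by Euclidean distance to x.
    order = sorted(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    counts = Counter(labels[j] for j in order[:k])
    return {label: count / k for label, count in counts.items()}


prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]  # assumed data
labels = ["a", "a", "b", "b", "b"]
posteriors = knn_posteriors((0.2, 0.1), prototypes, labels, k=3)
decision = max(posteriors, key=posteriors.get)  # most frequently represented category
print(posteriors, decision)
```

Categories that do not appear among the $k$ neighbors are simply absent from the returned dictionary, which corresponds to an estimate of zero.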
The Nearest Neighbor Rule
Let $D_n = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ labeled prototypes.
Let $x' \in D_n$ be the closest prototype to a test point $x$; then the nearest-neighbor rule for classifying $x$ is to assign it the label associated with $x'$.
The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate.
If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate.
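A minimal sketch of the nearest-neighbor rule, assuming Euclidean distance and prototypes stored as tuples. The name `nearest_neighbor_label` and the data are hypothetical.

```python
import math


def nearest_neighbor_label(x, prototypes, labels):
    """Assign x the label of its closest prototype (Euclidean distance)."""
    nearest = min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    return labels[nearest]


prototypes = [(0.0, 0.0), (2.0, 2.0), (4.0, 0.0)]  # assumed data
labels = ["w1", "w2", "w1"]
print(nearest_neighbor_label((1.8, 1.5), prototypes, labels))
```

This linear scan costs one distance evaluation per prototype for every query, which is usually acceptable only for small prototype sets.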
The label $\theta'$ associated with the nearest neighbor $x'$ is a random variable, and the probability that $\theta' = \omega_i$ is the a posteriori probability $P(\omega_i \mid x')$.
If $n \to \infty$, it is always possible to find $x'$ sufficiently close to $x$ so that $P(\omega_i \mid x') \approx P(\omega_i \mid x)$.
We define $\omega_m(x)$ such that $P(\omega_m \mid x) = \max_i P(\omega_i \mid x)$. Then the Bayes rule always selects $\omega_m$ for $x$.
This rule essentially partitions the feature space into cells, each cell containing a prototype $x'$ and all points closer to it than to any other prototype. All points in a cell are labeled by the category of this $x'$; the partition is called the Voronoi tessellation of the space.
In each cell:
If $P(\omega_m \mid x) \gg P(\omega_i \mid x)$ for all $i \ne m$, so that $P(\omega_m \mid x) \approx 1$, then the nearest-neighbor selection is almost always the same as the Bayes selection.
If $P(\omega_m \mid x) \approx P(\omega_i \mid x) \approx 1/c$, then the nearest-neighbor selection is rarely the same as the Bayes selection, but their error rates are similar (i.e. both amount to random guessing).
The average probability of error of the nearest-neighbor rule for infinite samples is
$$P(e) = \int P(e \mid x)\, p(x)\, dx$$
The Bayes decision rule minimizes $P(e)$ by minimizing $P(e \mid x)$ for every $x$; then
$$P^*(e \mid x) = \min\big(P(e \mid x)\big) = 1 - P(\omega_m \mid x)$$
$$P^* = \min\big(P(e)\big) = \int P^*(e \mid x)\, p(x)\, dx$$
When only $n$ samples are used in the nearest-neighbor rule, the conditional probability of error becomes
$$P(e \mid x) = \int P(e \mid x, x')\, p(x' \mid x)\, dx'$$
where $x'$ is the nearest-neighbor prototype of $x$.
Each time we take $n$ samples, the nearest neighbor $x'$ may be different, i.e. $x'$ is a random variable.
As $n \to \infty$, $p(x' \mid x)$ approaches a delta function centered at $x$, $p(x' \mid x) \to \delta(x' - x)$ (i.e. a nearest neighbor $x'$ can always be found very close to $x$).
To solve $P_n(e \mid x, x'_n)$:
We have $n$ pairs of random variables $\{(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)\}$, where $\theta_j$ is the class label for $x_j$ and $\theta_j \in \{\omega_1, \omega_2, \ldots, \omega_c\}$.
Because the state of nature when $x'_n$ (the nearest neighbor of $x$ when the total sample is $n$) is drawn is independent of the state of nature when $x$ is drawn, we have
$$P(\theta, \theta'_n \mid x, x'_n) = P(\theta \mid x)\, P(\theta'_n \mid x'_n)$$
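If we use the nearest-neighbor decision rule, an error occurs when $\theta \ne \theta'_n$; therefore
$$P_n(e \mid x, x'_n) = 1 - \sum_{i=1}^{c} P(\theta = \omega_i, \theta'_n = \omega_i \mid x, x'_n) = 1 - \sum_{i=1}^{c} P(\omega_i \mid x)\, P(\omega_i \mid x'_n)$$
$$\lim_{n \to \infty} P_n(e \mid x) = \int \Big[ 1 - \sum_{i=1}^{c} P(\omega_i \mid x)\, P(\omega_i \mid x'_n) \Big] \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^{c} P^2(\omega_i \mid x)$$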
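To get a feel for the limiting conditional error, the sketch below evaluates $1 - \sum_i P^2(\omega_i \mid x)$ next to the Bayes conditional error $1 - \max_i P(\omega_i \mid x)$ for a given posterior vector. The function names and the posterior values are hypothetical.

```python
def nn_conditional_error(posteriors):
    """Asymptotic nearest-neighbor error at x: 1 - sum_i P(w_i | x)^2."""
    return 1.0 - sum(p * p for p in posteriors)


def bayes_conditional_error(posteriors):
    """Bayes error at x: 1 - max_i P(w_i | x)."""
    return 1.0 - max(posteriors)


for posteriors in ([0.9, 0.1], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]):  # assumed values
    print(posteriors, nn_conditional_error(posteriors), bayes_conditional_error(posteriors))
```

With posteriors (0.9, 0.1) the nearest-neighbor value is about 0.18 against a Bayes value of 0.1; with equal posteriors the two coincide, which matches the random-guess case described for the Voronoi cells.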
The overall asymptotic nearest-neighbor error rate is
$$P = \lim_{n \to \infty} P_n(e) = \lim_{n \to \infty} \int P_n(e \mid x)\, p(x)\, dx = \int \Big[ 1 - \sum_{i=1}^{c} P^2(\omega_i \mid x) \Big] p(x)\, dx$$
The error rate is bounded (proof in Sec. 4.5.3):
$$P^* \le P \le P^* \Big( 2 - \frac{c}{c-1} P^* \Big)$$
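The upper bound is easy to tabulate. The sketch below evaluates $P^*(2 - \frac{c}{c-1} P^*)$ for a given Bayes rate and number of classes; the function name `nn_error_upper_bound` and the sample values are hypothetical.

```python
def nn_error_upper_bound(bayes_rate, c):
    """Upper bound P*(2 - c/(c-1) P*) on the asymptotic nearest-neighbor error rate."""
    if c < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= bayes_rate <= (c - 1) / c:
        raise ValueError("Bayes rate must lie in [0, (c - 1) / c]")
    return bayes_rate * (2.0 - c / (c - 1) * bayes_rate)


for p_star in (0.0, 0.1, 0.25, 0.5):  # assumed Bayes rates for a two-class problem
    print(p_star, nn_error_upper_bound(p_star, c=2))
```

For small $P^*$ the bound is close to $2P^*$, which is the "never worse than twice the Bayes rate" statement; at the two extremes $P^* = 0$ and $P^* = (c-1)/c$ the bound equals $P^*$ itself.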
The k-Nearest Neighbor Rule
The k-nearest neighbor rule is an extension of the nearest neighbor rule.
Classify $x$ by assigning it the label most frequently represented among the $k$ nearest samples, using a voting scheme.
When the total number of prototypes approaches infinity, these $k$ neighbors will all converge to $x$.
In a two-class case, the k-nearest neighbor rule selects $\omega_m$ if a majority of the $k$ neighbors are labeled $\omega_m$; this event has the probability
$$\sum_{i=(k+1)/2}^{k} \binom{k}{i} P(\omega_m \mid x)^i \, [1 - P(\omega_m \mid x)]^{k-i}$$
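The majority probability is a binomial tail sum and can be evaluated numerically. This sketch assumes $k$ is odd (so no ties occur) and that the $k$ neighbor labels are independent draws with $P(\omega_m \mid x) = p$; the function name `majority_probability` is hypothetical.

```python
from math import comb


def majority_probability(p, k):
    """Probability that a majority of k neighbors are labeled w_m when P(w_m | x) = p."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k is assumed to be a positive odd integer")
    start = (k + 1) // 2  # smallest count that forms a majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(start, k + 1))


for k in (1, 3, 5):
    print(k, majority_probability(0.8, k))
```

For $p = 0.8$ the values increase with $k$ (0.8 for $k = 1$, roughly 0.896 for $k = 3$), showing how a larger $k$ makes the vote agree with the Bayes choice more often when $p > 1/2$.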
Example
$k = 3$ (odd value) and $x = (0.10, 0.25)^t$

Prototypes      Labels
(0.15, 0.35)    $\omega_1$
(0.10, 0.28)    $\omega_2$
(0.09, 0.30)    $\omega_1$
(0.12, 0.20)    $\omega_2$

The 3 closest vectors to $x$ (Euclidean distance) with their labels are:
{(0.10, 0.28; $\omega_2$); (0.09, 0.30; $\omega_1$); (0.12, 0.20; $\omega_2$)}
The majority voting scheme will assign the label $\omega_2$ to $x$.
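The example can be checked with a few lines of code. This sketch assumes Euclidean distance and writes the labels as the strings `"w1"` and `"w2"`; the function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter


def knn_classify(x, prototypes, labels, k):
    """Return the majority label among the k nearest prototypes, plus those prototypes."""
    order = sorted(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))
    nearest = order[:k]
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0], [prototypes[j] for j in nearest]


prototypes = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = ["w1", "w2", "w1", "w2"]
label, neighbors = knn_classify((0.10, 0.25), prototypes, labels, k=3)
print(label, neighbors)
```

The distances from $x$ are about 0.112, 0.030, 0.051 and 0.054 in table order, so (0.15, 0.35) is the farthest prototype and the vote among the three nearest goes to $\omega_2$ by two to one.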
Metrics and Nearest Neighbor Classification
The nearest-neighbor classifier relies on a certain distance function, or metric.
Frequently we assume the metric is Euclidean distance in $d$ dimensions, but it can be a generalized scalar distance between two argument patterns, $D(\cdot, \cdot)$.
A metric must have four properties: for any given vectors $a$, $b$ and $c$,
Non-negativity: $D(a, b) \ge 0$
Reflexivity: $D(a, b) = 0$ iff $a = b$
Symmetry: $D(a, b) = D(b, a)$
Triangle inequality: $D(a, b) + D(b, c) \ge D(a, c)$
The Euclidean distance in $d$ dimensions satisfies these properties:
$$D(a, b) = \Big( \sum_{k=1}^{d} (a_k - b_k)^2 \Big)^{1/2}$$
Euclidean distance is very sensitive to the scales (units) of the coordinates, which has a negative impact on the performance of nearest-neighbor classifiers.
Minkowski metric, also referred to as the $L_k$ norm:
$$L_k(a, b) = \Big( \sum_{i=1}^{d} |a_i - b_i|^k \Big)^{1/k}$$
Euclidean distance is the $L_2$ norm.
The $L_1$ norm is referred to as the Manhattan distance.
The $L_\infty$ distance between $a$ and $b$ is the maximum of the projections of $|a - b|$ onto the $d$ coordinate axes.
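The three named cases of the Minkowski metric can be compared on one pair of points. The sketch below is an assumed helper (`minkowski` is a hypothetical name), with the $L_\infty$ case handled as the largest coordinate difference.

```python
import math


def minkowski(a, b, k):
    """L_k distance between equal-length vectors a and b; k = math.inf gives L-infinity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if math.isinf(k):
        return max(diffs)  # largest projection onto a coordinate axis
    if k < 1:
        raise ValueError("k must be at least 1 for the triangle inequality to hold")
    return sum(d**k for d in diffs) ** (1.0 / k)


a, b = (0.0, 0.0), (3.0, 4.0)  # assumed points
print(minkowski(a, b, 1), minkowski(a, b, 2), minkowski(a, b, math.inf))
```

For these points the Manhattan, Euclidean and $L_\infty$ distances are 7, 5 and 4, so the choice of $k$ alone can change which prototype counts as nearest.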
Tanimoto metric, for two sets $S_1$ and $S_2$:
$$D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}}$$
where $n_1$ and $n_2$ are the numbers of elements in sets $S_1$ and $S_2$, and $n_{12}$ is the number in both sets.
The Tanimoto metric is frequently used in taxonomy.
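Tanimoto metric examples:
Consider four words as sets of unordered letters: pattern, pat, stop, pots.
$$D(\text{pattern}, \text{pat}) = \frac{7 + 3 - 2 \cdot 3}{7 + 3 - 3} = \frac{4}{7}, \qquad D(\text{pattern}, \text{stop}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}$$
$$D(\text{pattern}, \text{pots}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}, \qquad D(\text{pat}, \text{stop}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}$$
$$D(\text{pat}, \text{pots}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}, \qquad D(\text{stop}, \text{pots}) = \frac{4 + 4 - 2 \cdot 4}{4 + 4 - 4} = 0$$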
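These word distances can be reproduced programmatically. The sketch below counts letters with multiplicity (so "pattern" contributes 7 elements, matching the numbers used in the examples) by using `collections.Counter` as a multiset. That multiset treatment is an assumption made to match the worked values, and the function name `tanimoto` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def tanimoto(word1, word2):
    """Tanimoto distance (n1 + n2 - 2*n12) / (n1 + n2 - n12) between two letter multisets."""
    c1, c2 = Counter(word1), Counter(word2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    n12 = sum((c1 & c2).values())  # letters common to both, counted with multiplicity
    return Fraction(n1 + n2 - 2 * n12, n1 + n2 - n12)  # assumes at least one word is non-empty


words = ["pattern", "pat", "stop", "pots"]
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(w1, w2, tanimoto(w1, w2))
```

Exact fractions are returned so the results can be compared digit for digit with the hand computation; "stop" and "pots" come out at distance 0 because they contain the same letters.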
Uncritical use of a particular metric in a nearest-neighbor classifier can cause low performance.
The metric needs to be invariant to common transforms such as translation, rotation, scaling, etc.
It is very difficult to make a metric invariant to multiple transforms.
Typical solutions may include pre-processing two patterns to co-align them, shifting their centers and placing them in the same bounding box, etc. Automatic pre-processing can also be difficult and unreliable.
The tangent distance classifier uses a novel distance measure and a linear approximation to the arbitrary transforms.
Assume a classifier needs to handle $r$ transforms, such as horizontal translation, vertical translation, shear, rotation, scale and line thinning.
We take each prototype $x'$ and perform each of the transforms $F_i(x'; \alpha_i)$, where $\alpha_i$ is the parameter associated with this transform, such as the angle in rotation.
A tangent vector $TV_i$ is constructed for each transform:
$$TV_i = F_i(x'; \alpha_i) - x'$$
For each $d$-dimensional prototype $x'$, an $r \times d$ matrix $T$ is generated, consisting of the tangent vectors at $x'$. These vectors are linearly independent.
The prototype plus a linear combination of all tangent vectors forms an approximation of an arbitrary transform.
The tangent distance from a test point $x$ to a particular stored prototype $x'$ is defined as
$$D_{tan}(x', x) = \min_a \big[ \| (x' + T a) - x \| \big]$$
where $T$ is a matrix consisting of the $r$ tangent vectors at $x'$, $a$ is a vector of parameters for the linear combination, and $\| \cdot \|$ can be Euclidean distance.
In classification of $x$, we first find its tangent distance to $x'$ by finding the optimizing value of $a$.
This minimization is quadratic, and can be done using iterative gradient descent.
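A minimal sketch of this minimization, assuming the tangent vectors are supplied as a list of $r$ vectors of length $d$ and that plain gradient descent is run on the squared Euclidean norm (which has the same minimizer as the norm). The function names, the step size `lr` and the iteration count are hypothetical choices; the step size must be small relative to the scale of the tangent vectors for the iteration to converge.

```python
import math


def _residual(x, prototype, tangents, a):
    """Compute (x' + T a) - x component by component."""
    return [
        p + sum(aj * t[i] for aj, t in zip(a, tangents)) - xi
        for i, (p, xi) in enumerate(zip(prototype, x))
    ]


def tangent_distance(x, prototype, tangents, lr=0.1, steps=1000):
    """Approximate min_a ||(x' + T a) - x|| by gradient descent; returns (distance, a)."""
    a = [0.0] * len(tangents)
    for _ in range(steps):
        res = _residual(x, prototype, tangents, a)
        # Gradient of the squared norm with respect to a_j is 2 * (t_j . residual).
        grad = [2.0 * sum(ti * ri for ti, ri in zip(t, res)) for t in tangents]
        a = [aj - lr * gj for aj, gj in zip(a, grad)]
    res = _residual(x, prototype, tangents, a)
    return math.sqrt(sum(r * r for r in res)), a


# Assumed toy case: one tangent vector along the first axis (e.g. a horizontal shift).
distance, a = tangent_distance((3.0, 4.0), (0.0, 0.0), [(1.0, 0.0)])
print(distance, a)
```

In this toy case the plain Euclidean distance between the two points is 5, while the tangent distance is 4: moving the prototype along its tangent direction removes the component of the difference that the transform can explain.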
