CpE 646 Pattern Recognition

Prof. Hong Man

Department of Electrical and Computer Engineering
Stevens Institute of Technology
Non-Parametric Classification

Chapter 4 (Sections 4.4-4.6):

Nearest Neighbor Estimation

The Nearest-Neighbor Rule

kn-Nearest Neighbor Estimation

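A solution for the problem of the unknown best window function.

To estimate p(x) from n training samples, let the cell volume be a function of the training data: center a cell about x and let it grow until it captures $k_n$ samples ($k_n = f(n)$).
The included samples are called the $k_n$ nearest-neighbors of x.
The density is given as
$$p_n(x) = \frac{k_n / n}{V_n} \qquad (30)$$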
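The estimate in (30) can be sketched in a few lines of Python for the one-dimensional case. This is only an illustrative sketch: the function name `knn_density_1d` and the sample values are assumptions, and the cell is taken to be the interval reaching the k-th nearest sample, so its "volume" is twice that distance.

```python
import math

def knn_density_1d(x, samples, k):
    """k-NN density estimate p_n(x) = (k/n) / V_n in one dimension."""
    n = len(samples)
    # The distance to the k-th nearest sample sets the cell radius.
    radius = sorted(abs(x - s) for s in samples)[k - 1]
    volume = 2.0 * radius  # length of the interval [x - radius, x + radius]
    return (k / n) / volume

# Hypothetical training samples, chosen only for illustration.
samples = [0.1, 0.4, 0.5, 0.55, 0.6, 0.9, 1.3, 2.0, 2.1]
k = round(math.sqrt(len(samples)))  # one common choice: k_n = sqrt(n) = 3
print(knn_density_1d(0.5, samples, k))
```

Note that the sketch divides by zero if x coincides with a sample and k = 1; a practical implementation would need to guard against that case.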
kn-Nearest Neighbor Estimation

Two possibilities can occur:

If the density is high near x, the cell will be small, which provides good resolution.
If the density is low around x, the cell will grow large and stop only when higher-density regions are reached.
kn-Nearest Neighbor Estimation

It can be proven that
$$\lim_{n \to \infty} k_n = \infty \quad \text{and} \quad \lim_{n \to \infty} k_n / n = 0$$
are necessary and sufficient for $p_n(x)$ to converge to p(x) at all points where p(x) is continuous.
If $k_n = \sqrt{n}$ and $p_n(x)$ is a good estimate of p(x), i.e. $p_1(x) = p_n(x) = p(x)$, then from (30) we have
$$V_n = \frac{1}{\sqrt{n}\, p(x)} \quad \text{and then} \quad V_n = \frac{V_1}{\sqrt{n}}$$
This becomes similar to the Parzen-window approach, except that $V_1$ is not determined arbitrarily.
kn-Nearest Neighbor Estimation

Peaks are at the middle of regions with k prototypes

kn-Nearest Neighbor Estimation

For $k_n = \sqrt{n}$, when n = 1, $k_1 = 1$, and the estimate is
$$p_n(x) = \frac{1}{2\,|x - x_1|}$$
where $x_1$ is the first training sample, and |·| is a distance measure.
This is not a good estimate (Figure 4.12).
As n increases, the estimate gets better.
This method will not generate zero p(x) for any x. (If a fixed window, e.g. Parzen window, is used and no sample falls inside this window, the density estimate for this window will be zero. This will not happen here.)
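The contrast between a fixed window and a growing cell can be illustrated with a small one-dimensional sketch. The function names, the box-shaped window of width `h`, and the sample values are all assumptions made for this illustration.

```python
def parzen_box_density_1d(x, samples, h):
    """Fixed-window estimate: count samples within h/2 of x, divide by n*h."""
    n = len(samples)
    k = sum(1 for s in samples if abs(x - s) <= h / 2)
    return (k / n) / h

def knn_density_1d(x, samples, k):
    """k-NN estimate: the cell grows until it reaches the k-th nearest sample."""
    n = len(samples)
    radius = sorted(abs(x - s) for s in samples)[k - 1]
    return (k / n) / (2.0 * radius)

# Hypothetical samples clustered near the origin; x_far lies far from all of them.
samples = [0.0, 0.2, 0.3, 0.5]
x_far = 5.0

p_fixed = parzen_box_density_1d(x_far, samples, h=1.0)  # no sample in the window
p_knn = knn_density_1d(x_far, samples, k=2)             # cell grows until it holds 2 samples
print(p_fixed, p_knn)
```

Far from the data the fixed window is empty, so its estimate is exactly zero, while the k-NN cell keeps growing and yields a small positive value.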
kn-Nearest Neighbor Estimation

We can obtain a family of estimates by making $k_n = k_1 \sqrt{n}$ and adjusting the value $k_1$.
Similar to the Parzen window method, the choice of $k_1$ is case-dependent.
Usually $k_1$ is selected so that, when the estimated density is applied to classify new test samples from the same density, it yields the lowest error rate.
kn-Nearest Neighbor Estimation

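Estimation of a posteriori probabilities. We can estimate $P(\omega_i|x)$ from a set of n labeled samples using the window: place a cell of volume V around x so that it captures k samples. If $k_i$ of them turn out to be labeled $\omega_i$, then the joint probability $p(x, \omega_i)$ can be estimated as
$$p_n(x, \omega_i) = \frac{k_i / n}{V}$$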
kn-Nearest Neighbor Estimation

Then a reasonable estimate of the a posteriori probability is
$$P_n(\omega_i | x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$
$k_i/k$ is the fraction of the samples within the cell that are labeled $\omega_i$.
For minimum error rate, the most frequently represented category within the cell is selected for this cell, and any test sample that lies in this cell is labeled as this category.
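The fraction $k_i/k$ can be computed directly from the labels of the k nearest samples. The sketch below is illustrative only: the function name `knn_posteriors`, the label strings, and the data points are assumptions, and Euclidean distance is assumed for finding neighbors.

```python
from collections import Counter
import math

def knn_posteriors(x, prototypes, labels, k):
    """Estimate P(w_i | x) as k_i / k among the k nearest labeled samples."""
    order = sorted(range(len(prototypes)),
                   key=lambda j: math.dist(x, prototypes[j]))
    counts = Counter(labels[j] for j in order[:k])
    return {label: count / k for label, count in counts.items()}

# Hypothetical labeled samples in two dimensions.
prototypes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (1.1, 0.9)]
labels = ["w1", "w1", "w2", "w2", "w2"]

post = knn_posteriors((0.05, 0.05), prototypes, labels, k=3)
decision = max(post, key=post.get)  # most frequently represented category
print(post, decision)
```

Here two of the three nearest samples carry the label `w1`, so the estimated posterior for `w1` is 2/3 and the minimum-error decision is `w1`.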
The Nearest Neighbor Rule

Let $D_n = \{x_1, x_2, \ldots, x_n\}$ be a set of n labeled prototypes.

Let $x' \in D_n$ be the closest prototype to a test point x; then the nearest-neighbor rule for classifying x is to assign it the label associated with $x'$.
The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate.
If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate.
The Nearest Neighbor Rule

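The label $\theta'$ associated with the nearest neighbor $x'$ is a random variable, and the probability that $\theta' = \omega_i$ is the a posteriori probability $P(\omega_i | x')$.
If $n \to \infty$, it is always possible to find $x'$ sufficiently close to x so that $P(\omega_i | x') \approx P(\omega_i | x)$.
We define $\omega_m$ by $P(\omega_m | x) = \max_i P(\omega_i | x)$. Then the Bayes rule always selects $\omega_m$ for x.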
The Nearest Neighbor Rule

This rule essentially partitions the feature space into cells, each cell containing a prototype $x'$ and all points closer to it than to any other prototype. All points in a cell are labeled by the category of this $x'$. This partition is called the Voronoi tessellation of the space.
In each cell,
If $P(\omega_m | x) \approx 1$, then the nearest neighbor selection is almost always the same as the Bayes selection.
If $P(\omega_m | x) \approx P(\omega_i | x) \approx 1/c$, then the nearest neighbor selection is rarely the same as the Bayes selection, but their error rates are similar (i.e. both are random guesses).
The Nearest Neighbor Rule

The average probability of error of the nearest-neighbor rule for infinite samples is
$$P(e) = \int P(e | x)\, p(x)\, dx$$
The Bayes decision rule minimizes P(e) by minimizing P(e|x) for every x, then
$$P^*(e | x) \equiv \min\big(P(e | x)\big) = 1 - P(\omega_m | x)$$
$$P^* \equiv \min\big(P(e)\big) = \int P^*(e | x)\, p(x)\, dx$$
The Nearest Neighbor Rule

When only n samples are used in the nearest neighbor rule, the conditional probability of error becomes
$$P_n(e | x) = \int P_n(e | x, x')\, p(x' | x)\, dx'$$
where $x'$ is the nearest neighbor prototype of x.
Each time we take n samples, the nearest neighbor $x'$ may be different.
As $n \to \infty$, $p(x' | x)$ approaches a delta function centered at x, $p(x' | x) \to \delta(x' - x)$ (i.e. a nearest neighbor $x'$ can always be found very close to x).
The Nearest Neighbor Rule

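To solve $P_n(e | x, x')$:
We have n pairs of random variables $\{(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)\}$, where $\theta_j$ is the class label for $x_j$ and $\theta_j \in \{\omega_1, \omega_2, \ldots, \omega_c\}$.
Because the state of nature when $x'_n$ (the nearest neighbor of x) is drawn is independent of the state of nature when x is drawn, we have
$$P(\theta, \theta'_n | x, x'_n) = P(\theta | x)\, P(\theta'_n | x'_n)$$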
The Nearest Neighbor Rule

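If we use the nearest-neighbor decision rule, the error occurs when $\theta \neq \theta'_n$, therefore
$$P_n(e | x, x'_n) = 1 - \sum_{i=1}^{c} P(\theta = \omega_i, \theta'_n = \omega_i | x, x'_n) = 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n)$$
$$\lim_{n \to \infty} P_n(e | x) = \int \Big[ 1 - \sum_{i=1}^{c} P(\omega_i | x)\, P(\omega_i | x'_n) \Big] \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^{c} P^2(\omega_i | x)$$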
The Nearest Neighbor Rule

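The overall asymptotic nearest-neighbor error rate is
$$P = \lim_{n \to \infty} P_n(e) = \lim_{n \to \infty} \int P_n(e | x)\, p(x)\, dx = \int \Big[ 1 - \sum_{i=1}^{c} P^2(\omega_i | x) \Big] p(x)\, dx$$
The error rate is bounded (proof in Sec 4.5.3):
$$P^* \le P \le P^* \Big( 2 - \frac{c}{c-1} P^* \Big)$$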
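The bound can be checked numerically on a toy problem. In the sketch below the integral over x is replaced by a weighted sum over three hypothetical points; the posterior vectors, the weights standing in for p(x), and all function names are assumptions made for illustration.

```python
def bayes_error(post):
    """Conditional Bayes error: 1 - max_i P(w_i | x)."""
    return 1.0 - max(post)

def nn_error(post):
    """Asymptotic conditional nearest-neighbor error: 1 - sum_i P(w_i | x)^2."""
    return 1.0 - sum(p * p for p in post)

def averaged(err, posts, weights):
    """Discrete stand-in for the integral of err(x) * p(x) dx."""
    return sum(w * err(p) for p, w in zip(posts, weights))

c = 3
# Hypothetical posteriors at three points x, with weights summing to 1.
posts = [(0.9, 0.05, 0.05), (0.5, 0.3, 0.2), (1 / 3, 1 / 3, 1 / 3)]
weights = [0.5, 0.3, 0.2]

p_star = averaged(bayes_error, posts, weights)  # Bayes rate P*
p_nn = averaged(nn_error, posts, weights)       # asymptotic NN rate P
upper = p_star * (2.0 - c / (c - 1) * p_star)   # upper bound on P
print(p_star, p_nn, upper)
```

For these values the nearest-neighbor rate falls between the Bayes rate (about 0.33) and the upper bound (about 0.50), as the inequality requires.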
The k-Nearest Neighbor Rule

The k-nearest neighbor rule is an extension of the nearest neighbor rule.

Classify x by assigning it the label most frequently represented among the k nearest samples.
When the total number of prototypes approaches infinity, these k neighbors will all converge to x.
In a two-class case, the k-nearest neighbor rule selects $\omega_m$ if a majority of the k nearest neighbors are labeled $\omega_m$, an event that has the probability
$$\sum_{i=(k+1)/2}^{k} \binom{k}{i} P(\omega_m | x)^i \, [1 - P(\omega_m | x)]^{k-i}$$
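The binomial sum above is easy to evaluate for small odd k. The sketch below assumes a hypothetical posterior value of 0.8 and uses an illustrative function name.

```python
from math import comb

def prob_majority(p_m, k):
    """Probability that at least (k+1)/2 of k neighbors are labeled w_m (k odd)."""
    return sum(comb(k, i) * p_m**i * (1.0 - p_m)**(k - i)
               for i in range((k + 1) // 2, k + 1))

p_m = 0.8  # hypothetical value of P(w_m | x)
for k in (1, 3, 5):
    print(k, prob_majority(p_m, k))
```

With this posterior the probability of selecting $\omega_m$ grows from 0.8 at k = 1 to roughly 0.90 at k = 3 and 0.94 at k = 5, which suggests why larger k brings the rule closer to the Bayes selection when enough prototypes are available.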
The k-Nearest Neighbor Rule

Example: k = 3 (odd value) and x = (0.10, 0.25)^t

Prototypes      Labels
(0.15, 0.35)    $\omega_1$
(0.10, 0.28)    $\omega_2$
(0.12, 0.20)    $\omega_2$

The 3 closest vectors to x with their labels are:
{(0.10, 0.28; $\omega_2$); (0.12, 0.20; $\omega_2$); (0.15, 0.35; $\omega_1$)}
The majority voting scheme will assign the label $\omega_2$ to x.
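The voting in this example can be reproduced with a short sketch. The three prototypes listed as closest are taken from the slide; the fourth prototype, (0.60, 0.70) with label 3, is a hypothetical addition so that the neighbor search has something to exclude. Class labels are written as plain integers, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_classify(x, prototypes, k):
    """Return the majority label among the k prototypes nearest to x."""
    nearest = sorted(prototypes, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest

prototypes = [
    ((0.15, 0.35), 1),
    ((0.10, 0.28), 2),
    ((0.12, 0.20), 2),
    ((0.60, 0.70), 3),  # hypothetical distant prototype, not from the slide
]
x = (0.10, 0.25)

label, nearest = knn_classify(x, prototypes, k=3)
print(label, nearest)
```

The two prototypes labeled 2 outvote the single prototype labeled 1, so x receives label 2.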
Metrics and Nearest Neighbor Classification

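The nearest neighbor classifier relies on a certain distance function, or metric.

Frequently we assume the metric is the Euclidean distance in d dimensions, but the notion of a metric is more general.
A metric D(·,·) gives a generalized scalar distance between two argument patterns.
A metric must have four properties. For all vectors a, b and c:
Non-negativity: $D(a,b) \ge 0$
Reflexivity: $D(a,b) = 0$ iff $a = b$
Symmetry: $D(a,b) = D(b,a)$
Triangle inequality: $D(a,b) + D(b,c) \ge D(a,c)$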
Metrics and Nearest Neighbor Classification

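The Euclidean distance in d dimensions satisfies these properties:
$$D(a, b) = \Big( \sum_{k=1}^{d} (a_k - b_k)^2 \Big)^{1/2}$$
The Euclidean distance is sensitive to the scaling of the coordinates, which has a negative impact on the performance of nearest-neighbor classifiers.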
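The effect of coordinate scaling on the nearest neighbor can be seen in a small sketch. The points and the scale factor are hypothetical; multiplying the second coordinate by 10 stands in for a change of measurement units.

```python
import math

def nearest_index(x, prototypes):
    """Index of the prototype closest to x in Euclidean distance."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))

def rescale(point, factors):
    """Multiply each coordinate by its own scale factor."""
    return tuple(v * f for v, f in zip(point, factors))

x = (0.0, 0.0)
prototypes = [(1.0, 0.1), (0.2, 0.8)]  # hypothetical prototypes
factors = (1.0, 10.0)                  # hypothetical unit change on the second axis

before = nearest_index(x, prototypes)
after = nearest_index(rescale(x, factors),
                      [rescale(p, factors) for p in prototypes])
print(before, after)
```

Before rescaling the second prototype is nearest; after rescaling the first one is, so the same data can be classified differently depending only on the choice of units.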
Metrics and Nearest Neighbor Classification

Minkowski metric, also referred to as the $L_k$ norm:
$$L_k(a, b) = \Big( \sum_{i=1}^{d} |a_i - b_i|^k \Big)^{1/k}$$
Euclidean distance is the $L_2$ norm.
The $L_1$ norm is referred to as the Manhattan distance.
The $L_\infty$ distance between a and b is the maximum of the projections of |a-b| on the d coordinate axes.
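A minimal sketch of these norms, with illustrative function names and a hypothetical pair of points:

```python
def minkowski(a, b, k):
    """L_k distance: (sum_i |a_i - b_i|^k)^(1/k)."""
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)

def l_inf(a, b):
    """L_infinity distance: the largest coordinate-wise difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
print(l_inf(a, b))         # 4
```

As k increases, the $L_k$ distance decreases toward the $L_\infty$ value, since the largest coordinate difference dominates the sum.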
Metrics and Nearest Neighbor Classification

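Tanimoto metric, for two sets $S_1$ and $S_2$:
$$D_{Tanimoto}(S_1, S_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}}$$
where $n_1$ and $n_2$ are the numbers of elements in sets $S_1$ and $S_2$, and $n_{12}$ is the number of elements in both sets.

The Tanimoto metric is frequently used in taxonomy.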
Metrics and Nearest Neighbor Classification

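Tanimoto metric examples:

Consider four words as sets of unordered letters: pattern, pat, stop, pots
$$D(\text{pattern}, \text{pat}) = \frac{7 + 3 - 2 \cdot 3}{7 + 3 - 3} = \frac{4}{7}, \qquad D(\text{pattern}, \text{stop}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}$$
$$D(\text{pattern}, \text{pots}) = \frac{7 + 4 - 2 \cdot 2}{7 + 4 - 2} = \frac{7}{9}, \qquad D(\text{pat}, \text{stop}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}$$
$$D(\text{pat}, \text{pots}) = \frac{3 + 4 - 2 \cdot 2}{3 + 4 - 2} = \frac{3}{5}, \qquad D(\text{stop}, \text{pots}) = \frac{4 + 4 - 2 \cdot 4}{4 + 4 - 4} = 0$$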
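These values can be reproduced with a short sketch. It follows the counting used in the worked examples, where "pattern" contributes 7 letters (the repeated "t" is counted), so $n_1$ and $n_2$ are taken as word lengths while $n_{12}$ is the number of distinct shared letters. The function name is an assumption, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def tanimoto(word1, word2):
    """Tanimoto distance (n1 + n2 - 2*n12) / (n1 + n2 - n12) between two words."""
    n1, n2 = len(word1), len(word2)        # letter counts, as in the examples
    n12 = len(set(word1) & set(word2))     # distinct letters shared by both words
    return Fraction(n1 + n2 - 2 * n12, n1 + n2 - n12)

words = ["pattern", "pat", "stop", "pots"]
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(w1, w2, tanimoto(w1, w2))
```

The distance between "stop" and "pots" is zero because they contain the same letters, which reflects that the words are compared as unordered collections.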
Metrics and Nearest Neighbor Classification

Uncritical use of a particular metric in a nearest-neighbor classifier can cause low performance.

The metric needs to be invariant to common transforms.
It is very difficult to make a metric invariant to multiple transforms.
Typical solutions may include pre-processing two patterns to co-align them, shifting the centers and placing them in the same bounding box. Such pre-processing can also be difficult and unreliable.
Metrics and Nearest Neighbor Classification

The tangent distance classifier uses a novel distance measure and a linear approximation to the arbitrary transforms.
Consider a classifier that must handle r transforms, such as horizontal translation, vertical translation, shear, rotation, scale and line thinning.
We take each prototype $x'$ and perform each of the transforms $F_i(x'; \alpha_i)$, where $\alpha_i$ is the parameter associated with this transform, such as the angle in rotation.
Metrics and Nearest Neighbor Classification

A tangent vector $TV_i$ is constructed for each transform:
$$TV_i = F_i(x'; \alpha_i) - x'$$
For each d-dimensional prototype $x'$, an r × d matrix T is generated, consisting of the tangent vectors at $x'$.
These vectors are linearly independent.
The prototype plus a linear combination of all tangent vectors forms an approximation of an arbitrary transform.
Metrics and Nearest Neighbor Classification

The tangent distance from a test point x to a particular stored prototype $x'$ is defined as
$$D_{tan}(x', x) = \min_a \big[ \| (x' + Ta) - x \| \big]$$
where T is a matrix consisting of the r tangent vectors at $x'$, and a is a vector of parameters for the linear combination.
In classification of x, we will first find its tangent distance to $x'$ by finding the optimizing value of a. This can be done by iterative gradient descent.
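The minimization over a can be sketched with plain gradient descent on the squared norm. Everything specific here is an assumption made for illustration: the function name, the step size `lr`, the iteration count, and the toy tangent vectors (stored as r rows of length d). The fixed step size only converges when it is small relative to the size of the tangent vectors, as it is for the unit vectors used below.

```python
import math

def tangent_distance(x, x_proto, tangents, lr=0.1, steps=1000):
    """Minimize ||(x' + sum_i a_i TV_i) - x|| over a by gradient descent."""
    r, d = len(tangents), len(x)

    def residual(a):
        # (x' + sum_i a_i TV_i) - x, computed coordinate by coordinate
        return [x_proto[j] + sum(a[i] * tangents[i][j] for i in range(r)) - x[j]
                for j in range(d)]

    a = [0.0] * r
    for _ in range(steps):
        res = residual(a)
        # Gradient of the squared norm with respect to a_i is 2 * (TV_i . res)
        grad = [2.0 * sum(tangents[i][j] * res[j] for j in range(d))
                for i in range(r)]
        a = [a[i] - lr * grad[i] for i in range(r)]
    return math.sqrt(sum(v * v for v in residual(a))), a

# Hypothetical 3-dimensional prototype with r = 2 tangent vectors.
x_proto = (0.0, 0.0, 0.0)
tangents = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
x = (2.0, 1.0, 3.0)

dist, a_opt = tangent_distance(x, x_proto, tangents)
print(dist, a_opt)
```

In this toy case the tangent vectors absorb the first two coordinates of the difference, so the tangent distance is about 3, smaller than the Euclidean distance of about 3.74 between x and the prototype.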

