
Content-based image retrieval tutorial

arXiv:1608.03811v1 [stat.ML] 12 Aug 2016

Joani Mitro

giannismitros@gmail.com

Technical Report

This paper functions as a tutorial for individuals interested in entering the field of information retrieval but who do not know where to begin. It describes two fundamental yet efficient image retrieval techniques, the first being k-nearest neighbours (k-NN) and the second support vector machines (SVM). The goal is to provide the reader with both the theoretical and practical aspects in order to acquire a better understanding. Along with this tutorial we have also developed the equivalent software[1] using the MATLAB environment in order to illustrate the techniques, so that the reader can have a hands-on experience.

[1] https://github.com/kirk86/ImageRetrieval
Notation

Table 1: Notation

$X, Y, M$                                             bold face roman letters indicate matrices
$\vec{x}, \vec{y}, \vec{v}$                           bold face small letters indicate vectors
$a, b, c$                                             small letters indicate scalar values
$\left(\sum_{i=1}^{N} |x_i - y_i|^p\right)^{1/p}$     p-norm distance function
$\lim_{p\to\infty}\left(\sum_{i=1}^{N} |x_i - y_i|^p\right)^{1/p}$   infinity-norm distance function
$g(z)$                                                nonlinear function (e.g. sigmoid, tanh, etc.)
$z = W^T\vec{x} + \vec{b}$                            score function, mapping function
$\|\vec{b}\|_2 = 1$                                   vector norm
$\phi(\vec{x})$                                       feature mapping function
$K(\vec{x}, \vec{z})$                                 kernel function
$K_m$                                                 kernel matrix

1 Introduction
As we have already mentioned, this tutorial serves as an introduction to the field of information retrieval for the interested reader. Apart from that, there has always been a motivation for the development of efficient media retrieval systems, since the new era of digital communication has brought an explosion of multimedia data over the internet. This trend has continued with the increasing popularity of imaging devices, such as the digital cameras that nowadays are an inseparable part of any smartphone, together with an increasing proliferation of image data over communication networks.

2 Data pre-processing
As in any other case, before we use our data we first have to clean them, if necessary, and transform them into a format that is understandable by the prediction algorithms. In this particular case the adopted process includes the following six steps, applied to each image in our dataset D, in order to transform the raw pixel images into something meaningful that the prediction algorithms can understand. In other words, we map the raw pixel values into a feature space.

1. We start by computing the color histogram for each image. In this case the HSV color space has been chosen, and the H, S, V components are uniformly quantized into 8, 2 and 2 bins respectively. This produces a vector of 32 elements/values for each image.

2. The next step is to compute the color auto-correlogram for each image, where the image is quantized into 4 × 4 × 4 = 64 colors in the RGB space. This process produces a vector of 64 elements/values for each image.

3. Next, we extract the first two moments (i.e. mean and standard deviation) for
each R,G,B color channel. This gives us a vector of 6 elements/values.

4. Moving forward, we compute the mean and standard deviation of the Gabor wavelet coefficients, which produces a vector of 48 elements/values. This computation requires applying the Gabor wavelet filters to each image, spanning four scales (0.05, 0.1, 0.2, 0.4) and six orientations ($\theta_0 = 0$, $\theta_{n+1} = \theta_n + \pi/6$).

5. Last but not least, we apply the wavelet transform to each image with a 3-level decomposition. In this case the mean and standard deviation of the transform coefficients are utilized to form a feature vector of 40 elements/values for each image.

6. Finally, we combine the vectors from steps 1–5 into a new vector $\vec{\rho}$ of dimensionality 32 + 64 + 6 + 48 + 40 + 1, where each of the first five numbers indicates the dimensionality of the corresponding vector from steps 1–5 that has been concatenated into $\vec{\rho}$ (a minimal sketch of this extraction and concatenation is given below).
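As a minimal Python sketch (not the accompanying MATLAB code) of steps 1, 3 and 6, the snippet below computes the 32-bin HSV histogram and the six color moments and concatenates them; the auto-correlogram, Gabor and wavelet statistics of steps 2, 4 and 5 would be appended to the same vector in exactly the same way. Function names here are illustrative only.

import numpy as np
from matplotlib.colors import rgb_to_hsv

def hsv_histogram(rgb, bins=(8, 2, 2)):
    """Step 1: 8x2x2 = 32-bin joint HSV histogram of an RGB image scaled to [0, 1]."""
    hsv = rgb_to_hsv(rgb)                                  # shape (m, n, 3)
    hist, _ = np.histogramdd(hsv.reshape(-1, 3), bins=bins,
                             range=((0, 1), (0, 1), (0, 1)))
    return hist.ravel() / hist.sum()                       # flattened 32-element vector

def color_moments(rgb):
    """Step 3: mean and standard deviation of each R, G, B channel (6 values)."""
    return np.concatenate([rgb.mean(axis=(0, 1)), rgb.std(axis=(0, 1))])

def extract_features(rgb):
    """Step 6: concatenate the per-step descriptors into one vector rho."""
    return np.concatenate([hsv_histogram(rgb), color_moments(rgb)])

# usage: rho = extract_features(np.random.rand(64, 64, 3))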

3 Methodology
3.1 k-Nearest Neighbour
The k-Nearest Neighbour (k-NN) classifier belongs to the family of instance-based learning (IBL) algorithms. IBL algorithms construct hypotheses directly from the training data themselves, which means that the hypothesis complexity can grow with the data. One of their advantages is the ability to adapt the model to previously unseen data. Other advantages are the low cost of updating object instances and the fast learning rate, since no training is required. Some other examples of IBL algorithms besides k-NN are kernel machines and RBF networks. Among the disadvantages of IBL algorithms, including k-NN, besides the computational complexity already mentioned, is the fact that they fail to produce good results with noisy, irrelevant, nominal or missing attribute values. They also do not provide a natural way of explaining how the data is structured. The efficacy of the k-NN algorithm relies on the use of a user-defined similarity function, for instance a p-norm distance function, which determines the nearest neighbours among the chosen set of examples. k-NN is also often used as a base procedure in benchmarking and comparative studies. Because it requires no training, any trained rule it is compared against is expected to perform better; if it does not, the trained rule is deemed useless for the application under study.
Since the nearest neighbour rule is a fairly simple algorithm, most textbooks give it a short reference but neglect to provide any facts about who invented the rule in the first place. Marcello Pelillo [1] tried to give an answer to this question. Pelillo refers often to the famous Cover and Hart paper (1967) [4], which shows what happens if a very large, selectively chosen training set is used. Before Cover and Hart, the rule was mentioned by Nilsson (1965) [5], who called it the "minimum distance classifier", and by Sebestyen (1962) [2], who called it the "proximity algorithm". Fix and Hodges [3], in their very early discussion of non-parametric discrimination (1951), already pointed to the nearest neighbour rule as well. The fundamental principle known as Ockham's razor, "select the hypothesis with the fewest assumptions", can be understood as the nearest neighbour rule for nominal properties; it is, however, not formulated in terms of observations. Ockham worked in the 14th century and emphasized observations before ideas. Pelillo pointed out that this was already done prior to Ockham by Alhazen [6] (Ibn al-Haytham), a very early scientist (965–1040) in the field of optics and perception. Pelillo cites some paragraphs in which Alhazen describes a training procedure as a "universal form", which is completed by recalling the original objects, which Alhazen referred to as "particular forms".
To better understand the k-NN rule we will set up the concept of our application. Suppose that we have a dataset comprising 1000 images in total, categorized in 10 different categories/classes, each of which includes 100 images. Given an image $I \in \mathbb{R}^{m\times n}$ we would like to find all similar images from the pool of candidate images (i.e. all similar images from the dataset of 1000 images in total). A sensible first-attempt algorithm would look something like this:

Data: D = {image_1, image_2, image_3, ..., image_N}, "the set of all images"
Data: I ∈ R^{m×n}, "query image whose similarity against D we are trying to identify"
Result: d, "for each image in D, a scalar indicating how similar it is to I"
for image in D do
    for column in image height: m ← height(I) do
        for row in image width: n ← width(I) do
            d[row][column] ← abs(I[row][column] − image[row][column])
        end
    end
    d[image] ← sum(d[row][column])   "sum across rows and columns"
end
Algorithm 1: naive k-NN algorithm.
Here is a visual representation of what it might look like:

[Figure 1 shows a row of cells, one for each image in D, where each cell indicates the similarity between the query image I and the corresponding image in the dataset D.]

Figure 1: Final visual result of Algorithm 1.
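For concreteness, a direct Python translation of Algorithm 1 might look like the following sketch, assuming the query and every dataset image have been loaded as equally sized 2-D numpy arrays (the function name is illustrative only).

import numpy as np

def naive_similarity(I, D):
    """Algorithm 1: sum of absolute pixel differences between the query image I
    and every image in the dataset D (a smaller value means more similar)."""
    d = np.empty(len(D))
    for idx, image in enumerate(D):
        total = 0.0
        for row in range(I.shape[0]):           # loop over the image height m
            for col in range(I.shape[1]):       # loop over the image width n
                total += abs(I[row, col] - image[row, col])
        d[idx] = total                          # sum across rows and columns
    return d

# the k most similar images are then: np.argsort(naive_similarity(I, D))[:k]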

The complexity of the naive k-NN is O(Dmn). Can we do better than that? Of course we can, if we avoid some of those loops by vectorizing our main operations. Instead of operating on the 2-D images we can vectorize them first and then perform the operations. First we transform our images from 2-D matrices to 1-D vectors, as demonstrated in the figure below.
[Figure 2 shows the m × n query image I being unrolled, column by column, into a single vector of length m · n.]

Figure 2: Vectorizing images.

Let us denote the vectorization of our query image $I \in \mathbb{R}^{m\times n}$ as $\vec{x} \in \mathbb{R}^d$, and by $\vec{d}_{image} \in \mathbb{R}^d$ the vectorization of every other image in the dataset D. Then our k-NN algorithm can be described as computing $\sum_{image=1}^{N} |\vec{x} - \vec{d}_{image}|$, and the complexity has now been reduced to O(D). The choice of distance metric or distance function is solely up to the discretion of the user. Another view of how the k-NN algorithm operates is depicted in Figure 3. Notice that k-NN performs an implicit tessellation of the feature space that is not visible to the observer, but it is through this tessellation that it is able to distinguish nearby data and classify them as similar. For instance, let's pretend that the black capital "X" letters in Figure 3 denote some data projected on the feature space. When a new datum comes in, such as in this case the red capital "X" letter, which indicates a query image, then the algorithm can easily distinguish and assign to it the closest images, which are semantically similar.
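As a hedged numpy sketch of the vectorized version, assume the dataset has been stacked into an N × d matrix X whose rows are the vectorized images, and that the query has been flattened into a vector x; the distance function (here a p-norm) is up to the user, as noted above.

import numpy as np

def knn_retrieve(x, X, k=10, p=1):
    """Return the indices of the k dataset images closest to the query vector x
    under the chosen p-norm distance."""
    dists = np.linalg.norm(X - x, ord=p, axis=1)   # one distance per dataset image
    return np.argsort(dists)[:k]

# usage: indices = knn_retrieve(I.ravel(), np.stack([img.ravel() for img in D]))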

[Figure 3 shows training points marked with black "X" letters scattered in the feature space; the implicit Voronoi tessellation of the space determines which of them lie closest to a query point, marked with a red "X".]

Figure 3: Voronoi diagram of k-NN.

3.2 Support Vector Machines

Support Vector Machines (SVMs, also known as support vector networks) are supervised learning models used, among other things, for classification and regression analysis. They were introduced in 1992 at the Conference on Learning Theory by Boser, Guyon and Vapnik. They have become quite popular since then, because SVM is a theoretically well motivated algorithm, developed since the 60s from Statistical Learning Theory (Vapnik and Chervonenkis), and it also shows good empirical performance in a diverse number of scientific fields and application domains such as bioinformatics, text and image recognition, music retrieval and many more. SVMs are based on the idea of separating data with a large "gap", also known as a margin. During the presentation of SVM we will also concern ourselves with the question of the optimal margin classifier, which will act as a stepping stone for the introduction to Lagrange duality. Another important aspect of SVMs is the notion of kernels, which allow SVMs to be applied efficiently in high dimensional feature spaces. Let's start by setting up our problem. In this case the context is known from before: we have images from different classes and we want to classify them accordingly. In other words this is a binary classification problem. Based on this classification we will be able to retrieve images that are similar to our query image. Figure 4 depicts two classes of images, the positive and the negative. For the sake of the example let's consider the circles to be the positive and the triangles to be the negative. We also have a hyperplane separating them, as well as three labeled data points. Notice that point A is the furthest from the decision boundary. In order to make a prediction for the value of the label $\vec{y}_i$ at point A, one might say that in this particular case we can be more confident that the value of the label is going to be $\vec{y}_i = 1$. On the other hand, for point C, even though it is on the correct side of the decision boundary where we would have predicted a label value of $\vec{y}_i = 1$, a small change to the decision boundary could have caused the prediction to be negative, $\vec{y}_i = -1$. Therefore, one can say that we are much more confident about our prediction at A than at C. Point B lies in between these two cases. One can extrapolate and say that if a point is far from the separating hyperplane, then we might be significantly more confident in our predictions. What we are striving for is, given a training set, to find a decision boundary that allows us to make correct and confident predictions (i.e. far from the decision boundary) on the training examples.

[Figure 4 shows the positive and negative examples projected on a 2-D plane, separated by the hyperplane $\vec{w}^T\vec{x} + \vec{b} = 0$, with three labeled points A, B and C at decreasing distance from the decision boundary.]

Figure 4: Images projected on a 2-D plane.

Let's consider our binary classification problem where we have labels $\vec{y} \in \{-1, 1\}$ and features $\vec{x}$. Then our binary classifier might look like

$$h_{\vec{w},\vec{b}}(\vec{x}) = g(\vec{w}^T\vec{x} + \vec{b}), \qquad g(z) = \begin{cases} 1 & \text{if } z \ge 0,\\ -1 & \text{otherwise.}\end{cases} \qquad (1)$$
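In code, the classifier of Equation 1 is just a thresholded linear score; a minimal numpy sketch, assuming $\vec{w}$ and $\vec{b}$ have already been learned (the function name is illustrative only):

import numpy as np

def svm_predict(x, w, b):
    """h_{w,b}(x) = g(w^T x + b): +1 if the score is non-negative, -1 otherwise."""
    return 1 if w @ x + b >= 0 else -1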
Now we have to distinguish between two different notions of margin, namely the functional and the geometric margin. The functional margin of $(\vec{w}, \vec{b})$ with respect to the training example $(\vec{x}_i, \vec{y}_i)$ is

$$\hat{\vec{\gamma}}_i = \vec{y}_i(\vec{w}^T\vec{x}_i + \vec{b}) \qquad (2)$$

If $\vec{y}_i = 1$, then for our prediction to be confident and correct (i.e. for the functional margin to be large), $\vec{w}^T\vec{x} + \vec{b}$ needs to be a large positive number. If $\vec{y}_i = -1$, then for the functional margin to be large (i.e. to make a confident and correct prediction), $\vec{w}^T\vec{x} + \vec{b}$ needs to be a large negative number. Note that if we replace $\vec{w}$ with $2\vec{w}$ and $\vec{b}$ with $2\vec{b}$ in Equation 1, then since $g(\vec{w}^T\vec{x} + \vec{b}) = g(2\vec{w}^T\vec{x} + 2\vec{b})$, this would not change $h_{\vec{w},\vec{b}}$ at all, which means that it depends only on the sign, but not on the magnitude, of $\vec{w}^T\vec{x} + \vec{b}$. Regarding the geometric margin, we will try to interpret it using Figure 5.

[Figure 5 shows a training point A at distance $\gamma_i$ from the decision boundary, the normal vector $\vec{w}$, and the foot B of the perpendicular from A onto the boundary.]

Figure 5: Interpretation of geometric margin.

We can see the decision boundary corresponding to $(\vec{w}, \vec{b})$ along with the orthogonal vector $\vec{w}$. Point A represents some training example $\vec{x}_i$ with label $\vec{y}_i = 1$. Its distance to the decision boundary, denoted by $\gamma_i$, is given by the line segment AB. How can we compute $\gamma_i$? If we consider $\vec{w}/\|\vec{w}\|$ to be a unit-length vector pointing in the same direction as $\vec{w}$, then point $B = \vec{x}_i - \gamma_i \cdot \vec{w}/\|\vec{w}\|$. Since this point lies on the decision boundary, it satisfies $\vec{w}^T\vec{x} + \vec{b} = 0$, as do all points $\vec{x}$ on the decision boundary. Substituting $\vec{x}$ with B we get $\vec{w}^T(\vec{x}_i - \gamma_i \cdot \vec{w}/\|\vec{w}\|) + \vec{b} = 0$. Solving for $\gamma_i$ we get

$$\gamma_i = \frac{\vec{w}^T\vec{x}_i + \vec{b}}{\|\vec{w}\|} = \left(\frac{\vec{w}}{\|\vec{w}\|}\right)^T\vec{x}_i + \frac{\vec{b}}{\|\vec{w}\|}.$$

Usually the geometric margin of $(\vec{w}, \vec{b})$ with respect to the training example $(\vec{x}_i, \vec{y}_i)$ is defined to be

$$\vec{\gamma}_i = \vec{y}_i\left(\left(\frac{\vec{w}}{\|\vec{w}\|}\right)^T\vec{x}_i + \frac{\vec{b}}{\|\vec{w}\|}\right) \qquad (3)$$
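The following small numpy sketch computes the functional margin of Equation 2 and the geometric margin of Equation 3 for every training example, assuming the examples are stacked as rows of a matrix X with labels y in {−1, +1} (the function name is illustrative only):

import numpy as np

def margins(w, b, X, y):
    """Per-example functional margin y_i (w^T x_i + b) and geometric margin
    y_i ((w / ||w||)^T x_i + b / ||w||)."""
    functional = y * (X @ w + b)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric

# the margin of (w, b) with respect to the whole training set:
# gamma = margins(w, b, X, y)[1].min()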

If $\|\vec{w}\| = 1$, then the functional margin is equal to the geometric margin. Notice also that the geometric margin is invariant to rescaling of the parameters (i.e. if we replace $\vec{w}$ with $2\vec{w}$ and $\vec{b}$ with $2\vec{b}$, then the geometric margin does not change). This way it is possible to impose an arbitrary scaling constraint on $\vec{w}$ without changing anything significant in our original equation. Given a dataset $D = \{(\vec{x}_i, \vec{y}_i);\ i = 1, \ldots, N\}$, it is also possible to define the geometric margin of $(\vec{w}, \vec{b})$ with respect to D as the smallest of the geometric margins on the individual training examples, $\gamma = \min_{i=1,\ldots,N}\vec{\gamma}_i$. Thus, the goal for our classifier is to find a decision boundary that maximizes the geometric margin in order to reflect a confident and correct set of predictions, resulting in a classifier that separates the positive and negative training examples with a large geometric margin. Supposing that our training data are linearly separable, how do we find a separating hyperplane that achieves the maximum geometric margin? We start by posing the following optimisation problem

$$\max_{\vec{\gamma}, \vec{w}, \vec{b}} \ \vec{\gamma} \qquad (4)$$
$$\text{s.t.} \quad \vec{y}_i(\vec{w}^T\vec{x}_i + \vec{b}) \ge \vec{\gamma}, \quad i = 1, \ldots, N$$
$$\|\vec{w}\| = 1$$

The $\|\vec{w}\| = 1$ constraint ensures that the functional margin equals the geometric margin; in this way we are guaranteed that all the geometric margins are at least $\vec{\gamma}$. Since the $\|\vec{w}\| = 1$ constraint is a non-convex one, and hard to solve, we will instead try to transform the problem into an easier one. Consider

$$\max_{\hat{\vec{\gamma}}, \vec{w}, \vec{b}} \ \frac{\hat{\vec{\gamma}}}{\|\vec{w}\|} \qquad (5)$$
$$\text{s.t.} \quad \vec{y}_i(\vec{w}^T\vec{x}_i + \vec{b}) \ge \hat{\vec{\gamma}}, \quad i = 1, \ldots, N$$

Notice that we have gotten rid of the constraint $\|\vec{w}\| = 1$ that was making our objective difficult, and since $\vec{\gamma} = \hat{\vec{\gamma}}/\|\vec{w}\|$, this will still provide an acceptable and correct answer. The main problem is that our objective function $\hat{\vec{\gamma}}/\|\vec{w}\|$ is still non-convex, thus we have to keep searching for a different representation. Recall that we can add an arbitrary scaling constraint on $\vec{w}$ and $\vec{b}$ without changing anything in our original formulation. We will introduce the scaling constraint that the functional margin of $(\vec{w}, \vec{b})$ with respect to the training set $(\vec{x}_i, \vec{y}_i)$ must be 1, that is $\hat{\vec{\gamma}} = 1$. Multiplying $\vec{w}, \vec{b}$ by some constant multiplies the functional margin by the same constant, so one can always satisfy this scaling constraint by rescaling $(\vec{w}, \vec{b})$. If we plug this constraint into Equation 5 and substitute $\hat{\vec{\gamma}}/\|\vec{w}\| = 1/\|\vec{w}\|$, then we get the following optimization problem.

$$\min_{\vec{w}, \vec{b}} \ \frac{1}{2}\|\vec{w}\|^2 \qquad (6)$$
$$\text{s.t.} \quad \vec{y}_i(\vec{w}^T\vec{x}_i + \vec{b}) \ge 1, \quad i = 1, \ldots, N$$

Notice that maximizing $\hat{\vec{\gamma}}/\|\vec{w}\| = 1/\|\vec{w}\|$ is the same thing as minimizing $\|\vec{w}\|^2$. We have now transformed our optimization problem into one with a convex quadratic objective and linear constraints, which can be solved using quadratic programming. The solution to the above optimization problem will give us the optimal margin classifier, which will lead to the dual form of our optimization problem, which in turn plays an important role in the use of kernels to obtain optimal margin classifiers that work efficiently in very high dimensional spaces. We can re-express the constraints of Equation 6 as $g_i(\vec{w}) = -\vec{y}_i(\vec{w}^T\vec{x}_i + \vec{b}) + 1 \le 0$. Notice that constraints that hold with equality, $g_i(\vec{w}) = 0$, correspond to training examples $(\vec{x}_i, \vec{y}_i)$ that have functional margin exactly equal to one. Let's have a look at the figure below.

[Figure 6 shows the decision boundary (separating hyperplane) together with the three training points that lie closest to it.]

Figure 6: Support vectors and maximum margin separating hyperplane.

The three points that lie on the margin (two positive and one negative) are the ones with the smallest margins and thus closest to the decision boundary. These points are called support vectors, and they are usually fewer in number than the training set. In order to tackle the problem we frame it as a Lagrangian optimization problem

$$L(\alpha, \vec{w}, \vec{b}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{N}\alpha_i\left[\vec{y}_i\left(\vec{w}^T\vec{x}_i + \vec{b}\right) - 1\right]. \qquad (7)$$
with only one set of Lagrange multipliers, $\alpha_i$, since the problem has only inequality constraints and no equality constraints. First, we have to find the dual form of the problem; to do so we need to minimize $L(\alpha, \vec{w}, \vec{b})$ with respect to $\vec{w}$ and $\vec{b}$ for a fixed $\alpha$. Setting the derivatives of $L$ with respect to $\vec{w}$ and $\vec{b}$ to zero, we get:

$$\nabla_{\vec{w}}L(\alpha, \vec{w}, \vec{b}) = \vec{w} - \sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i = 0 \qquad (8)$$

$$\vec{w} = \sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i \qquad \text{(derivative with respect to } \vec{w}) \qquad (9)$$

$$\frac{\partial L(\alpha, \vec{w}, \vec{b})}{\partial\vec{b}} = \sum_{i=1}^{N}\alpha_i\vec{y}_i = 0 \qquad \text{(derivative with respect to } \vec{b}) \qquad (10)$$

Substituting Equation 9 back into Equation 7 and simplifying, we get

$$L(\alpha, \vec{w}, \vec{b}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j(\vec{x}_i)^T\vec{x}_j - \vec{b}\sum_{i=1}^{N}\alpha_i\vec{y}_i \qquad (11)$$

$$L(\alpha, \vec{w}, \vec{b}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j(\vec{x}_i)^T\vec{x}_j \qquad \text{(from Eq. 10 the last term is 0)} \qquad (12)$$

Utilizing the constraint $\alpha_i \ge 0$ and the constraint from Equation 10, the following dual optimization problem arises:

$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j\langle\vec{x}_i, \vec{x}_j\rangle \qquad (13)$$
$$\text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \ldots, N$$
$$\sum_{i=1}^{N}\alpha_i\vec{y}_i = 0.$$

If we are able to solve the dual problem, in other words find the $\alpha$ that maximizes $W(\alpha)$, then we can use Equation 9 in order to find the optimal $\vec{w}$ as a function of $\alpha$. Once we have found the optimal $\vec{w}^*$, considering the primal problem we can also find the optimal value for the intercept term $\vec{b}$:

$$\vec{b}^* = -\frac{\max_{i:\vec{y}_i=-1}\vec{w}^{*T}\vec{x}_i + \min_{i:\vec{y}_i=1}\vec{w}^{*T}\vec{x}_i}{2} \qquad (14)$$
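As an illustration of how Equations 9, 13 and 14 fit together, here is a sketch that solves the dual with the cvxopt quadratic-programming package (one possible choice, not the paper's accompanying MATLAB implementation), assuming linearly separable data stored as rows of a matrix X with labels y in {−1, +1}; the function name is illustrative only.

import numpy as np
from cvxopt import matrix, solvers

def train_linear_svm(X, y):
    """Solve the dual (Eq. 13) for linearly separable data, then recover
    the optimal w* via Eq. 9 and the intercept b* via Eq. 14."""
    y = y.astype(float)
    N = X.shape[0]
    K = X @ X.T                                     # Gram matrix of inner products <x_i, x_j>
    P = matrix(np.outer(y, y) * K)                  # quadratic term of the dual objective
    q = matrix(-np.ones(N))                         # maximizing sum(alpha) == minimizing -sum(alpha)
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))  # alpha_i >= 0
    A = matrix(y.reshape(1, -1))                    # sum_i alpha_i y_i = 0
    alpha = np.ravel(solvers.qp(P, q, G, h, A, matrix(0.0))['x'])
    w = (alpha * y) @ X                             # Eq. 9
    b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2   # Eq. 14
    return w, b, alpha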
Suppose we have fit the parameters of our model to a training set, and we now wish to make a prediction at a new input point $\vec{x}$. We would then calculate $\vec{w}^T\vec{x} + \vec{b}$ and predict $\vec{y} = 1$ if and only if this quantity is bigger than zero. But using Equation 9, this quantity can also be written:

$$\vec{w}^T\vec{x} + \vec{b} = \left(\sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i\right)^T\vec{x} + \vec{b} \qquad (15)$$
$$= \sum_{i=1}^{N}\alpha_i\vec{y}_i\langle\vec{x}_i, \vec{x}\rangle + \vec{b} \qquad (16)$$

Earlier we saw that the values of $\alpha_i$ will all be zero except for the support vectors, so many of the terms in the sum above will be zero. We really need to find only the inner products between $\vec{x}$ and the support vectors in order to calculate Equation 16 and make our prediction. We will exploit this property of using inner products between input feature vectors in order to apply kernels to our classification problem. To talk about kernels we have to think about our input data. In this case, as we have already mentioned, we are referring to images, and images are usually described by a number of pixel values, which we will refer to as attributes, indicating the different intensity colors across the three color channels {R, G, B}. When we process the pixel values in order to retrieve more meaningful representations, in other words when we map our initial pixel values through some processing operation to some new values, these new values are called features and the operation is referred to as a feature mapping, usually denoted as $\phi$.

Instead of applying SVM directly to the attributes $\vec{x}$, we may want to use SVM to learn from some features $\phi(\vec{x}_i)$. Since the SVM algorithm can be written entirely in terms of inner products $\langle\vec{x}, \vec{z}\rangle$, we can instead replace them with $\langle\phi(\vec{x}), \phi(\vec{z})\rangle$. This way, given a feature mapping $\phi$, the corresponding kernel is defined as $K(\vec{x},\vec{z}) = \phi(\vec{x})^T\phi(\vec{z})$. If we replace every inner product $\langle\vec{x},\vec{z}\rangle$ in the algorithm with $K(\vec{x},\vec{z})$, then the learning process will be happening using the features $\phi$.

One can compute $K(\vec{x},\vec{z})$ by finding $\phi(\vec{x})$ and $\phi(\vec{z})$, even though they may be expensive to calculate because of their high dimensionality. Kernels such as $K(\vec{x},\vec{z})$ allow SVMs to perform learning in high dimensional feature spaces without the need to explicitly find or represent the vectors $\phi(\vec{x})$. For instance, suppose $\vec{x},\vec{z} \in \mathbb{R}^n$, and let's consider $K(\vec{x},\vec{z}) = (\vec{x}^T\vec{z})^2$, which is equivalent to

$$K(\vec{x},\vec{z}) = \left(\sum_{i=1}^{n}x_iz_i\right)\left(\sum_{j=1}^{n}x_jz_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}x_ix_jz_iz_j = \sum_{i,j=1}^{n}(x_ix_j)(z_iz_j) = \phi(\vec{x})^T\phi(\vec{z}) \qquad (17)$$

For n = 3 the feature mapping $\phi$ is computed as

$$\phi(\vec{x}) = \begin{bmatrix} x_1x_1\\ x_1x_2\\ x_1x_3\\ x_2x_1\\ x_2x_2\\ x_2x_3\\ x_3x_1\\ x_3x_2\\ x_3x_3 \end{bmatrix}$$
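A quick numerical check of Equation 17 for n = 3: the kernel value $(\vec{x}^T\vec{z})^2$, computed in O(n) time, agrees with the explicit inner product $\phi(\vec{x})^T\phi(\vec{z})$ in the 9-dimensional feature space (the two vectors below are arbitrary examples).

import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2 with n = 3: all products x_i x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))   # kernel trick: both sides agree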
Broadly speaking, a kernel $K(\vec{x},\vec{z}) = (\vec{x}^T\vec{z} + c)^d$ corresponds to a feature mapping into an $\binom{n+d}{d}$-dimensional feature space. $K(\vec{x},\vec{z})$ still takes only O(n) time to compute, even though it is operating in an $O(n^d)$-dimensional space, because it doesn't need to explicitly represent feature vectors in this high dimensional space. If we think of $K(\vec{x},\vec{z})$ as some measurement of how similar $\phi(\vec{x})$ and $\phi(\vec{z})$ are, or $\vec{x}$ and $\vec{z}$, then we might expect $K(\vec{x},\vec{z}) = \phi(\vec{x})^T\phi(\vec{z})$ to be large if $\phi(\vec{x})$ and $\phi(\vec{z})$ are close together, and vice versa. Suppose that for some learning problem we have thought of some kernel function $K(\vec{x},\vec{z})$, considered to be a reasonable measure of how similar $\vec{x}$ and $\vec{z}$ are. For instance,

$$K(\vec{x},\vec{z}) = \exp\left(-\frac{\|\vec{x}-\vec{z}\|^2}{2\sigma^2}\right) \qquad (18)$$

which is close to 1 when $\vec{x}$ and $\vec{z}$ are close and close to 0 otherwise. The question then becomes: can we use this definition as the kernel in an SVM algorithm? In general, given any function K, is there any process that will allow us to decide whether there exists some feature mapping $\phi$ such that $K(\vec{x},\vec{z}) = \phi(\vec{x})^T\phi(\vec{z})$ for all $\vec{x}$ and $\vec{z}$; in other words, is it a valid kernel or not?
If we suppose that K is a valid kernel, then $\phi(\vec{x}_i)^T\phi(\vec{x}_j) = \phi(\vec{x}_j)^T\phi(\vec{x}_i)$, meaning that the kernel matrix, denoted $K_m$ and describing the similarity between data points $\vec{x}_i$ and $\vec{x}_j$, must be symmetric. If we denote by $\phi_k(\vec{x})$ the k-th coordinate of the vector $\phi(\vec{x})$, then for any vector $\vec{z}$ we have

$$\vec{z}^TK\vec{z} = \sum_i\sum_j z_iK_{ij}z_j = \sum_i\sum_j z_i\phi(\vec{x}_i)^T\phi(\vec{x}_j)z_j = \sum_i\sum_j z_i\sum_k\phi_k(\vec{x}_i)\phi_k(\vec{x}_j)z_j = \sum_k\sum_i\sum_j z_i\phi_k(\vec{x}_i)\phi_k(\vec{x}_j)z_j = \sum_k\left(\sum_i z_i\phi_k(\vec{x}_i)\right)^2 \ge 0 \qquad (19)$$

which shows that the kernel matrix $K_m$ is positive semi-definite ($K \succeq 0$), since our choice of $\vec{z}$ was arbitrary. So if K is a valid kernel, meaning that it corresponds to some feature mapping $\phi$, then the corresponding kernel matrix $K_m \in \mathbb{R}^{m\times m}$ is symmetric positive semi-definite. This is a necessary and sufficient condition for $K_m$ to correspond to a valid kernel, also called a Mercer kernel. The take-away message is that if you have any algorithm that you can write in terms of only inner products $\langle\vec{x},\vec{z}\rangle$ between the input attribute vectors, then by replacing them with a kernel $K(\vec{x},\vec{z})$ you permit your algorithm to work efficiently in a high dimensional feature space.
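To illustrate the Mercer condition numerically, the sketch below builds the Gaussian kernel matrix of Equation 18 for a few arbitrary random points and verifies that it is symmetric and positive semi-definite (all eigenvalues non-negative up to numerical tolerance).

import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_m[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), Equation 18."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.rand(20, 5)                        # 20 arbitrary 5-dimensional points
Km = gaussian_kernel_matrix(X)
assert np.allclose(Km, Km.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(Km) >= -1e-10)  # positive semi-definite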
Switching gears for a moment and returning to our problem of actually classifying and semantically retrieving similar images: since we now have an understanding of how the SVM algorithm functions, we can use it in our application. Recall that we have the following dataset $D_{10\text{-}classes}$, where each of the 10 categories/classes contains 100 images:

    Africa       Monuments    Animals      People       ...
    image_1      image_1      image_1      image_1
    image_2      image_2      image_2      image_2
    image_3      image_3      image_3      image_3
    ...          ...          ...          ...
    image_100    image_100    image_100    image_100

As in Section 3.1, we have our dataset and our query image, in this case denoted by $\vec{q}$, vectorized, in order to perform mathematical operations seamlessly. To make things even more explicit, imagine that our system (i.e. the MATLAB software accompanying this tutorial) or the algorithm (i.e. the SVM in this case) receives a query image $\vec{q}$ from the user, and its job is to find and return to the user all the images which are similar to the query $\vec{q}$. For instance, if the query image $\vec{q}$ of the user depicts a monument, then the job of our system or algorithm is to return to the user all the images depicting monuments from our dataset D.
In other words, we are treating our problem as a multiclass classification problem. Generally speaking, there are two broad approaches by which we can resolve this issue using the SVM algorithm. The first one is called the "one-vs-all" approach and involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label. The second approach is called "one-vs-one", and usually one has to train $\binom{k}{2} = \frac{k(k-1)}{2}$ binary classifiers for a k-way multiclass problem. Each classifier receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time a voting scheme is applied: all $\binom{k}{2}$ classifiers are applied to an unseen sample, and the class that gets the highest number of "+1" predictions is predicted by the combined classifier. This is the method that the accompanying software utilizes for the SVM solution (a sketch is given below).
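As a hedged sketch of this one-vs-one strategy (the accompanying software implements it in MATLAB; here scikit-learn's SVC is used, which trains the pairwise classifiers and applies the voting internally), assume X holds the feature vectors $\vec{\rho}$ of Section 2, y the class labels, and that images_by_class is a hypothetical mapping from a class label to the images of that class; retrieve_similar is an illustrative name only.

import numpy as np
from sklearn.svm import SVC

def retrieve_similar(q_features, X, y, images_by_class):
    """Train a one-vs-one SVM on the dataset features and return all images
    belonging to the class predicted for the query features q_features."""
    clf = SVC(kernel="rbf", decision_function_shape="ovo")  # k(k-1)/2 pairwise classifiers + voting
    clf.fit(X, y)
    predicted_class = clf.predict(q_features.reshape(1, -1))[0]
    return images_by_class[predicted_class]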
Notice that if the SVM algorithm predicts the wrong class label for a query image
~q, then we end up retrieving and returning to the user all the images from the wrong
category/class since we predicted the wrong label. How can we compensate for this
shortcoming? This is left as an exercise to the reader to practice his/her skills.

References

[1] M. Pelillo, Pattern Recognition Letters, Vol. 38, pp. 34–37, 2014.

[2] G. Sebestyen, Review of Learning Machines, IEEE Transactions on Information Theory, Vol. 12, No. 3, p. 407, 1965–1966.

[3] E. Fix and J. L. Hodges, Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, International Statistical Institute, Vol. 57, No. 3, pp. 238–247, 1989.

[4] T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, Vol. 13, pp. 21–27, 1967.

[5] N. Nilsson, Learning Machines: Foundations of Trainable Pattern Classifying Systems, first edition, 1965.

[6] Alhazen (Ibn al-Haytham), widely considered to be one of the first theoretical physicists, c.965–c.1040 CE.
