Content-Based Image Retrieval Tutorial
Joani Mitro
giannismitros@gmail.com
Technical Report
https://github.com/kirk86/ImageRetrieval
Notation
Table 1: Notation
X, Y, M                                              bold face roman letters indicate matrices
~x, ~y, ~v                                           bold face small letters indicate vectors
a, b, c                                              small letters indicate scalar values
(\sum_{i=1}^{N} |x_i - y_i|^p)^{1/p}                 p-norm distance function
\lim_{p\to\infty} (\sum_{i=1}^{N} |x_i - y_i|^p)^{1/p}   infinity norm distance function
g(z)                                                 nonlinear function (e.g. sigmoid, tanh, etc.)
z = W^T ~x + ~b                                      score function, mapping function
||~b||_2 = 1                                         vector norm
φ(~x)                                                feature mapping function
K(~x, ~z)                                            kernel function
K_m                                                  kernel matrix
1 Introduction
As we have already mentioned, this tutorial serves as an introduction to the field of
information retrieval for the interested reader. Apart from that, there has always been
a motivation for the development of efficient media retrieval systems, since the new
era of digital communication has brought an explosion of multimedia data over the
internet. This trend has continued with the increasing popularity of imaging devices,
such as the digital cameras that nowadays are an inseparable part of any smartphone,
together with an increasing proliferation of image data over communication networks.
2 Data pre-processing
Like in any other case before we use our data we first have to clean them if that is
necessary and transform them into a format that is understanble by the prediction
algorithms. In this particular case the process that has been adopted includes the
following six steps, applied for each image in our dataset D, in order to transform
the raw pixel images into something meaningful that the prediction algorithms can
understand. In another sense, we map the raw pixel values into a feature space.
1. We start by computing the color histogram for each image. In this case the HSV
color space has been chosen and the H, S and V components are uniformly quantized
into 8, 2 and 2 bins respectively. This produces a vector of 32 elements/values
for each image.
2. The next step is to compute the color auto-correlogram for each image, where
the image is quantized into 4 ×4 ×4 = 64 colors in the RGB space. This process
produces a vector of 64 elements/values for each image.
3. Next, we extract the first two moments (i.e. mean and standard deviation) for
each R,G,B color channel. This gives us a vector of 6 elements/values.
4. Moving forward, we compute the mean and standard deviation of the Gabor
wavelet coefficients, which produces a vector of 48 elements/values. This com-
putation requires applying the Gabor wavelet filters to each image spanning
across four scales (0.05, 0.1, 0.2, 0.4) and six orientations (θ_0 = 0, θ_{n+1} = θ_n + π/6).
5. Last but not least, we apply the wavelet transform to each image with a 3-level
decomposition. In this case the mean and standard deviation of the transform
coefficients are utilized to form the feature vector of 40 elements/values for each
image.
6. Finally, we combine all the vectors from steps 1–5 into a new vector ρ~ =
32 + 64 + 6 + 48 + 40 + 1, where each number indicates the dimensionality of the
vectors from steps 1–5 that have been concatenated into the new vector ρ~. A hedged
sketch of the first of these feature-extraction steps is given below.
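To make step 1 concrete, the following is a minimal Python sketch of the HSV colour histogram, assuming an RGB image supplied as a float array in [0, 1] and that scikit-image is available; the function name and the normalisation at the end are illustrative choices, not taken from the accompanying MATLAB code.

```python
# Hedged sketch of step 1 (HSV colour histogram with 8 x 2 x 2 bins).
import numpy as np
from skimage.color import rgb2hsv  # assumption: scikit-image is installed

def hsv_histogram(rgb_image, bins=(8, 2, 2)):
    """Quantize H, S and V into 8, 2 and 2 bins and return a 32-element vector."""
    hsv = rgb2hsv(rgb_image)                       # H, S, V each in [0, 1]
    hist, _ = np.histogramdd(hsv.reshape(-1, 3),
                             bins=bins,
                             range=[(0, 1)] * 3)
    hist = hist.flatten()                          # 8 * 2 * 2 = 32 values
    return hist / hist.sum()                       # normalise to unit mass
```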
3 Methodology
3.1 k-Nearest Neighbour
The k-Nearest Neighbour (k-NN) classifier belongs to the family of instance-based learning
(IBL) algorithms. IBL algorithms construct hypotheses directly from the training data
themselves, which means that the hypothesis complexity can grow with the data. One
of their advantages is the ability to adapt the model to previously unseen data. Other
advantages are the low cost of updating object instances and the fast learning rate,
since no training is required. Some other examples of IBL algorithms besides k-NN
are kernel machines and RBF networks. Some of the disadvantages of IBL algorithms,
including k-NN, besides the computational complexity already mentioned, are that
they fail to produce good results with noisy, irrelevant, nominal or missing attribute
values, and that they do not provide a natural way of explaining how the data is
structured. The efficacy of the k-NN algorithm relies on a user-defined similarity
function, for instance a p-norm distance function, which determines the nearest
neighbours among the chosen set of examples. k-NN is also often used as a
base procedure in benchmarking and comparative studies: since it requires no training,
any trained rule is expected to perform better than it, and if a trained rule does not,
that rule is deemed useless for the application under study.
Since the nearest neighbour rule is a fairly simple algorithm, most textbooks give it
a short reference but neglect to mention who invented the rule in the first place.
Marcello Pelillo [1] tried to give an answer to this question. Pelillo often refers to
the famous Cover and Hart paper (1967) [4], which shows what happens if a very large,
selectively chosen training set is used. Before Cover and Hart the rule was mentioned
by Nilsson (1965) [5], who called it the "minimum distance classifier", and by
Sebestyen (1962) [2], who called it the "proximity algorithm". Fix and Hodges [3], in
their very early discussion of non-parametric discrimination (1951), already pointed
to the nearest neighbour rule as well. The fundamental principle known as Ockham's
razor, "select the hypothesis with the fewest assumptions", can be understood as the
nearest neighbour rule for nominal properties. It is, however, not formulated in terms
of observations. Ockham worked in the 14th century and emphasized observations
before ideas. Pelillo pointed out that this had already been done prior to Ockham by
Alhazen [6] (Ibn al-Haytham), a very early scientist (965–1040) in the field of optics
and perception. Pelillo cites some paragraphs in which he shows that Alhazen describes
a training procedure as a "universal form" which is completed by recalling the original
objects, which Alhazen referred to as "particular forms".
To better understand the k-NN rule we will set up the concept of our application.
Suppose that we have a dataset comprised of 1000 images in total, categorized in 10
different categories/classes, where each one includes 100 images.
Given an image I ∈ R^{m×n} we would like to find all possible similar images from
the pool of candidate images (i.e. all similar images from the dataset of 1000 total
images). A sensible first-attempt algorithm would look something like the following.
[Figure: the output of this first attempt, a table with one entry per image in the dataset D, where each cell indicates the similarity between image I and that image.]
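A hedged Python sketch (not the accompanying MATLAB code) of this first attempt is shown below: three nested loops compare the query image I pixel by pixel with every image in D, which is exactly what gives the O(Dmn) complexity discussed next. The Euclidean distance and all names are illustrative choices.

```python
# Naive first-attempt retrieval: loop over the dataset and over every pixel.
import numpy as np

def naive_knn(query, dataset, k=10):
    """query: (m, n) array; dataset: list of (m, n) arrays; returns top-k indices."""
    distances = []
    for image in dataset:                        # loop over the D images
        d = 0.0
        for i in range(query.shape[0]):          # loop over rows
            for j in range(query.shape[1]):      # loop over columns
                d += (query[i, j] - image[i, j]) ** 2
        distances.append(np.sqrt(d))
    return np.argsort(distances)[:k]             # smallest distance = most similar
```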
The complexity of the naive k-NN is O(Dmn). Can we do better than that? Of
course we can, if we avoid some of those loops by vectorizing our main operations.
Instead of operating on the 2-D images we can vectorize them first and then perform
the operations. First we transform our images from 2-D matrices to 1-D vectors, as
demonstrated in the figure below.
[Figure: an m × n image I reshaped column-wise into a single (m · n)-dimensional vector.]
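Under the same assumptions as the previous snippet, here is a minimal sketch of the vectorized alternative: the images are flattened into the rows of a matrix and all D distances are computed with a single matrix expression instead of explicit Python loops.

```python
# Vectorized retrieval: one reshape and one batched distance computation.
import numpy as np

def vectorized_knn(query, dataset, k=10):
    """query: (m, n) array; dataset: (D, m, n) array; returns top-k indices."""
    q = query.reshape(-1)                          # (m*n,) vector
    X = dataset.reshape(dataset.shape[0], -1)      # (D, m*n) matrix
    distances = np.linalg.norm(X - q, axis=1)      # all D distances at once
    return np.argsort(distances)[:k]
```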
3.2 Support Vector Machines

Consider the figure below, in which three points A, B and C are marked at different
distances from a separating decision boundary. Point A lies far from the decision
boundary; if we are asked to predict its label, we can be quite confident that ~y_i = 1.
Point C, even though it is on the correct side of the decision boundary, where we would
have predicted a label value of ~y_i = 1, is close enough to it that a small change to the
decision boundary could have caused the prediction to be negative, ~y_i = −1. Therefore,
one can say that we are much more confident about our prediction at A than at C. Point
B lies in between these two cases. One can extrapolate and say that if a point is far from
the separating hyperplane, then we can be significantly more confident in our prediction.
What we are striving for is, given a training set, to find a decision boundary that allows
us to make correct and confident predictions (i.e. far from the decision boundary) on
the training examples.
[Figure: the separating hyperplane defined by ~w and ~b, with the points A, B and C marked at different distances from it.]
Let’s consider our binary classification problem where we have labels ~y ∈ {−1, 1}
and features ~x. Then our binary classifier might look like

h_{\vec{w},\vec{b}}(\vec{x}) = g(\vec{w}^{T}\vec{x} + \vec{b}), \qquad g(z) = \begin{cases} 1 & \text{if } z \geq 0, \\ -1 & \text{otherwise.} \end{cases}    (1)
Now we have to distinguish between two different notions of margin: the functional
and the geometric margin. The functional margin of (~w, ~b) with respect to
the training example (~x_i, ~y_i) is

\hat{\gamma}_i = \vec{y}_i\,(\vec{w}^{T}\vec{x}_i + \vec{b})    (2)
If ~y_i = 1, then for our prediction to be confident and correct (i.e. for the functional
margin to be large), ~w^T ~x + ~b needs to be a large positive number. If ~y_i = −1, then
for the functional margin to be large (i.e. to make a confident and correct prediction),
~w^T ~x + ~b needs to be a large negative number. Note that if we replace ~w with 2~w
and ~b with 2~b in Equation 1, then since g(~w^T ~x + ~b) = g(2~w^T ~x + 2~b), h_{~w,~b}
would not change at all, which means that it depends only on the sign, but not on the
magnitude, of ~w^T ~x + ~b. Regarding the geometric margin, we will try to interpret
it using Figure 5.
[Figure 5: a training example A at distance γ_i from the decision boundary defined by (~w, ~b), with the normal vector ~w and B the projection of A onto the boundary.]
We can see the decision boundary corresponding to (~w, ~b) along with the orthogonal
vector ~w. Point A represents some training example ~x_i with label ~y_i = 1. Its
distance to the decision boundary, denoted by γ_i, is given by the line segment AB.
How can we compute γ_i? If we take ~w/||~w|| to be the unit-length vector pointing in
the same direction as ~w, then point B = ~x_i − γ_i · ~w/||~w||. Since this point lies on
the decision boundary, it satisfies ~w^T ~x + ~b = 0, as do all points on the decision
boundary. Substituting B for ~x we get ~w^T (~x_i − γ_i · ~w/||~w||) + ~b = 0. Solving
for γ_i we get γ_i = (~w^T ~x_i + ~b)/||~w||, that is

\gamma_i = \left(\frac{\vec{w}}{\|\vec{w}\|}\right)^{T}\vec{x}_i + \frac{\vec{b}}{\|\vec{w}\|}    (3)
If ||~w|| = 1, then the functional margin is equal to the geometric margin. Notice
also that the geometric margin is invariant to rescaling of the parameters (i.e. if we
replace ~w with 2~w and ~b with 2~b, then the geometric margin does not change). This
way it is possible to impose an arbitrary scaling constraint on ~w without changing
anything significant in our original equation. Given a dataset D = {(~x_i, ~y_i); i =
1, . . . , N}, it is also possible to define the geometric margin of (~w, ~b) with respect
to D as the smallest of the geometric margins on the individual training examples,
γ = min_{i=1,...,N} γ_i. Thus, the goal for our classifier is to find a decision boundary
that maximizes the geometric margin in order to reflect a confident and correct set of
predictions, resulting in a classifier that separates the positive and negative training
examples with a large geometric margin. Supposing that our training data are linearly
separable, how do we find a separating hyperplane that achieves the maximum geometric
margin? We start by posing the following optimisation problem
\max_{\gamma,\,\vec{w},\,\vec{b}} \;\; \gamma    (4)
\text{s.t.} \;\; \vec{y}_i(\vec{w}^{T}\vec{x}_i + \vec{b}) \geq \gamma, \quad i = 1, \ldots, N
\|\vec{w}\| = 1
The ||~w|| = 1 constraint ensures that the functional margin equals the geometric
margin; in this way we are guaranteed that all the geometric margins are at least γ.
Since the ||~w|| = 1 constraint is non-convex, and therefore hard to solve, we will
instead transform the problem into an easier one. Consider
\max_{\hat{\gamma},\,\vec{w},\,\vec{b}} \;\; \frac{\hat{\gamma}}{\|\vec{w}\|}    (5)
\text{s.t.} \;\; \vec{y}_i(\vec{w}^{T}\vec{x}_i + \vec{b}) \geq \hat{\gamma}, \quad i = 1, \ldots, N

Recalling that we may impose an arbitrary scaling constraint on ~w and ~b without
changing anything, we can fix the functional margin to \hat{\gamma} = 1, which turns
the problem into

\min_{\vec{w},\,\vec{b}} \;\; \frac{1}{2}\|\vec{w}\|^{2}    (6)
\text{s.t.} \;\; \vec{y}_i(\vec{w}^{T}\vec{x}_i + \vec{b}) \geq 1, \quad i = 1, \ldots, N
Notice that maximizing \hat{\gamma}/\|\vec{w}\| = 1/\|\vec{w}\| is the same thing as
minimizing \|\vec{w}\|^{2}. We have now transformed our optimization problem into one
with a convex quadratic objective and linear constraints, which can be solved using
quadratic programming. The solution to the above optimization problem will give us the
optimal margin classifier, which will lead to the dual form of our optimization problem;
the dual in turn plays an important role in the use of kernels to obtain optimal margin
classifiers that work efficiently in very high dimensional spaces. We can re-express the
constraints of Equation 6 as g_i(~w) = −~y_i(~w^T ~x_i + ~b) + 1 ≤ 0. Notice that
constraints that hold with equality, g_i(~w) = 0, correspond to training examples
(~x_i, ~y_i) that have functional margin equal to one. Let's have a look at the figure
below.
[Figure: the two classes separated by the decision boundary (separating hyperplane); the training examples closest to it determine the margin.]
The three points closest to the decision boundary (two positive and one negative)
are the ones with the smallest margins. Such points are called support vectors, and
they are usually far fewer in number than the training examples. In order to tackle
the problem we frame it as a Lagrangian optimization problem
L(\alpha, \vec{w}, \vec{b}) = \frac{1}{2}\|\vec{w}\|^{2} - \sum_{i=1}^{N}\alpha_i\left[\vec{y}_i\left(\vec{w}^{T}\vec{x}_i + \vec{b}\right) - 1\right].    (7)
with only one set of Lagrange multipliers α_i, since the problem has only inequality
constraints and no equality constraints. First, we have to find the dual form of
the problem; to do so we need to minimize L(α, ~w, ~b) with respect to ~w and ~b for a
fixed α. Setting the derivatives of L with respect to ~w and ~b to zero, we get:
\nabla_{\vec{w}} L(\alpha, \vec{w}, \vec{b}) = \vec{w} - \sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i = 0.    (8)

\vec{w} = \sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i. \qquad \text{(derivative with respect to } \vec{w}\text{)}    (9)

\frac{\partial L(\alpha, \vec{w}, \vec{b})}{\partial \vec{b}} = \sum_{i=1}^{N}\alpha_i\vec{y}_i = 0. \qquad \text{(derivative with respect to } \vec{b}\text{)}    (10)

Plugging Equation 9 back into the Lagrangian gives

L(\alpha, \vec{w}, \vec{b}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j(\vec{x}_i)^{T}\vec{x}_j - \vec{b}\sum_{i=1}^{N}\alpha_i\vec{y}_i.    (11)

L(\alpha, \vec{w}, \vec{b}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j(\vec{x}_i)^{T}\vec{x}_j. \qquad \text{(by Equation 10 the last term vanishes)}    (12)
Utilizing the constraint αi ≥ 0 and the constraint from Equation 10 the following
dual optimization problem arises:
\max_{\alpha} \;\; W(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\vec{y}_i\vec{y}_j\alpha_i\alpha_j\langle\vec{x}_i, \vec{x}_j\rangle    (13)
\text{s.t.} \;\; \alpha_i \geq 0, \quad i = 1, \ldots, N
\sum_{i=1}^{N}\alpha_i\vec{y}_i = 0.
If we are able to solve the dual problem, in other words find the α that maximizes
W(α), then we can use Equation 9 in order to find the optimal ~w as a function of
α. Once we have found the optimal ~w^*, considering the primal problem we can
also find the optimal value for the intercept term ~b:

\vec{b}^{*} = -\,\frac{\max_{i:\vec{y}_i = -1} \vec{w}^{*T}\vec{x}_i + \min_{i:\vec{y}_i = 1} \vec{w}^{*T}\vec{x}_i}{2}    (14)
Suppose we’ve fit the parameters of our model to a training set, and we now wish to
make a prediction at a new input point ~x. We would then calculate ~w^T ~x + ~b, and
predict ~y = 1 if and only if this quantity is bigger than zero. But using Equation 9,
this quantity can also be written:
\vec{w}^{T}\vec{x} + \vec{b} = \left(\sum_{i=1}^{N}\alpha_i\vec{y}_i\vec{x}_i\right)^{T}\vec{x} + \vec{b}    (15)

= \sum_{i=1}^{N}\alpha_i\vec{y}_i\langle\vec{x}_i, \vec{x}\rangle + \vec{b}    (16)
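As a small illustration of Equation 16, the hedged Python sketch below computes the decision value as a weighted sum of inner products between the training points and the new input ~x; the multipliers alphas are assumed to have been obtained by solving the dual problem (13) and are zero except for the support vectors. All names are illustrative.

```python
# Decision value w^T x + b written purely in terms of inner products <x_i, x>.
import numpy as np

def decision_value(x, X_train, y_train, alphas, b):
    """x: (d,); X_train: (N, d); y_train, alphas: (N,); b: scalar intercept."""
    inner_products = X_train @ x                 # <x_i, x> for every training point
    return np.sum(alphas * y_train * inner_products) + b

# Predict +1 if the decision value is non-negative, -1 otherwise.
```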
Earlier we saw that the values of α_i will all be zero except for those of the support
vectors, so many of the terms in the sum in Equation 16 will be zero. We really only
need to find the inner products between ~x and the support vectors in order to calculate
Equation 16 and make our prediction. We will exploit this property of using inner
products between input feature vectors in order to apply kernels to our classification
problem. To talk about kernels we have to think about our input data. In this case, as
already mentioned, we are referring to images, and images are usually described by a
number of pixel values, which we'll refer to as attributes, indicating the color
intensities across the three color channels {R, G, B}. When we process the pixel values
to obtain more meaningful representations, in other words when we map the initial pixel
values through some processing operation to new values, these new values are called
features and the operation is referred to as a feature mapping, usually denoted by φ.
Instead of applying the SVM directly to the attributes ~x, we may want it to learn from
some features φ(~x_i). Since the SVM algorithm can be written entirely in terms of
inner products ⟨~x, ~z⟩, we can replace them with ⟨φ(~x), φ(~z)⟩. This way, given a
feature mapping φ, the corresponding kernel is defined as K(~x, ~z) = φ(~x)^T φ(~z).
If we replace every inner product ⟨~x, ~z⟩ in the algorithm with K(~x, ~z), then
learning takes place using the features φ.
One can compute K(~x, ~z) by first finding φ(~x) and φ(~z), even though these may be
expensive to calculate because of their high dimensionality. Kernels such as K(~x, ~z)
allow SVMs to perform learning in high dimensional feature spaces without the need
to explicitly find or represent the vectors φ(~x). For instance, suppose ~x, ~z ∈ R^n,
and let's consider K(~x, ~z) = (~x^T ~z)^2, which is equivalent to
K(\vec{x}, \vec{z}) = \left(\sum_{i=1}^{n}x_i z_i\right)\left(\sum_{j=1}^{n}x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}x_i x_j z_i z_j = \sum_{i,j=1}^{n}(x_i x_j)(z_i z_j) = \phi(\vec{x})^{T}\phi(\vec{z})    (17)

where, for n = 3, the feature mapping φ is given by

\phi(\vec{x}) = \left[\,x_1x_1,\; x_1x_2,\; x_1x_3,\; x_2x_1,\; x_2x_2,\; x_2x_3,\; x_3x_1,\; x_3x_2,\; x_3x_3\,\right]^{T}
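The equivalence in Equation 17 can be checked numerically; the short Python sketch below (purely illustrative) compares the O(n) kernel evaluation (~x^T ~z)^2 with the O(n^2) inner product of the explicit feature maps built from all pairwise products x_i x_j.

```python
# Numerical check that K(x, z) = (x^T z)^2 equals phi(x)^T phi(z).
import numpy as np

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j (n^2 values)."""
    return np.outer(x, x).flatten()

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

k_implicit = np.dot(x, z) ** 2                 # O(n) kernel evaluation
k_explicit = np.dot(phi(x), phi(z))            # O(n^2) explicit computation
print(np.isclose(k_implicit, k_explicit))      # True
```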
Broadly speaking, a kernel K(~x, ~z) = (~x^T ~z + c)^d corresponds to a feature mapping
into a \binom{n+d}{d}-dimensional feature space. Evaluating K(~x, ~z) still takes O(n)
time even though it operates in an O(n^d)-dimensional space, because it never needs to
explicitly represent feature vectors in this high dimensional space. If we think of
K(~x, ~z) as some measurement of how similar φ(~x) and φ(~z) are, or ~x and ~z, then
we might expect K(~x, ~z) = φ(~x)^T φ(~z) to be large if φ(~x) and φ(~z) are close
together, and small when they are far apart.
Suppose that for some learning problem we have thought of a kernel function
K(~x, ~z) as a reasonable measure of how similar ~x and ~z are. For instance,

K(\vec{x}, \vec{z}) = \exp\left(-\frac{\|\vec{x} - \vec{z}\|^{2}}{2\sigma^{2}}\right)    (18)

which is close to 1 if ~x and ~z are close, and close to 0 otherwise.
The question then becomes: can we use this definition as the kernel in an SVM
algorithm? More generally, given any function K, is there a procedure that tells us
whether there exists some feature mapping φ such that K(~x, ~z) = φ(~x)^T φ(~z) for all
~x and ~z, in other words whether K is a valid kernel or not? If we suppose that K is
a valid kernel, then φ(~x_i)^T φ(~x_j) = φ(~x_j)^T φ(~x_i), meaning that the kernel
matrix, denoted K_m and describing the similarity between data points ~x_i and ~x_j,
must be symmetric. If we denote by φ_k(~x) the k-th coordinate of the vector φ(~x),
then for any vector ~z we have
\vec{z}^{T}K\vec{z} = \sum_{i}\sum_{j} z_i K_{ij} z_j    (19)
= \sum_{i}\sum_{j} z_i\,\phi(\vec{x}_i)^{T}\phi(\vec{x}_j)\,z_j
= \sum_{i}\sum_{j}\sum_{k} z_i\,\phi_k(\vec{x}_i)\phi_k(\vec{x}_j)\,z_j
= \sum_{k}\sum_{i}\sum_{j} z_i\,\phi_k(\vec{x}_i)\phi_k(\vec{x}_j)\,z_j
= \sum_{k}\left(\sum_{i} z_i\,\phi_k(\vec{x}_i)\right)^{2}
\geq 0.
which shows that the kernel matrix K_m is positive semi-definite, since our choice of
~z was arbitrary. So if K is a valid kernel, meaning that it corresponds to some
feature mapping φ, then the corresponding kernel matrix K_m ∈ R^{m×m} is symmetric
positive semi-definite. This is in fact a necessary and sufficient condition for K to
be a valid kernel, also called a Mercer kernel. The take-away message is that if you
have any algorithm that you can write in terms of only inner products ⟨~x, ~z⟩ between
input attribute vectors, then by replacing them with a kernel K(~x, ~z) you allow
your algorithm to work efficiently in a high dimensional feature space.
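Mercer's condition can also be checked empirically. The hedged sketch below builds the kernel matrix K_m of the Gaussian kernel from Equation 18 on a random sample and verifies that it is symmetric with no negative eigenvalues (up to numerical error); the sample, σ and all names are illustrative.

```python
# Empirical check of Mercer's condition for the Gaussian kernel of Equation 18.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_m[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for the rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(1).normal(size=(20, 5))    # 20 points in R^5
Km = gaussian_kernel_matrix(X)

print(np.allclose(Km, Km.T))                   # symmetric
print(np.linalg.eigvalsh(Km).min() >= -1e-10)  # positive semi-definite
```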
Switching gears for a moment and returning to our actual problem of classifying and
semantically retrieving similar images: now that we have an understanding of how the
SVM algorithm works, we can use it in our application. Recall that we have the
following dataset:
D_{10-classes} (10 categories, 100 images each):

Africa       Monuments    Animals      People       · · ·
image_1      image_1      image_1      image_1      · · ·
image_2      image_2      image_2      image_2      · · ·
image_3      image_3      image_3      image_3      · · ·
...          ...          ...          ...
image_100    image_100    image_100    image_100    · · ·
As in Section 3.1, we have our dataset and our query image, in this case denoted
by ~q, vectorized in order to perform mathematical operations seamlessly. To make
things even more explicit, imagine that our system (i.e. the MATLAB software ac-
companying this tutorial) or the algorithm (i.e. the SVM in this case) receives a query
image ~q from the user, and its job is to find and return to the user all the images
which are similar to the query ~q. For instance, if the query image ~q depicts a
monument, then the job of our system or algorithm is to return to the user all the
images depicting monuments from our dataset D.
In other words, we are treating our problem as a multiclass classification problem.
Generally speaking, there are two broad approaches with which we can resolve this issue
using the SVM algorithm. The first one, called the "one-vs-all" approach, involves
training a single classifier per class, with the samples of that class as positive samples
and all other samples as negatives. This strategy requires the base classifiers to
produce a real-valued confidence score for their decisions, rather than just a class label.
The second approach is called "one-vs-one", where one has to train \binom{k}{2} = k(k−1)/2
binary classifiers for a k-way multiclass problem. Each classifier receives the samples of a
pair of classes from the original training set, and must learn to distinguish these two
classes. At prediction time a voting scheme is applied: all k(k−1)/2 classifiers are applied
to an unseen sample, and the class that receives the highest number of "+1" predictions
is the one predicted by the combined classifier. This is the method that the accompanying
software utilizes for the SVM solution, and a hedged sketch of it is given below.
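For illustration only (the accompanying software is written in MATLAB), the Python sketch below expresses the same one-vs-one idea with scikit-learn, whose SVC trains one-vs-one binary classifiers internally for multiclass problems. The feature matrix is assumed to hold the concatenated vectors of Section 2, and all names are illustrative.

```python
# One-vs-one SVM retrieval: predict the class of the query feature vector and
# return the indices of every database image carrying that predicted label.
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is installed

def retrieve_similar(features, labels, query_feature):
    """features: (N, d) array; labels: (N,) class ids; query_feature: (d,) array."""
    clf = SVC(kernel="rbf", decision_function_shape="ovo")  # one-vs-one voting
    clf.fit(features, labels)
    predicted_class = clf.predict(query_feature.reshape(1, -1))[0]
    return np.where(labels == predicted_class)[0]
```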
Notice that if the SVM algorithm predicts the wrong class label for a query image
~q, then we end up retrieving and returning to the user all the images from the wrong
category/class. How can we compensate for this shortcoming? This is left as an
exercise for the reader.
References
[1] M. Pelillo, "Alhazen and the nearest neighbor rule," Pattern Recognition Letters,
Vol. 38, pp. 34–37, 2014.

[2] G. S. Sebestyen, Decision-Making Processes in Pattern Recognition, Macmillan,
New York, 1962.

[3] E. Fix and J. L. Hodges, "Discriminatory Analysis, Nonparametric Discrimination:
Consistency Properties," USAF School of Aviation Medicine, Randolph Field, Texas,
Report 4, 1951.

[4] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions
on Information Theory, Vol. 13, pp. 21–27, 1967.

[5] N. J. Nilsson, Learning Machines: Foundations of Trainable Pattern-Classifying
Systems, McGraw-Hill, New York, 1965.

[6] Alhazen (Ibn al-Haytham), widely considered to be one of the first theoretical
physicists, c. 965–c. 1040 CE.