ReportCS336 M11 KHCL
Final Project:
IMAGE RETRIEVAL
Table of Contents
1. Introduction
2. Survey
3. Method and Model
   3.1 Simple Image Retrieval (SIR)
      3.1.1 Architecture and Transfer
      3.1.2 Similarity
   3.2 Deep Local Feature (DELF)
      3.2.1 Feature Extraction
      3.2.2 Keypoint Selection
      3.2.3 Dimensionality Reduction
   3.3 CNN Image Retrieval with No Human Annotation (CNN-IRwNHA)
      3.3.1 Fully Convolutional Network
      3.3.2 Generalized-mean pooling and image descriptor
      3.3.3 Siamese learning and loss function
4. Evaluation
5. Design API
6. Conclusion
7. References
1. Introduction:
Image Retrieval is a fundamental task in computer vision, since it is directly related to
various practical applications such as object detection, visual place recognition and
product recognition. The last decades have witnessed tremendous advances in image
retrieval systems - from handcrafted features and indexing algorithms to, more recently,
methods based on convolutional neural networks (CNNs) for global descriptor learning.
With the growth of image retrieval systems, e-commerce, and online platforms, image retrieval applications have become an increasingly common part of our daily life. For example, Amazon, Alibaba, Myntra, etc. heavily utilize image retrieval to recommend what they consider the most suitable products based on what we have just viewed.
In this report, we conduct research and experiment with different methods to build an Image Retrieval system. First, the Simple Image Retrieval (SIR) method uses an existing CNN architecture to extract features from the input images and compares the feature vectors by cosine similarity. For the other two methods, we examine state-of-the-art approaches on the Oxford5k dataset whose source code is publicly available on GitHub: Deep Local Feature (DELF) and CNN Image Retrieval with No Human Annotation (CNN-IRwNHA), an unsupervised learning method.
After implementing these methods, we evaluate them against the ground truth of the Oxford5k dataset, using mAP (mean Average Precision) to draw conclusions about their pros and cons in finding relevant images. Finally, we deploy our system as a web application. Because of limited resources, our demo version is temporarily hosted on Google Colab for each demo run.
2. Survey:
Before building our Image Retrieval system, we surveyed many sources. First of all, we searched Google with the keyword "image retrieval" to get an overview of the field. Secondly, we read several scientific papers to understand how image retrieval systems work, especially systems based on convolutional neural networks (CNNs). We surveyed the paper Deep Learning for Instance Retrieval: A Survey [1] to identify the best current methods, and we also investigated reputable websites in this field such as Papers with Code. Finally, we looked for the source code of the systems chosen from the papers according to our needs, then tried executing it.
In the end, we chose three methods that meet our needs. The first is a CNN-based approach that uses ResNet152 as the backbone of our SIR method. To increase generalizability and to experiment with a variety of approaches, our team decided to choose one of the most popular supervised methods and one of the most popular unsupervised methods. For the supervised one, we selected DELF [2], which has the third-highest score among methods of its kind; in addition, it has official TensorFlow source code, which makes it easy to use. For the unsupervised one, we found the paper Fine-tuning CNN Image Retrieval with No Human Annotation [3].
3.1.2 Similarity:
We use Cosine Similarity to measure the similarity between feature vectors. The formula is defined as follows:
$$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$
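As an illustration, the following minimal NumPy sketch computes Eq. (1) for two feature vectors (the example vectors are hypothetical):

```python
# Minimal sketch of cosine similarity (Eq. 1) between two feature vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example with two hypothetical 4-dimensional feature vectors.
a = np.array([0.2, 0.9, 0.1, 0.4])
b = np.array([0.3, 0.8, 0.0, 0.5])
print(cosine_similarity(a, b))  # values near 1.0 indicate similar images
```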
3.2 Deep Local Feature (DELF):
DELF extracts attentive deep local features that are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, DELF also introduces an attention mechanism for keypoint selection, which shares most network layers with the descriptor.
3.2.1 Feature Extraction:
In this stage, DELF extracts dense features from an image by applying a fully convolutional network (FCN), constructed from the feature extraction layers of a CNN trained with a classification loss. The FCN is taken from the ResNet50 model, using the output of the conv4_x convolutional block. To handle scale changes, DELF explicitly constructs an image pyramid and applies the FCN to each level independently. The obtained feature maps are regarded as a dense grid of local
descriptors. Features are localized based on their receptive fields, which can be
computed by considering the configuration of convolutional and pooling layers of the FCN.
DELF uses the pixel coordinates of the center of the receptive field as the feature location.
The receptive field size for the image at the original scale is 291 × 291. Using the image
pyramid, DELF can obtain features that describe image regions of different sizes.
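The following PyTorch sketch illustrates this multi-scale dense extraction under stated assumptions: the official DELF implementation is in TensorFlow, the pyramid scales here are illustrative, and the receptive-field centers are computed in a simplified way.

```python
# Sketch of DELF-style multi-scale dense feature extraction (simplified;
# not the official TensorFlow implementation).
import torch
import torch.nn.functional as F
import torchvision.models as models

# Backbone: ResNet50 up to the conv4_x block ("layer3" in torchvision).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
fcn = torch.nn.Sequential(*list(resnet.children())[:7]).eval()

SCALES = [0.25, 0.5, 1.0, 1.41, 2.0]  # illustrative image-pyramid scales
STRIDE, RF = 16, 291                  # output stride and receptive field of conv4_x

def extract_dense_features(image: torch.Tensor):
    """image: (3, H, W) float tensor; returns stacked descriptors and locations."""
    feats, locs = [], []
    for s in SCALES:
        scaled = F.interpolate(image.unsqueeze(0), scale_factor=s,
                               mode="bilinear", align_corners=False)
        with torch.no_grad():
            fmap = fcn(scaled).squeeze(0)      # (C, h, w) grid of local descriptors
        c, h, w = fmap.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        # Approximate receptive-field centers, mapped back to original coordinates.
        centers = (torch.stack([ys, xs], dim=-1).float() * STRIDE + RF / 2) / s
        feats.append(fmap.reshape(c, -1).T)    # (h*w, C) descriptors
        locs.append(centers.reshape(-1, 2))
    return torch.cat(feats), torch.cat(locs)
```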
3.2.2 Keypoint Selection:
After extracting features, DELF has a technique to effectively select a subset of the
features. Since a substantial part of the densely extracted features are irrelevant to the
recognition module and likely to add clutter, distracting the retrieval process, keypoint
selection is important for both accuracy and computational efficiency of retrieval systems.
Keypoint selection is performed by training a landmark classifier with an attention mechanism that explicitly measures relevance scores for local feature descriptors. To train this function, features are pooled by a weighted sum, where the weights are predicted by the attention network. The training is formulated as follows. Denote by $\mathbf{f}_n \in \mathbb{R}^d$, $n = 1, \dots, N$, the $d$-dimensional features to be learned jointly with the attention model. The goal is to learn a score function $\alpha(\mathbf{f}_n; \theta)$ for each feature, where $\theta$ denotes the parameters of the function $\alpha(\cdot)$. The output logit $\mathbf{y}$ of the network is generated by a weighted sum of the feature vectors, which is given by

$$\mathbf{y} = \mathbf{W} \left( \sum_n \alpha(\mathbf{f}_n; \theta)\, \mathbf{f}_n \right) \qquad (2)$$

where $\mathbf{W} \in \mathbb{R}^{M \times d}$ represents the weights of the final fully connected layer of the CNN trained to predict $M$ classes. For training, the cross-entropy loss is used, which is given by

$$\mathcal{L} = -\mathbf{y}^{*} \cdot \log \frac{\exp(\mathbf{y})}{\mathbf{1}^{T} \exp(\mathbf{y})} \qquad (3)$$

where $\mathbf{y}^{*}$ is the ground truth in one-hot representation and $\mathbf{1}$ is the all-ones vector. The parameters in the score function $\alpha(\cdot)$ are trained by backpropagation, where the gradient is given by

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_n \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \alpha_n} \frac{\partial \alpha_n}{\partial \theta} = \sum_n \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{W} \mathbf{f}_n \frac{\partial \alpha_n}{\partial \theta} \qquad (4)$$
where the backpropagation of the output score $\alpha_n \equiv \alpha(\mathbf{f}_n; \theta)$ with respect to $\theta$ is the same as in a standard multi-layer perceptron.
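The following PyTorch sketch is a simplified illustration of this attention-based pooling trained with the cross-entropy loss of Eq. (3); it is not the official DELF code, and the hidden layer size is an assumption.

```python
# Simplified sketch of attention-weighted feature pooling (Eqs. 2-3).
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        # Two-layer score function alpha(f; theta); softplus keeps scores non-negative.
        self.score = nn.Sequential(
            nn.Conv2d(d, 512, kernel_size=1), nn.ReLU(),
            nn.Conv2d(512, 1, kernel_size=1), nn.Softplus())
        self.W = nn.Linear(d, num_classes, bias=False)  # final classifier W

    def forward(self, fmap):              # fmap: (B, d, h, w) dense features
        alpha = self.score(fmap)          # (B, 1, h, w) relevance scores
        pooled = (alpha * fmap).sum(dim=(2, 3))  # weighted sum over locations
        return self.W(pooled)             # logits y = W * sum_n alpha_n * f_n

# One training step; gradients w.r.t. theta flow through alpha as in Eq. (4).
model = AttentionPooling(d=1024, num_classes=10)
logits = model(torch.randn(2, 1024, 7, 7))
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7]))
loss.backward()
```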
Figure 5: The architecture of the network with the contrastive loss used at training time.
3.3.2 Generalized-mean pooling and image descriptor:
The feature vector finally consists of a single value per feature map, the generalized-mean (GeM) activation, and its dimensionality is equal to $K$. The pooling parameter $p_k$ can be manually set or learned, since this operation is differentiable and can be part of the backpropagation. There is a different pooling parameter per feature map, but it is also possible to use a shared one, in which case $p_k = p, \forall k \in [1, K]$. The last network layer comprises an L2-normalization layer: the vector $\mathbf{f}$ is L2-normalized so that the similarity between two images can be evaluated with an inner product. In the rest of this report, the GeM vector refers to this L2-normalized vector $\mathbf{f}$ and constitutes the image descriptor.
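As an illustration, the following PyTorch sketch implements GeM pooling as defined in [3], $f_k^{(g)} = \left( \frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} x^{p_k} \right)^{1/p_k}$, with a single shared, learnable parameter $p$ (the initial value $p = 3$ is an illustrative choice):

```python
# Sketch of generalized-mean (GeM) pooling with a shared learnable p.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learned via backpropagation
        self.eps = eps

    def forward(self, x):  # x: (B, K, H, W) feature maps from the FCN
        x = x.clamp(min=self.eps).pow(self.p)                # elementwise x^p
        x = F.avg_pool2d(x, x.shape[-2:]).pow(1.0 / self.p)  # generalized mean
        return F.normalize(x.flatten(1), dim=1)  # L2-normalized GeM descriptor
```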
3.3.3 Siamese learning and loss function:
At the training stage, a siamese architecture is used to train a two-branch network. Each branch is a clone of the other, meaning that they share the same parameters. The training input consists of image pairs $(i, j)$ and labels $Y(i, j) \in \{0, 1\}$ declaring whether a pair is non-matching (label 0) or matching (label 1). We employ the contrastive loss that acts on matching and non-matching pairs and is defined as

$$\mathcal{L}(i, j) = \begin{cases} \frac{1}{2} \left\| \bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j) \right\|^2, & \text{if } Y(i, j) = 1 \\ \frac{1}{2} \left( \max\left\{ 0, \tau - \left\| \bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j) \right\| \right\} \right)^2, & \text{if } Y(i, j) = 0 \end{cases} \qquad (6)$$

where $\bar{\mathbf{f}}(i)$ is the L2-normalized GeM vector of image $i$, and $\tau$ is a margin parameter defining when non-matching pairs have a large enough distance to be ignored by the loss.
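A minimal PyTorch sketch of the contrastive loss in Eq. (6), acting on a batch of descriptor pairs (the margin value below is illustrative):

```python
# Sketch of the contrastive loss (Eq. 6) over L2-normalized GeM descriptors.
import torch

def contrastive_loss(f_i, f_j, y, tau: float = 0.7):
    # f_i, f_j: (B, K) descriptors; y: (B,) with 1 = matching, 0 = non-matching.
    d = torch.norm(f_i - f_j, dim=1)                     # Euclidean distance
    loss_match = 0.5 * d.pow(2)                          # pull matching pairs together
    loss_non = 0.5 * torch.clamp(tau - d, min=0).pow(2)  # push non-matching apart
    return torch.where(y == 1, loss_match, loss_non).mean()
```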
4. Evaluation:
ROxford5k dataset (Revisited Oxford) [4]:
The authors revisit and address issues with the Oxford 5k and Paris 6k image retrieval benchmarks. New annotation for both datasets is created with extra attention to the reliability of the ground truth, and three new evaluation protocols of varying difficulty are introduced. They additionally introduce 15 new challenging queries per dataset and a new set of 1M hard distractors.
Oxford 5k dataset:

Method        mAP (%)
SIR            23.34
DELF           66.69
CNN-IRwNHA     82.09
Considering the evaluation results, the best method for our IR system is CNN-IRwNHA. We call SIR a naive method because it simply uses CNN architectures to extract global features; as a result, it misses local features that the other two methods capture well. In addition, its backbone is pretrained on the ImageNet dataset, which is intended for object detection and classification, so its features are not well suited to the retrieval problem. DELF improves accuracy, but its rather long execution time (~2 min/query) slows down the IR system. CNN-IRwNHA achieved the best results in our experiments in both accuracy and time.
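For reference, the following minimal sketch shows how mAP can be computed from ranked retrieval results and ground-truth relevant images (a simplification that ignores the junk/ok distinctions of the Oxford protocol):

```python
# Simplified mean Average Precision (mAP) over a set of queries.
def average_precision(ranked_ids, relevant_ids):
    relevant, hits, precision_sum = set(relevant_ids), 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_rankings, all_relevant):
    aps = [average_precision(r, g) for r, g in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)
```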
5. Design API:
In this project, we build a RESTful API using the Flask framework for the server side and the ReactJS framework for the client side.
However, we cannot deploy both the back end and the front end on a free hosting server, because our models require a GPU for the feature extraction step. Therefore, we use Google Colab (a GPU is available on the free plan) to host the back-end server, and we simply paste its link into the client website, which has already been deployed, to connect it with the back-end server.
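A minimal Flask sketch of how such a retrieval endpoint could look is given below; the route name and the stubbed helpers are hypothetical, and the actual demo is run from Demo_FinalProject.ipynb.

```python
# Hypothetical sketch of the retrieval endpoint, not the project's actual code.
from flask import Flask, request, jsonify

app = Flask(__name__)

def extract_features(image_bytes):
    # Stub: the real system runs the CNN feature extractor on a GPU here.
    return [0.0]

def rank_database(features, top_k=10):
    # Stub: the real system compares the query descriptor against the indexed
    # database descriptors and returns the top-k image ids.
    return ["img_001.jpg"][:top_k]

@app.route("/retrieve", methods=["POST"])
def retrieve():
    query_bytes = request.files["image"].read()  # query image uploaded by client
    features = extract_features(query_bytes)
    return jsonify({"results": rank_database(features)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```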
Experimental images
6. Conclusion:
We have fully designed our Image Retrieval system and evaluated the three methods used to build it. In practice, the unsupervised method CNN-IRwNHA demonstrated the best performance. Using the unsupervised method, we found that it can fine-tune its parameters by itself to improve the performance of our Image Retrieval system, which saved us a great deal of time that would otherwise have been spent working out how to upgrade the system. In contrast, the other methods need more time for parameter tuning; nevertheless, their performance remains reasonable, at 23.34 mAP for the SIR method and 66.69 mAP for the DELF method, respectively (see Section 4).
When building the API, we struggled to deploy our system on a hosting server, so we decided to build it as a web app. With this report, we provide a file called "Demo_FinalProject.ipynb" containing detailed instructions for using our Image Retrieval system; however, the system can only be run via Google Colab. The video below includes instructions for using our system and shows our experiments.
(Link video: https://youtu.be/HQFgYrPgjX4)
7. References:
[1] W. Chen, Y. Liu, W. Wang, E. M. Bakker, T. Georgiou, P. Fieguth, L. Liu, and M. S. Lew, "Deep Learning for Instance Retrieval: A Survey", 2022
[2] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, "Large-Scale Image Retrieval with Attentive Deep Local Features", ICCV, 2017
[3] F. Radenović, G. Tolias, and O. Chum, "Fine-tuning CNN Image Retrieval with No Human Annotation", TPAMI, 2018
[4] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, "Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking", CVPR, 2018
___________________________________End___________________________________