Abstract—In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scale standard datasets of engineered features of up to one billion descriptors (BIGANN) and, supported by recent progress in convolutional neural networks (CNNs), on the CIFAR-10, MNIST, INRIA Holidays, Oxford 5K and Paris 6K datasets; the recent DEEP1B dataset, composed by one billion CNN-based features, has also been used. Experimental results show that, despite its simplicity, the proposed method obtains a very high performance that makes it superior to more complex state-of-the-art methods.

Index Terms—Retrieval, nearest neighbor search, hashing, SIFT, CNN.

I. INTRODUCTION

The proposed method obtains a very good performance in retrieval, even with very compact hash codes (e.g. also with 32 bits); it can be applied to hash local and global visual features, either engineered (e.g. SIFT) or learned (e.g. CNNs). The proposed approach greatly reduces the need of training data and the memory requirements of the quantizer, and obtains a retrieval performance similar or superior to more complex state-of-the-art approaches on standard large scale datasets. This makes it suitable, in terms of computational cost, for mobile devices, large-scale media analysis and content-based image retrieval in general.

The paper is organized as follows: in Sect. II we provide a thorough review of works related to visual feature hashing and indexing, highlighting the main differences of the proposed method. Our approach is presented in Sect. III, including a discussion about its computational complexity. Experimental results on several large scale standard datasets of visual features and images, and comparison with the current state-of-the-art, are reported in Sect. IV.
II. RELATED WORK

Experimental results on SIFT descriptors [8] have shown that unstructured quantizers (i.e. those based on k-means and hierarchical k-means clustering) provide significantly superior performance with respect to structured quantizers.

b) Scalar Quantization: Zhou et al. [9] have proposed an approach based on scalar quantization of SIFT descriptors. The median and the third quartile of the bins of each descriptor are computed and used as thresholds; hashing is then computed by coding the value of each bin of the descriptor with 2 bits, depending on this subdivision. The final hash code has a dimension of 256 bits, but only the first 32 bits are used to index the code in an inverted file; thus differences in the following bits, associated with the remaining bins of the SIFT descriptor, are not taken into account when querying the index. On the other hand, methods that follow this approach, like the three following ones, do not require training data. The method of [9] has been extended by Ren et al. [10], including an evaluation of the reliability of bits, depending on their quantization errors. Unreliable bits are then flipped when performing search, as a form of query expansion. To avoid using codebooks, in the context of memory-limited devices such as mobile phones, Zhou et al. [11] have proposed the use of scalable cascaded hashing (SCH), sequentially performing scalar quantization on the principal components, obtained using PCA, of SIFT descriptors. Chen and Hsieh [12] have recently proposed an approach that quantizes the differences of the bins of the SIFT descriptor, using the median computed on all the SIFT descriptors of a training set as a threshold.

c) Vector Quantization: Jégou et al. [1] have proposed to decompose the feature space into a Cartesian product of subspaces with lower dimensionality, which are quantized separately. This Product Quantization (PQ) method is efficient in solving the memory issues that arise when using vector quantization methods such as k-means, since it requires a much smaller number of centroids to obtain a code of the desired length. The method has obtained state-of-the-art results on a large scale SIFT features dataset, improving over methods such as SH [2] and Hamming Embedding [13]. This result is confirmed in the work of Chandrasekhar et al. [14], who have compared several compression schemes for SIFT features. However, its performance depends on the choice of the subspaces used.

The efficiency and efficacy of the Product Quantization method have led to the development of several variations and improvements. The idea of compositionality of the PQ approach has been further analyzed by Norouzi and Fleet [15], who have built upon it, proposing two variations of k-means: Orthogonal k-means and Cartesian k-means (ck-means). Ge et al. [16] have proposed another improvement of PQ, called OPQ, that minimizes quantization distortions w.r.t. space decomposition and quantization codebooks; He et al. [17] have proposed an affinity-preserving technique to approximate the Euclidean distance between codewords in the k-means method. Kalantidis and Avrithis [18] have presented a simple vector quantizer (LOPQ) which uses a local optimization over a rotation and a space decomposition, applying a parametric solution that assumes a normal distribution. More recently, Guo et al. [19] have improved over OPQ and LOPQ adding two quantization distortion properties to the Residual Vector Quantization (RVQ) model, which tries to restore quantization distortion errors instead of reducing them.

A few works have addressed the problem of indexing. Babenko and Lempitsky [20] have proposed an efficient similarity search method, called inverted multi-index (IMI); this approach generalizes the inverted index by replacing vector quantization inside inverted indices with product quantization, and building the multi-index as a multi-dimensional table (Multi-D-ADC). Another multi-index strategy has been proposed by Zheng et al. [21], where complementary features, i.e. binarized SIFT descriptors and local color features, are indexed in a coupled Multi-Index (c-MI), performing feature fusion at indexing level for content-based image retrieval. More recently, Babenko and Lempitsky [22] have addressed the problem of indexing CNN features, observing that IMI is inefficient at indexing such features, and proposing two extensions of IMI: the Non-Orthogonal Inverted Multi-Index (NO-IMI) and the Generalized Non-Orthogonal Inverted Multi-Index (GNO-IMI). Multi-scale matching of SIFT and CNN features binarized with LSH has been recently addressed in the work of Zheng et al. [23], proposing an indexing structure that uses compact pointers to associate the local features with the regional and global ones.

d) Neural Networks: Lin et al. [24] have proposed a deep learning framework to create hash-like binary codes for fast image retrieval. Hash codes are learned in a point-wise manner by employing a hidden layer for representing the latent concepts that dominate the class labels (when the data labels are available). This layer learns specific image representations and a set of hash-like functions.

Do et al. [25] have addressed the problem of learning binary hash codes for large scale image search using a deep model which tries to preserve similarity, balance and independence of images. Two sub-optimizations during the learning process allow to efficiently solve the binary constraints.

Guo and Li [26] have proposed a method to obtain the binary hash code of a given image using binarization of the CNN outputs of a certain fully connected layer.

Zhang et al. [27] have proposed a very deep neural network (DNN) model for supervised learning of hash codes (VDSH). They use a training algorithm inspired by the alternating direction method of multipliers (ADMM) [28]. The method decomposes the training process into independent layer-wise local updates through auxiliary variables.

Xia et al. [29] have proposed a hashing method for image retrieval which simultaneously learns a representation of images and a set of hash functions.

A deep learning framework for hashing of multimodal data has been proposed by Wang et al. [30], using a multimodal Deep Belief Network to capture correlations in high-level space during pre-training, followed by learning a cross-modal autoencoder in the fine-tuning phase.

Lin et al. [31] have proposed to use unsupervised two-step hashing of CNN features. In the first step Stacked Restricted Boltzmann Machines learn binary embedding functions, then fine-tuning is performed to retain the metric properties of the original feature space.
The method proposed in this paper belongs to the family of methods based on vector quantization; compared to recent hashing algorithms based on neural networks, this allows learning hash codes with a much reduced computational cost, without loss in performance. Unlike previous vector quantization approaches, it associates the features to multiple codebook words, reducing quantization errors due to wrong associations to nearby codewords. Another difference, considering in particular the family of methods based on Product Quantization, is the fact that codewords are associated to single bits of the hash and not to portions of the feature; this avoids the need to partition features so as to reflect the structure of the descriptor, as shown in [1].

III. THE PROPOSED METHOD

The proposed method exploits a novel version of the k-means vector quantization approach, introducing the possibility of assigning a visual feature to multiple cluster centers during the quantization process. This approach greatly reduces the number of required cluster centers, as well as the required training data, performing a sort of quantized codebook soft assignment for an extremely compact hash code of visual features. Table I summarizes the symbols used in the following.

TABLE I
NOTATION TABLE

  x_i   feature to be hashed
  C_i   generic centroid
  S_i   cluster
  k     number of centroids; length of the hash code
  C_j   centroid associated to the j-th bit of the hash code
  D     dimension of the feature

The first step of the computation is a typical k-means algorithm for clustering. Given a set of observations (x_1, x_2, ..., x_n), where each observation is a D-dimensional real vector, k-means clustering partitions the n observations into k (≤ n) sets S = {S_1, S_2, ..., S_k} so as to minimize the sum of the distances of each point in a cluster to its center. Its objective is to find:

\[
\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - C_i \rVert^2 \tag{1}
\]

This process is convergent (to some local optimum), but the quality of the local optimum strongly depends on the initial assignment. We use the k-means++ [32] algorithm for choosing the initial values, to avoid the poor clusterings sometimes found by the standard k-means algorithm.

A. Multi-k-means Hashing

K-means is typically used to compute the hash code of visual features in unstructured vector quantization, because it minimizes the quantization error by satisfying the two Lloyd optimality conditions [1]. In a first step a dictionary is learned over a training set, then hash codes of features are obtained by computing their distance from each cluster center: vectors are assigned to the nearest cluster center, whose code is used as hash code. Considering the case of 128-dimensional visual content descriptors, like SIFT or the FC7 layer of the VGG-M-128 CNN [33], this means that compressing them to 64-bit codes requires k = 2^64 centroids. In this case the computational cost of learning a k-means based quantizer becomes expensive in terms of memory and time because: i) there is the need of a quantity of training data that is several times larger than k, and ii) the execution time of the algorithm becomes unfeasible. Using hierarchical k-means (HKM) makes it possible to reduce the execution time, but the problems of memory usage and size of the required learning set affect this approach too. Since the quantizer is defined by the k centroids, the use of quantizers with a large number of centroids may not be practical or efficient: if a feature has dimension D, there is the need to store k × D values to represent the codebook of the quantizer. A possible solution to this problem is to reduce the length of the hash signature, but this typically affects retrieval performance negatively. The use of product k-means quantization, proposed originally by Jégou et al. [1], overcomes this issue.

In our approach, instead, we propose to compute a sort of soft assignment within the k-means framework, to obtain very compact signatures and a small quantizer, thus reducing its memory requirements, while maintaining a retrieval performance similar to that of [1].

The proposed method, called multi-k-means (abbreviated in the following as m-k-means), starts by learning a standard k-means dictionary as shown in Eq. 1, using a very small number k of centroids to maintain a low computational cost. Once we have obtained the centroids C_1, ..., C_k, the main difference resides in the assignment and creation of the hash code. Each centroid is associated to a specific bit of the hash code:

\[
\begin{cases}
j^{\text{th}}\ \text{bit} = 1 & \text{if}\ \lVert x - C_j \rVert \le \delta \\
j^{\text{th}}\ \text{bit} = 0 & \text{if}\ \lVert x - C_j \rVert > \delta
\end{cases} \tag{2}
\]

where x is the feature point and δ is a threshold measure given by

\[
\delta =
\begin{cases}
\left( \prod_{j=1}^{k} \lVert x - C_j \rVert \right)^{1/k} & \text{geometric mean} \\
\frac{1}{k} \sum_{j=1}^{k} \lVert x - C_j \rVert & \text{arithmetic mean} \\
n^{\text{th}}\ \text{nearest distance}\ \lVert x - C_j \rVert,\ \forall j = 1, \ldots, k
\end{cases} \tag{3}
\]

i.e. centroid j is associated to the j-th bit of the hash code of length k; the bit is set to 1 if the feature to be quantized is assigned to its centroid, and to 0 otherwise.

A feature can be assigned to more than one centroid using two main approaches:

i) m-k-means-t1: using Eq. (2) and one of the first two thresholds of Eq. (3). In this case the feature vector is considered as belonging to all the centroids from which its distance is below the threshold. Experiments have shown that the arithmetic mean is more effective than the geometric one, and all the experiments will report results obtained with it.
ii) m-k-means-n1: using Eq. (2) and the third threshold of Eq. (3), i.e. assigning the feature to a predefined number n of nearest centroids.

We also introduce two variants (m-k-means-t2 and m-k-means-n2) of the previous approaches, obtained by randomly splitting the training data into two groups and creating two different codebooks for each feature vector. The final hash code is given by the union of these two codes. This multiple assignment is similar in spirit to the soft assignment of features to visual words [34] and, similarly, it alleviates the problem of codeword ambiguity while reducing the quantization error.

Fig. 1 illustrates the quantization process and the resulting hash codes in three cases: one in which a vector is assigned to a variable number of centroids (m-k-means-t1), one in which a vector is assigned to a predefined number of centroids (m-k-means-n1) and one in which the resulting code is created by the union of two different codes created using two different codebooks (m-k-means-t2 and m-k-means-n2). In all cases the feature is assigned to more than one centroid. An evaluation of these two approaches is reported in Sect. IV.
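A minimal sketch of this construction, for illustration only and not the authors' implementation: the codebook is learned with k-means++ initialization as in [32] (here via scikit-learn, an assumption of ours), and the bits are set following Eqs. (2) and (3); the function names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(train_features, k=64, seed=0):
    # Learn the k centroids of Eq. (1); k is also the hash code length.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    km.fit(train_features)
    return km.cluster_centers_                  # shape (k, D)

def mkmeans_hash(x, centroids, variant="t1", n=32):
    # Hash a feature x to k bits following Eqs. (2)-(3):
    # "t1" sets bit j when ||x - C_j|| is below the arithmetic mean
    # of all distances; "n1" sets the bits of the n nearest centroids.
    d = np.linalg.norm(centroids - x, axis=1)   # ||x - C_j||, j = 1..k
    bits = np.zeros(len(centroids), dtype=np.uint8)
    if variant == "t1":
        bits[d <= d.mean()] = 1
    else:
        bits[np.argsort(d)[:n]] = 1
    return bits

For the m-k-means-t2 and m-k-means-n2 variants, the same procedure would presumably be run on the two codebooks learned from the two random splits of the training data, concatenating the resulting codes.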
Typically a multi-probe approach is used to solve the problem of ambiguous assignment to a codebook centroid (in case of vector quantization, as in the coarse quantization step of PQ [1]) or of quantization error (e.g. in case of scalar quantization, as in [9], [10]); this technique stems from the approach originally proposed in [35] to reduce the need of creating a large number of hash tables in LSH. The idea is that if an object is close to a query object q, but is not hashed to the same bucket as q, it is still likely hashed to a bucket that is near, i.e. to a bucket associated with a hash that has a small difference w.r.t. the hash of q. With this approach one or more bits of the query hash code are flipped to perform a query expansion, improving recall at the expense of computational cost and search time. In fact, if we choose to try all the hashes within a Hamming distance of 1, we have to create variations of the original hash of q by flipping all the bits of the hash, one at a time. This means that for a hash code of length k we need to repeat the query with k additional hashes. In the proposed method this need for multi-probe queries is greatly reduced, because of the possibility of assigning features to more than one centroid. For example, consider either Fig. 1 (top) or (middle): if a query point nearby f1 or f2 falls in the Voronoi cell of centroid C6, using standard k-means it could be retrieved only using a multi-probe query, while the proposed approach maintains the same hash code.
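To make the cost of multi-probe querying concrete, the k additional probes at Hamming radius 1 mentioned above can be enumerated as follows (an illustrative fragment in the style of the previous sketch; it shows the extra queries that the proposed method largely avoids):

def probes_radius_1(code):
    # Yield the original query hash plus the k variants obtained by
    # flipping one bit at a time (the k additional queries above).
    yield code
    for j in range(len(code)):
        flipped = code.copy()
        flipped[j] ^= 1
        yield flipped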
Fig. 1. Toy examples illustrating the proposed method: (top) features can be assigned (green line) to a variable number of nearest clusters (e.g. those with distances below the mean δ, i.e. m-k-means-t1); (middle) features can be assigned to a fixed number of clusters (e.g. the 2 nearest clusters, i.e. m-k-means-n1); (bottom) hash code created from two different codebooks (m-k-means-x2, where x can be either t or n). If a feature is assigned to a centroid the corresponding bit in the hash code is set to 1.

B. Computational Complexity

Let us consider a vector with dimensionality D and a desired hash code length of 64 bits. Standard k-means has an assignment complexity of kD, where k = 2^64, while the proposed approach instead needs k′ = 64 centroids, has a complexity of k′D and requires k′D floats to store the codebook. Product Quantization requires k* × D floats for the codebook and has an assignment complexity of k*D, where k* = k^{1/m}, using typically k* = 256 and m = 8 for a 64-bit code length [1]; in this case the cost of the proposed method is a quarter of the cost of PQ.
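A worked instance of this comparison, under the numbers just stated (D = 128 as for SIFT, 64-bit codes):

D, k_prime = 128, 64           # feature dimension, m-k-means centroids
k_star, m = 256, 8             # typical PQ settings for 64-bit codes [1]

mkm_codebook = k_prime * D     # 8,192 floats for the m-k-means codebook
pq_codebook = k_star * D       # 32,768 floats for the PQ codebook
                               # (m sub-codebooks, k* centroids of D/m dims each)

print(mkm_codebook / pq_codebook)   # 0.25, i.e. a quarter of the PQ cost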
TABLE III
DEEP1B DATASET CHARACTERISTICS

  descriptor dimensionality D        96
  # learning set vectors             358,480,000
  # database set vectors             1,000,000,000
  # query set vectors                10,000
  # nearest vectors for each query   1
IV. EXPERIMENTAL RESULTS

A. Datasets
BIGANN Dataset [1], [36] is a large-scale dataset commonly used to compare methods for visual feature hashing and approximate nearest neighbor search [1], [15], [16], [18], [20], [36], [37]. The dataset is composed by three different sets of SIFT and GIST descriptors, each one divided in three subsets: a learning set, a query set and a base set; each query has corresponding ground truth results in the base set, computed in an exhaustive way with Euclidean distance, ordered from the most similar to the most different. For SIFT1M and SIFT1B, query and base descriptors have been extracted from the INRIA Holidays images [38], while the learning set has been extracted from Flickr images. For GIST1M, query and base descriptors are from the INRIA Holidays and Flickr 1M datasets, while learning vectors are from [39]. In all cases query descriptors are from the query images of INRIA Holidays (see Fig. 2). The characteristics of the dataset are summarized in Table II.

Fig. 2. Sample images from the INRIA Holidays dataset. The left column shows the query images, the other columns show similar images.
DEEP1B Dataset [22] is a recent dataset produced using a deep CNN based on the GoogLeNet [40] architecture and trained on the ImageNet dataset [41]. Descriptors are extracted from the outputs of the last fully-connected layer, compressed using PCA to 96 dimensions, and L2-normalized. The characteristics of the dataset are summarized in Table III.
CIFAR-10 Dataset [42] consists of 60,000 colour images (32 × 32 pixels) in 10 classes, with 6,000 images per class (see Fig. 3). The dataset is split into training and test sets, composed by 50,000 and 10,000 images respectively. A retrieved image is considered relevant for the query if it belongs to the same class. This dataset has been used for ANN retrieval using hash codes in [24], [29].

Fig. 3. Sample images from the CIFAR-10 dataset.

MNIST Dataset [43] consists of 70,000 images of handwritten digits (28 × 28 pixels, see Fig. 4). The dataset is split into 60,000 training examples and 10,000 test examples. Similarly to CIFAR-10, a retrieved image is considered relevant if it belongs to the same class of the query. This dataset has been used for ANN retrieval in [24], [29].

Image retrieval datasets INRIA Holidays [38], Oxford 5K [44] and Paris 6K [45] are three datasets typically used to evaluate image retrieval systems. For each dataset a number of query images is given, with the associated ground truth. INRIA Holidays is composed by 1,491 images, of which 500 are used as queries; Oxford 5K is composed by 5,062 images with 55 query images, and Paris 6K is made of 6,412 images with 55 query images. We used the query images and ground truth provided for each dataset, adding 100,000 distractor images from Flickr 100K [44].

B. Evaluation Metrics

The performance of ANN retrieval on the BIGANN dataset is evaluated using recall@R, which is used in most of the results reported in the literature [1], [15], [16], [18], [20], [36]; it is, for varying values of R, the average rate of queries for which the 1-nearest neighbor is retrieved in the top R positions. In the case of R = 1 this metric coincides with precision@1.
The same measure has been used by the authors of the DEEP1B dataset [22].

Performance of image retrieval in CIFAR-10, MNIST, INRIA Holidays, Oxford 5K and Paris 6K is measured following the setup of [29], using Mean Average Precision:

\[
MAP = \frac{\sum_{q=1}^{Q} AveP(q)}{Q} \tag{4}
\]

where

\[
AveP = \int_{0}^{1} p(r)\,dr \tag{5}
\]

is the area under the precision-recall curve and Q is the number of queries.
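Both metrics admit a compact implementation; the following sketch is ours (argument conventions are hypothetical), computing recall@R as defined above and AveP as the area under the precision-recall curve of Eq. (5):

import numpy as np

def recall_at_R(ranked_ids, true_nn_ids, R):
    # Average rate of queries whose 1-nearest neighbor appears in the
    # top R results; coincides with precision@1 when R = 1.
    hits = [nn in ids[:R] for ids, nn in zip(ranked_ids, true_nn_ids)]
    return float(np.mean(hits))

def average_precision(ranked_relevance):
    # Area under the precision-recall curve for one query;
    # ranked_relevance is a 0/1 vector over the ranked results.
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((prec_at_i * rel).sum() / rel.sum())

# MAP of Eq. (4): mean of AveP over the Q queries, e.g.
# map_score = np.mean([average_precision(r) for r in relevance_per_query])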
C. Experimental Setup

a) SIFT1M, GIST1M, SIFT1B: Search is performed in a non-exhaustive way: Hamming distances between hash codes are used to select a subset of candidate neighbors from the database set (Table II). Euclidean distance is then used to re-rank the nearest feature points, calculating recall@R values in these subsets.

b) DEEP1B: We use the CNN features computed in [22], hashed to 64-bit codes. The searching process is done in a non-exhaustive way, using Hamming distances to reduce the subsets of candidates from the whole database set. After we have extracted a shortlist of candidates we perform a re-ranking step based on Euclidean distances and we calculate recall@R values.

c) CIFAR-10: We use features computed with the framework proposed in [24] (Fig. 5). The process is carried out in two steps: in the first step a supervised pre-training on the large-scale ImageNet dataset [46] is performed. In the second step fine-tuning of the network is performed, with the insertion of a latent layer that simultaneously learns domain specific feature representations and a set of hash-like functions. The authors used the pre-trained CNN model proposed by Krizhevsky et al. [46] and implemented in the Caffe CNN library [47]. In our experiments we use features coming from the FC_h layer (latent layer H), which has a size of 48 nodes.

Fig. 5. Framework used for CNN feature extraction.

d) MNIST: We use the LeNet CNN to compute our features on MNIST. This network architecture, developed by LeCun [48], was especially designed for recognizing handwritten digits, reading zip codes, etc. It is an 8-layer network with 1 input layer, 2 convolutional layers, 2 non-linear down-sampling layers, 2 fully connected layers and a Gaussian connected layer with 10 output classes. We used a modified version of LeNet [47] and we obtain features from the first fully connected layer.

We perform search with a non-exhaustive approach on both the CIFAR-10 and MNIST datasets. For each image we extract a 48-dimensional feature vector for CIFAR-10 and a 500-dimensional one for MNIST. The Euclidean distance

\[
d(A, B) = \sqrt{\sum_{i} (A_i - B_i)^2} \tag{6}
\]

where A_i and B_i are the components of the original feature vectors A and B, is used to re-rank the nearest visual features.

e) Holidays, Oxford 5K, Paris 6K: We used CNN features extracted using the 1024-d average pooling layer of GoogLeNet [40], which in initial experiments has proven to be more effective than the FC7 layer of VGG [49] used in [50]. When testing on a dataset, training is performed using the other two datasets. Features have been hashed to 64-bit binary codes, a length that has proved to be the best compromise between compactness and representativeness, and allows us to compare with a number of competing approaches.
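The non-exhaustive search described in a) and b), i.e. a Hamming-distance shortlist followed by Euclidean re-ranking with Eq. (6), can be sketched as follows (our own minimal illustration; it assumes the hash codes are packed into 64-bit integers):

import numpy as np

def search(query_code, query_feat, db_codes, db_feats, shortlist=10000):
    # Coarse step: Hamming distance as popcount of the XOR of the codes.
    xor = np.bitwise_xor(db_codes, np.uint64(query_code))
    hamming = np.array([bin(int(v)).count("1") for v in xor])
    candidates = np.argsort(hamming)[:shortlist]
    # Fine step: re-rank the shortlist with the Euclidean distance of Eq. (6).
    d = np.linalg.norm(db_feats[candidates] - query_feat, axis=1)
    return candidates[np.argsort(d)]   # candidate ids, best first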
D. Results on BIGANN: SIFT1M, GIST1M

In this set of experiments the proposed approach and its variants are compared on the SIFT1M (Fig. 6) and GIST1M (Fig. 7) datasets against several methods presented in Sect. II: Product Quantization (ADC and IVFADC) [1], PQ-RO [19], PQ-RR [19], Cartesian k-means [15], OPQ-P [16], [51], OPQ-NP [16], [51], LOPQ [18], a non-exhaustive adaptation of OPQ [16] called I-OPQ [18], RVQ [52], RVQ-P [19] and RVQ-NP [19].
ADC (Asymmetric Distance Computation) is characterized by the number of sub-vectors m and the number of quantizers per sub-vector k*, and produces a code of length m × log2 k*.

IVFADC (Inverted File with Asymmetric Distance Computation) is characterized by the codebook size k′ (number of centroids associated to each quantizer), the number of neighbouring cells w visited during the multiple assignment, the number of sub-vectors m and the number of quantizers per sub-vector k*, which is in this case fixed to k* = 256. The length of the final code is given by m × log2 k*.

PQ-RO [19] is the Product Quantization approach with data projection obtained by randomly ordering the dimensions.

PQ-RR [19] is the Product Quantization approach with data projection by both PCA and random rotation.

Cartesian k-means (ck-means) [15] models region centers as additive combinations of subcenters. Let m be the number of sub-codebooks, each with h subcenters; then the total number of model centers is k = h^m, but the total number of subcenters is h × m, and the number of bits of the signature is m × log2 h.

OPQ-P [16], [51] is the parametric version of Optimized Product Quantization (OPQ), which assumes a parametric Gaussian distribution of the features and performs space decomposition using an orthonormal matrix computed from the covariance matrix of the data.

OPQ-NP [16], [51] is the non-parametric version of OPQ, which does not assume any data distribution and alternatively optimizes sub-codebooks and space decomposition.

LOPQ (Locally Optimized Product Quantization) [18] is a vector quantizer that combines low distortion with fast search, applying a local optimization over rotation and space decomposition.

I-OPQ [18] is a non-exhaustive adaptation of OPQ (Optimized Product Quantization [16]) which uses either OPQ-P or OPQ-NP global optimization.

RVQ [52] approximates the quantization error by another quantizer instead of discarding it. In this method several stage-quantizers, each one with its corresponding stage-codebook, are connected sequentially. Each stage-quantizer approximates the residual vector of the preceding stage by one of the centroids of its stage-codebook and generates a new residual vector for the next stage.

RVQ-P [19] is a parametric version of RVQ, where stage-codebooks and space decomposition of RVQ are optimized using SVD.

RVQ-NP [19] is a non-parametric version of RVQ, using the same techniques of RVQ-P, but optimizing a single space decomposition for all the stages.

The parameters of the proposed methods are set as follows: for m-k-means-t1 we use as threshold the arithmetic mean of the distances between feature vectors and centroids to compute the hash code; m-k-means-n1 creates the hash code by setting to 1 the positions corresponding to the first 32 (SIFT1M) or first 48 (GIST1M) nearest centroids of each feature; m-k-means-t2 and m-k-means-n2 create two different sub hash codes for each feature by splitting the training phase into two parts, and combine these two sub-codes into one single code to create the final signature. Since the splitting of the training data is random, these experiments are averaged over a set of 10 runs.

Fig. 6. Recall@R on SIFT1M. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=32, m-k-means-t2 and m-k-means-n2 with n=32), the Product Quantization method (PQ ADC and PQ IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ), a Locally optimized product quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

Fig. 7. Recall@R on GIST1M. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=48, m-k-means-t2 and m-k-means-n2 with n=48), the Product Quantization method (ADC and IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ) [18], a Locally optimized product quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

The proposed method, in all its variants, obtains the best results when considering the more challenging values of recall@R, i.e. with a small number of nearest neighbors, like 1, 10 and 100. When R goes to 1000 and 10,000 it still obtains the best results, and in the case of SIFT1M it is on par with ck-means [15]. Considering GIST1M, the method consistently outperforms all the other methods for all values of R, except for R=1 where RVQ-P [19] is better.
TABLE IV
MAP RESULTS ON CIFAR-10 AND MNIST. COMPARISON BETWEEN OUR METHOD (M-K-MEANS-t1, M-K-MEANS-n1 WITH n=24, M-K-MEANS-t2 AND M-K-MEANS-n2 WITH n=24) AND KSH [53], ITQ-CCA [54], MLH [55], BRE [56], CNNH [29], CNNH+ [29], KEVINNET [24], LSH [57], SH [2], ITQ [54]. RESULTS FROM [7], [24], [29].
V. CONCLUSION

We have presented a method for compact hash code computation of visual features, based on multiple assignment to k-means centroids. The multiple assignment reduces the effects of wrong associations to nearby codewords, typically due to quantization errors, and results in better approximated nearest neighbour estimation using Hamming distance. The method has been tested on large scale datasets of engineered (SIFT and GIST) and learned (deep CNN) features, obtaining results that outperform or are comparable to more complex state-of-the-art approaches. The m-k-means-n1 variant typically performs better than m-k-means-t1, especially when dealing with modern CNN features.

ACKNOWLEDGMENT

This work is partially supported by the “Social Museum and Smart Tourism” project (CTN01 00034 231545).

This research is based upon work supported [in part] by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA contract number 2014-14071600011. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

REFERENCES

[1] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.
[2] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. of NIPS, 2009.
[3] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, “Spectral hashing with semantically consistent graph for image indexing,” IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.
[4] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon, “Spherical hashing: Binary code embedding with hyperspheres,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2304–2316, 2015.
[5] Z. Jin, C. Li, Y. Lin, and D. Cai, “Density sensitive hashing,” IEEE Transactions on Cybernetics, vol. 44, no. 8, pp. 1362–1371, 2014.
[6] S. Du, W. Zhang, S. Chen, and Y. Wen, “Learning flexible binary code for linear projection based hashing with random forest,” in Proc. of ICPR, 2014.
[7] Y. Lv, W. W. Y. Ng, Z. Zeng, D. S. Yeung, and P. P. K. Chan, “Asymmetric cyclical hashing for large scale image retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1225–1235, 2015.
[8] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
[9] W. Zhou, Y. Lu, H. Li, and Q. Tian, “Scalar quantization for large scale image search,” in Proc. of ACM MM, 2012.
[10] G. Ren, J. Cai, S. Li, N. Yu, and Q. Tian, “Scalable image search with reliable binary code,” in Proc. of ACM MM, 2014.
[11] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, “Towards codebook-free: Scalable cascaded hashing for mobile image search,” IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 601–611, 2014.
[12] C.-C. Chen and S.-L. Hsieh, “Using binarization and hashing for efficient SIFT matching,” Journal of Visual Communication and Image Representation, vol. 30, pp. 86–93, 2015.
[13] M. Jain, H. Jégou, and P. Gros, “Asymmetric Hamming embedding: Taking the best of our bits for large scale image search,” in Proc. of ACM MM, 2011.
[14] V. Chandrasekhar, M. Makar, G. Takacs, D. Chen, S. S. Tsai, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, and B. Girod, “Survey of SIFT compression schemes,” in Proc. of WMPP, 2010.
[15] M. Norouzi and D. Fleet, “Cartesian k-means,” in Proc. of CVPR, 2013.
[16] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” in Proc. of CVPR, 2013.
[17] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in Proc. of CVPR, 2013.
[18] Y. Kalantidis and Y. Avrithis, “Locally optimized product quantization for approximate nearest neighbor search,” in Proc. of CVPR, 2014.
[19] D. Guo, C. Li, and L. Wu, “Parametric and nonparametric residual vector quantization optimizations for ANN search,” Neurocomputing, 2016.
[20] A. Babenko and V. Lempitsky, “The inverted multi-index,” in Proc. of CVPR, 2012.
[21] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Packing and padding: Coupled multi-index for accurate image retrieval,” in Proc. of CVPR, 2014.
[22] A. Babenko and V. Lempitsky, “Efficient indexing of billion-scale datasets of deep descriptors,” in Proc. of CVPR, 2016.
[23] L. Zheng, S. Wang, J. Wang, and Q. Tian, “Accurate image search with multi-scale contextual evidences,” International Journal of Computer Vision, vol. 120, no. 1, pp. 1–13, Oct. 2016.
[24] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, “Deep learning of binary hash codes for fast image retrieval,” in Proc. of CVPR, 2015.
[25] T.-T. Do, A.-Z. Doan, and N.-M. Cheung, “Discrete hashing with deep neural network,” arXiv preprint arXiv:1508.07148, 2015.
[26] J. Guo and J. Li, “CNN based hashing for image retrieval,” arXiv preprint arXiv:1509.01354, 2015.
[27] Z. Zhang, Y. Chen, and V. Saligrama, “Supervised hashing with deep neural networks,” arXiv preprint arXiv:1511.04524, 2015.
[28] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proc. of AAAI, 2014.
[30] D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1404–1416, 2015.
[31] J. Lin, O. Morère, J. Petta, V. Chandrasekhar, and A. Veillard, “Tiny descriptors for image retrieval with unsupervised triplet hashing,” in Proc. of DCC, 2016.
[32] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proc. of ACM-SIAM SODA, 2007.
[33] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in Proc. of BMVC, 2014.
[34] J. van Gemert, C. Veenman, A. Smeulders, and J.-M. Geusebroek, “Visual word ambiguity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–1283, 2010.
[35] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: Efficient indexing for high-dimensional similarity search,” in Proc. of VLDB, 2007.
[36] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in one billion vectors: re-rank with source coding,” in Proc. of ICASSP, 2011.
[37] M. Norouzi, A. Punjani, and D. Fleet, “Fast exact search in Hamming space with multi-index hashing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1107–1119, 2014.
[38] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. of ECCV, 2008.
[39] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. of CVPR, 2015.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. of CVPR, 2009.
[42] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
[43] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of handwritten digits,” 1998.
[44] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. of CVPR, 2007.
[45] ——, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. of CVPR, 2008.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. of NIPS, 2012.