


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2017.2697824, IEEE Transactions on Multimedia

Compact Hash Codes for Efficient Visual Descriptors Retrieval in Large Scale Databases
Simone Ercoli, Marco Bertini, Member, IEEE, and Alberto Del Bimbo, Member, IEEE

Abstract—In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scale standard datasets of engineered features of up to one billion descriptors (BIGANN) and, supported by recent progress in convolutional neural networks (CNNs), the CIFAR-10, MNIST, INRIA Holidays, Oxford 5K and Paris 6K datasets; also the recent DEEP1B dataset, composed of one billion CNN-based features, has been used. Experimental results show that, despite its simplicity, the proposed method obtains a very high performance that makes it superior to more complex state-of-the-art methods.

Index Terms—Retrieval, nearest neighbor search, hashing, SIFT, CNN.

I. INTRODUCTION

Efficient nearest neighbor (NN) search is one of the main issues in large scale information retrieval for multimedia and computer vision tasks. Even methods designed for multidimensional indexing obtain a performance that is only comparable to exhaustive search when they are used to index high dimensional features [1]. A typical solution to avoid an exhaustive comparison is to employ methods that perform approximate nearest neighbor (ANN) search, that is, finding the elements of a database that have a high probability of being nearest neighbors. ANN algorithms have typically been evaluated based on the trade-off between efficiency and search quality, but the advent of web-scale datasets has also introduced the problems of memory usage and speed. For these reasons the most recent approaches typically use feature hashing to reduce the dimensionality of the descriptors, and compute Hamming distances over these hash codes. These methods generally use inverted files, e.g. implemented using hash tables, to index the database, and require hash codes with a length of several tens of bits to obtain a reasonable retrieval performance; a typical length is 64 bits. Their application to systems with relatively limited memory (e.g. mobile devices still have 1-2 GB RAM only), or to systems that involve large-scale media indexing and need to maintain the index in main memory to provide an adequate response time, requires the use of compact hash codes.

In this paper we present a novel method for feature hashing, based on multiple k-means assignments. This method is unsupervised, requires a very limited codebook size and obtains a very good performance in retrieval, even with very compact hash codes (e.g. also with 32 bits); it can be applied to hash local and global visual features, either engineered (e.g. SIFT) or learned (e.g. CNNs). The proposed approach greatly reduces the need of training data and the memory requirements of the quantizer, and obtains a retrieval performance similar or superior to more complex state-of-the-art approaches on standard large scale datasets. This makes it suitable, in terms of computational cost, for mobile devices, large-scale media analysis and content-based image retrieval in general.

The paper is organized as follows: in Sect. II we provide a thorough review of works related to visual feature hashing and indexing, highlighting the main differences of the proposed method. Our approach is presented in Sect. III, including a discussion of its computational complexity. Experimental results on several large scale standard datasets of visual features and images, and a comparison with the current state-of-the-art, are reported and discussed in Sect. IV. Finally, conclusions are drawn in Sect. V.

S. Ercoli, M. Bertini and A. Del Bimbo are with the Media Integration and Communication Center, University of Florence, Italy (e-mail: simone.ercoli@unifi.it; marco.bertini@unifi.it; alberto.delbimbo@unifi.it).
Manuscript received August 26, 2016; revised December 23, 2016.

II. PREVIOUS WORKS

Previous works on visual feature hashing can be divided into methods based on hashing functions, scalar quantization, vector quantization and, more recently, neural networks.

a) Hashing functions: Weiss et al. [2] have proposed to treat the problem of hashing as a particular form of graph partitioning, in their Spectral Hashing (SH) algorithm. Li et al. [3] have improved the application of SH to image retrieval by optimizing the graph Laplacian, built from pairwise similarities of images during the hash function learning process, without requiring a distance metric to be learned in a separate step. Heo et al. [4] have proposed to encode high-dimensional data points using hyperspheres instead of hyperplanes; Jin et al. [5] have proposed a variation of LSH, called Density Sensitive Hashing, that does not use random projections but instead uses projective functions that are more suitable for the distribution of the data. Du et al. [6] have proposed the use of Random Forests to perform linear projections, along with a metric that is not based on Hamming distance. Lv et al. [7] address the problem of large scale image retrieval by learning two hashes in their Asymmetric Cyclical Hashing (ACH) method: a short one (k bits) for the images of the database and a longer one (mk bits) for the query. Hashes are obtained using similarity preserving Random Fourier Features, and computing the Hamming distance between the long query hash and the cyclically m-times concatenated compact hash code of the stored image.

Paulevé et al. [8] have compared structured quantization algorithms with unstructured quantizers (i.e. k-means and


hierarchical k-means clustering). Experimental results on SIFT descriptors have shown that unstructured quantizers provide significantly superior performance with respect to structured quantizers.

b) Scalar Quantization: Zhou et al. [9] have proposed an approach based on scalar quantization of SIFT descriptors. The median and the third quartile of the bins of each descriptor are computed and used as thresholds; hashing is then computed by coding the value of each bin of the descriptor with 2 bits, depending on this subdivision. The final hash code has a dimension of 256 bits, but only the first 32 bits are used to index the code in an inverted file; thus differences in the following bits, associated with the remaining bins of the SIFT descriptor, are not taken into account when querying the index. On the other hand, methods that follow this approach, like the three following ones, do not require training data. The method of [9] has been extended by Ren et al. [10], including an evaluation of the reliability of bits, depending on their quantization errors. Unreliable bits are then flipped when performing search, as a form of query expansion. To avoid using codebooks, in the context of memory limited devices such as mobile phones, Zhou et al. [11] have proposed the use of scalable cascaded hashing (SCH), performing sequential scalar quantization on the principal components, obtained using PCA, of SIFT descriptors. Chen and Hsieh [12] have recently proposed an approach that quantizes the differences of the bins of the SIFT descriptor, using the median computed on all the SIFT descriptors of a training set as a threshold.

c) Vector Quantization: Jégou et al. [1] have proposed to decompose the feature space into a Cartesian product of subspaces with lower dimensionality, that are quantized separately. This Product Quantization (PQ) method is efficient in solving the memory issues that arise when using vector quantization methods such as k-means, since it requires a much reduced number of centroids to obtain a code of the desired length. The method has obtained state-of-the-art results on a large scale SIFT features dataset, improving over methods such as SH [2] and Hamming Embedding [13]. This result is confirmed in the work of Chandrasekhar et al. [14], who have compared several compression schemes for SIFT features. However, its performance is dependent on the choice of the subspaces used.

The efficiency and efficacy of the Product Quantization method have led to the development of several variations and improvements. The idea of compositionality of the PQ approach has been further analyzed by Norouzi and Fleet [15], who have built upon it, proposing two variations of k-means: Orthogonal k-means and Cartesian k-means (ck-means). Also Ge et al. [16] have proposed another improvement of PQ, called OPQ, that minimizes quantization distortions w.r.t. space decomposition and quantization codebooks; He et al. [17] have proposed an affinity-preserving technique to approximate the Euclidean distance between codewords in the k-means method. Kalantidis and Avrithis [18] have presented a simple vector quantizer (LOPQ) which uses a local optimization over a rotation and a space decomposition, and applies a parametric solution that assumes a normal distribution. More recently, Guo et al. [19] have improved over OPQ and LOPQ by adding two quantization distortion properties of the Residual Vector Quantization (RVQ) model, which tries to restore quantization distortion errors instead of reducing them.

A few works have addressed the problem of indexing. Babenko and Lempitsky [20] have proposed an efficient similarity search method, called inverted multi-index (IMI); this approach generalizes the inverted index by replacing vector quantization inside inverted indices with product quantization, and building the multi-index as a multi-dimensional table (Multi-D-ADC). Another multi-index strategy has been proposed by Zheng et al. [21], where complementary features, i.e. binarized SIFT descriptors and local color features, are indexed in a coupled Multi-Index (c-MI), performing feature fusion at indexing level for content-based image retrieval. More recently, Babenko and Lempitsky [22] have addressed the problem of indexing CNN features, observing that IMI is inefficient for indexing such features, and proposing two extensions of IMI: the Non-Orthogonal Inverted Multi-Index (NO-IMI) and the Generalized Non-Orthogonal Inverted Multi-Index (GNO-IMI). Multi-scale matching of SIFT and CNN features binarized with LSH has been recently addressed in the work of Zheng et al. [23], proposing an indexing structure that uses compact pointers to associate the local features with the regional and global ones.

d) Neural Networks: Lin et al. [24] have proposed a deep learning framework to create hash-like binary codes for fast image retrieval. Hash codes are learned in a point-wise manner by employing a hidden layer for representing the latent concepts that dominate the class labels (when the data labels are available). This layer learns specific image representations and a set of hash-like functions.

Do et al. [25] have addressed the problem of learning binary hash codes for large scale image search using a deep model which tries to preserve similarity, balance and independence of images. Two sub-optimizations during the learning process allow the binary constraints to be solved efficiently.

Guo and Li [26] have proposed a method to obtain the binary hash code of a given image using binarization of the CNN outputs of a certain fully connected layer.

Zhang et al. [27] have proposed a very deep neural network (DNN) model for supervised learning of hash codes (VDSH). They use a training algorithm inspired by the alternating direction method of multipliers (ADMM) [28]. The method decomposes the training process into independent layer-wise local updates through auxiliary variables.

Xia et al. [29] have proposed a hashing method for image retrieval which simultaneously learns a representation of images and a set of hash functions.

A deep learning framework for hashing of multimodal data has been proposed by Wang et al. [30], using a multimodal Deep Belief Network to capture correlation in high-level space during pre-training, followed by learning a cross-modal autoencoder in the fine-tuning phase.

Lin et al. [31] have proposed to use unsupervised two-step hashing of CNN features. In the first step Stacked Restricted Boltzmann Machines learn binary embedding functions, then fine-tuning is performed to retain the metric properties of the original feature space.


The method proposed in this paper belongs to the family of methods based on vector quantization; compared to recent hashing algorithms based on neural networks, this allows hash codes to be learned with a much reduced computational cost, without loss in performance. Unlike previous vector quantization approaches, it associates the features to multiple codebook words, reducing quantization errors due to wrong associations to nearby codewords. Another difference, considering in particular the family of methods based on Product Quantization, is the fact that codewords are associated to single bits of the hash and not to portions of the feature; this avoids the need to partition features so as to reflect the structure of the descriptor, as shown in [1].

III. THE PROPOSED METHOD

The proposed method exploits a novel version of the k-means vector quantization approach, introducing the possibility of assigning a visual feature to multiple cluster centers during the quantization process. This approach greatly reduces the number of required cluster centers, as well as the required training data, performing a sort of quantized codebook soft assignment for an extremely compact hash code of visual features. Table I summarizes the symbols used in the following.

TABLE I
NOTATION TABLE

x_i   feature to be hashed
C_i   generic centroid
S_i   i-th cluster
k     number of centroids; length of the hash code
C_j   centroid associated to the j-th bit of the hash code
D     dimension of the feature

The first step of the computation is a typical k-means algorithm for clustering. Given a set of observations (x_1, x_2, ..., x_n), where each observation is a D-dimensional real vector, k-means clustering partitions the n observations into k (≤ n) sets S = {S_1, S_2, ..., S_k} so as to minimize the sum of the distances of each point in a cluster to its center C_i. Its objective is to find:

\[
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - C_i \|^2 \qquad (1)
\]

This process is convergent (to some local optimum), but the quality of the local optimum strongly depends on the initial assignment. We use the k-means++ [32] algorithm for choosing the initial values, to avoid the poor clusterings sometimes found by the standard k-means algorithm.

A. Multi-k-means Hashing

K-means is typically used to compute the hash code of visual features in unstructured vector quantization, because it minimizes the quantization error by satisfying the two Lloyd optimality conditions [1]. In a first step a dictionary is learned over a training set, and then hash codes of features are obtained by computing their distance from each cluster center: vectors are assigned to the nearest cluster center, whose code is used as hash code. Considering the case of 128-dimensional visual content descriptors, like SIFT or the FC7 layer of the VGG-M-128 CNN [33], this means that compressing them to 64 bit codes requires the use of k = 2^64 centroids. In this case the computational cost of learning a k-means based quantizer becomes expensive in terms of memory and time because: i) there is the need of a quantity of training data that is several times larger than k, and ii) the execution time of the algorithm becomes unfeasible. Using hierarchical k-means (HKM) makes it possible to reduce the execution time, but the problem of memory usage and of the size of the required learning set affects also this approach. Since the quantizer is defined by the k centroids, the use of quantizers with a large number of centroids may not be practical or efficient: if a feature has dimension D, there is the need to store k × D values to represent the codebook of the quantizer. A possible solution to this problem is to reduce the length of the hash signature, but this typically affects retrieval performance negatively. The use of product k-means quantization, proposed originally by Jégou et al. [1], overcomes this issue.

In our approach, instead, we propose to compute a sort of soft assignment within the k-means framework, to obtain very compact signatures and a small quantizer, thus reducing its memory requirements, while maintaining a retrieval performance similar to that of [1].

The proposed method, called multi-k-means (in the following abbreviated as m-k-means), starts by learning a standard k-means dictionary as shown in Eq. (1), using a very small number of centroids to maintain a low computational cost. Once we have obtained our C_1, ..., C_k centroids, the main difference resides in the assignment and creation of the hash code. Each centroid is associated to a specific bit of the hash code:

\[
\begin{cases}
\| x - C_j \| \le \delta & \Rightarrow\ j\text{-th bit} = 1 \\
\| x - C_j \| > \delta & \Rightarrow\ j\text{-th bit} = 0
\end{cases}
\qquad (2)
\]

where x is the feature point and δ is a threshold measure given by

\[
\delta =
\begin{cases}
\left( \prod_{j=1}^{k} \| x - C_j \| \right)^{1/k} & \text{geometric mean} \\
\frac{1}{k} \sum_{j=1}^{k} \| x - C_j \| & \text{arithmetic mean} \\
n\text{-th nearest distance } \| x - C_j \|,\ \forall j = 1, \dots, k
\end{cases}
\qquad (3)
\]

i.e. centroid j is associated to the j-th bit of the hash code of length k; the bit is set to 1 if the feature to be quantized is assigned to its centroid, and to 0 otherwise.

A feature can be assigned to more than one centroid using two main approaches:

i) m-k-means-t1 - using Eq. (2) and one of the first two thresholds of Eq. (3). In this case the feature vector is considered as belonging to all the centroids from which its distance is below the threshold. Experiments have shown that the arithmetic mean is more effective than the geometric one, and all the experiments report results obtained with it.


ii) m-k-means-n1 - using Eq. (2) and the third threshold of Eq. (3), i.e. assigning the feature to a predefined number n of nearest centroids.

We also introduce two variants (m-k-means-t2 and m-k-means-n2) of the previous approaches, obtained by randomly splitting the training data into two groups and creating two different codebooks for each feature vector. The final hash code is given by the union of these two codes; a minimal sketch of the resulting assignments is shown below.
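To make the assignment concrete, the following sketch implements the codebook learning of Eq. (1) and the single-codebook assignments of Eqs. (2)-(3). It is our own illustration, assuming NumPy and scikit-learn; the function names are ours, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(train, k=64, seed=0):
    """Learn one centroid per hash bit (k = code length, e.g. 64),
    with k-means++ initialization as described in the paper."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed)
    return km.fit(train).cluster_centers_            # shape (k, D)

def mkmeans_t1(x, centroids):
    """m-k-means-t1: bit j = 1 iff ||x - C_j|| <= delta, where delta is
    the arithmetic mean of the k distances (second case of Eq. 3)."""
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    delta = d.mean(axis=1, keepdims=True)
    return (d <= delta).astype(np.uint8)             # (N, k) binary codes

def mkmeans_n1(x, centroids, n=32):
    """m-k-means-n1: set to 1 the bits of the n nearest centroids
    (third case of Eq. 3)."""
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    codes = np.zeros_like(d, dtype=np.uint8)
    np.put_along_axis(codes, np.argsort(d, axis=1)[:, :n], 1, axis=1)
    return codes

# m-k-means-t2 / m-k-means-n2: learn two codebooks on two random halves
# of the training set and concatenate the two resulting codes.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.random((10000, 128), dtype=np.float32)   # SIFT-like, D=128
    C = learn_codebook(train, k=64)
    print(mkmeans_t1(train[:5], C).shape)                # (5, 64)
```

Note that with k = 64 the whole codebook is only k × D floats, which is the source of the memory savings quantified in Sect. III-B.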
With the proposed approach it is possible to create hash signatures using a much smaller number of centroids than the usual k-means baseline, since each centroid is directly associated to a bit of the hash code. This approach can be considered a quantized version of codebook soft assignment [34] and, similarly, it alleviates the problem of codeword ambiguity while reducing the quantization error.

Fig. 1 illustrates the quantization process and the resulting hash codes in three cases: one in which a vector is assigned to a variable number of centroids (m-k-means-t1), one in which a vector is assigned to a predefined number of centroids (m-k-means-n1), and one in which the resulting code is created by the union of two different codes created using two different codebooks (m-k-means-t2 and m-k-means-n2). In all cases the feature is assigned to more than one centroid. An evaluation of these approaches is reported in Sect. IV.

Fig. 1. Toy examples illustrating the proposed method: (top) features can be assigned (green line) to a variable number of nearest clusters (e.g. those with distances below the mean δ, i.e. m-k-means-t1); (middle) features can be assigned to a fixed number of clusters (e.g. the 2 nearest clusters, i.e. m-k-means-n1); (bottom) hash code created from two different codebooks (m-k-means-x2, where x can be either t or n). If a feature is assigned to a centroid the corresponding bit in the hash code is set to 1.

Typically a multi probe approach is used to solve the problem of ambiguous assignment to a codebook centroid (in the case of vector quantization, as in the coarse quantization step of PQ [1]) or of quantization error (e.g. in the case of scalar quantization, as in [9], [10]); this technique stems from the approach originally proposed in [35] to reduce the need of creating a large number of hash tables in LSH. The idea is that if an object is close to a query object q, but is not hashed to the same bucket as q, it is still likely hashed to a bucket that is near, i.e. to a bucket associated with a hash that has a small difference w.r.t. the hash of q. With this approach one or more bits of the query hash code are flipped to perform a query expansion, improving recall at the expense of computational cost and search time. In fact, if we choose to try all the hashes within a Hamming distance of 1, we have to create variations of the original hash of q by flipping each bit of the hash, one at a time; for a hash code of length k this means repeating the query with k additional hashes, as in the sketch below. In the proposed method this need for multi probe queries is greatly reduced, because of the possibility of assigning features to more than one centroid. For example, consider either Fig. 1 (top) or (middle): if a query point near f1 or f2 falls in the Voronoi cell of centroid C6, using standard k-means it could be retrieved only through a multi probe query, while the proposed approach maintains the same hash code.
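A minimal illustration of radius-1 probing (our own example of the technique the method avoids, not part of the proposed method) makes this cost explicit:

```python
def radius1_probes(code):
    """Yield a k-bit hash code and its k single-bit-flip variations:
    1 + k probes against the index for a Hamming radius of 1."""
    yield tuple(code)
    for j in range(len(code)):
        flipped = list(code)
        flipped[j] ^= 1
        yield tuple(flipped)

# A 64-bit query hash expands into 65 probes.
assert sum(1 for _ in radius1_probes([0] * 64)) == 65
```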

B. Computational Complexity

Let us consider a vector with dimensionality D and a desired hash code length of 64 bits. Standard k-means has an assignment complexity of kD, where k = 2^64; the proposed approach instead needs only k' = 64 centroids, has a complexity of k'D, and requires k'D floats to store the codebook. Product Quantization requires k* × D floats for the codebook and has an assignment complexity of k*D, where k* = k^{1/m}, typically using k* = 256 and m = 8 for a 64 bit length [1]; in this case the cost of the proposed method is a quarter of the cost of PQ.
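As a worked example of this factor of four, assuming 4-byte floats and D = 128: the proposed quantizer stores k'D = 64 × 128 = 8,192 floats (32 KB), while the PQ codebook stores k* × D = 256 × 128 = 32,768 floats (128 KB); an exact k-means quantizer with k = 2^64 centroids would be infeasible to store or train at all.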

IV. EXPERIMENTAL RESULTS

The variants of the proposed method (m-k-means-t1, m-k-means-n1, m-k-means-t2 and m-k-means-n2) have been thoroughly compared to several state-of-the-art approaches using standard datasets, experimental setups and evaluation metrics.




TABLE II
BIGANN DATASETS CHARACTERISTICS

vector dataset                     SIFT1M       SIFT1B          GIST1M
descriptor dimensionality D        128          128             960
# learning set vectors             100,000      100,000,000     500,000
# database set vectors             1,000,000    1,000,000,000   1,000,000
# queries set vectors              10,000       10,000          1,000
# nearest vectors for each query   100          1000            100

TABLE III
DEEP1B DATASET CHARACTERISTICS

descriptor dimensionality D        96
# learning set vectors             358,480,000
# database set vectors             1,000,000,000
# queries set vectors              10,000
# nearest vectors for each query   1

A. Datasets

BIGANN Dataset [1], [36] is a large-scale dataset commonly used to compare methods for visual feature hashing and approximate nearest neighbor search [1], [15], [16], [18], [20], [36], [37]. The dataset is composed of three different sets of SIFT and GIST descriptors, each one divided into three subsets: a learning set, a query set and a base set; each query has corresponding ground truth results in the base set, computed in an exhaustive way with Euclidean distance, ordered from the most similar to the most different. For SIFT1M and SIFT1B, query and base descriptors have been extracted from the INRIA Holidays images [38], while the learning set has been extracted from Flickr images. For GIST1M, query and base descriptors are from the INRIA Holidays and Flickr 1M datasets, while the learning vectors are from [39]. In all cases the query descriptors are from the query images of INRIA Holidays (see Figure 2). The characteristics of the dataset are summarized in Table II.

Fig. 2. Sample images from the INRIA Holidays dataset. The left column shows the query images, the other columns show similar images.

DEEP1B Dataset [22] is a recent dataset produced using a deep CNN based on the GoogLeNet [40] architecture and trained on the ImageNet dataset [41]. Descriptors are extracted from the outputs of the last fully-connected layer, compressed using PCA to 96 dimensions, and l2-normalized. The characteristics of the dataset are summarized in Table III.

CIFAR-10 Dataset [42] consists of 60,000 colour images (32 × 32 pixels) in 10 classes, with 6,000 images per class (see Figure 3). The dataset is split into training and test sets, composed of 50,000 and 10,000 images respectively. A retrieved image is considered relevant for the query if it belongs to the same class. This dataset has been used for ANN retrieval using hash codes in [24], [29].

Fig. 3. Sample images from the CIFAR-10 dataset.

MNIST Dataset [43] consists of 70,000 handwritten digit images (28 × 28 pixels, see Figure 4). The dataset is split into 60,000 training examples and 10,000 test examples. Similarly to CIFAR-10, a retrieved image is considered relevant if it belongs to the same class as the query. This dataset has been used for ANN retrieval in [24], [29].

Image retrieval datasets INRIA Holidays [38], Oxford 5K [44] and Paris 6K [45] are three datasets typically used to evaluate image retrieval systems. For each dataset a number of query images is given, with the associated ground truth. INRIA Holidays is composed of 1,491 images, of which 500 are used as queries; Oxford 5K is composed of 5,062 images with 55 query images, and Paris 6K is made of 6,412 images with 55 query images. We used the query images and ground truth provided for each dataset, adding 100,000 distractor images from Flickr 100K [44].

B. Evaluation Metrics

The performance of ANN retrieval in the BIGANN dataset is evaluated using recall@R, which is used in most of the results reported in the literature [1], [15], [16], [18], [20], [36]: for varying values of R, it is the average rate of queries for which the 1-nearest neighbor is retrieved in the top R positions. In the case of R = 1 this metric coincides with precision@1. The same measure has been used by the authors of the DEEP1B dataset [22].
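Both metrics used in this evaluation are straightforward to compute; the following sketch (our own helper functions, not the authors' code) implements recall@R as just defined, together with the average precision used for the MAP of Eqs. (4)-(5) below:

```python
import numpy as np

def recall_at_R(ranked_ids, true_nn_ids, R):
    """Average rate of queries whose exact 1-NN is in the top R results.
    ranked_ids: (num_queries, depth) retrieved ids, best first;
    true_nn_ids: (num_queries,) id of the exact nearest neighbor."""
    hits = [nn in ranked[:R] for ranked, nn in zip(ranked_ids, true_nn_ids)]
    return float(np.mean(hits))

def average_precision(relevance):
    """AveP (Eq. 5): area under the precision-recall curve, computed
    from a binary relevance list ordered by rank. MAP (Eq. 4) is its
    mean over all queries."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((prec_at_k * rel).sum() / rel.sum())
```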


Fig. 5. Framework used for CNN feature extraction on CIFAR-10 [24]: we use the values of the nodes of the FCh layer as feature (48 dimensions).

Fig. 4. Sample images from the MNIST dataset.
Performance of image retrieval in CIFAR-10, MNIST, INRIA Holidays, Oxford 5K and Paris 6K is measured following the setup of [29], using Mean Average Precision:

\[
MAP = \frac{\sum_{q=1}^{Q} AveP(q)}{Q} \qquad (4)
\]

where

\[
AveP = \int_{0}^{1} p(r)\, dr \qquad (5)
\]

is the area under the precision-recall curve and Q is the number of queries.
C. Configurations and Implementations

a) BIGANN: We use settings which reproduce top performances at 64-bit codes. We perform search with a non-exhaustive approach: for each query, the 64 bit binary hash code of the feature and the Hamming distance are used to extract small subsets of candidates from the whole database set (Table II). The Euclidean distance is then used to re-rank the nearest feature points, calculating recall@R values on these subsets.
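This two-stage search can be sketched as follows (our illustration of the pipeline, with hypothetical function names):

```python
import numpy as np

def hamming_shortlist(query_code, db_codes, shortlist_size):
    """Stage 1: rank the database hash codes (arrays of 0/1 bits,
    shape (N, k)) by Hamming distance to the query code and keep
    a small shortlist of candidates."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:shortlist_size]

def euclidean_rerank(query_vec, db_vecs, candidate_ids, R):
    """Stage 2: re-rank the shortlist with the exact Euclidean
    distance on the original descriptors, keeping the top R."""
    d = np.linalg.norm(db_vecs[candidate_ids] - query_vec, axis=1)
    return candidate_ids[np.argsort(d)][:R]
```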
b) DEEP1B: We use the CNN features computed in [22], hashed to 64-bit codes. The search process is done in a non-exhaustive way, using Hamming distances to reduce the subsets of candidates from the whole database set. After extracting a shortlist of candidates we perform a re-rank step based on Euclidean distances and calculate recall@R values.
c) CIFAR-10: We use features computed with the framework proposed in [24] (Figure 5). The process is carried out in two steps: in the first step a supervised pre-training on the large-scale ImageNet dataset [46] is performed; in the second step fine-tuning of the network is performed, with the insertion of a latent layer that simultaneously learns domain specific feature representations and a set of hash-like functions. The authors used the pre-trained CNN model proposed by Krizhevsky et al. [46] and implemented in the Caffe CNN library [47]. In our experiments we use features coming from the FCh layer (Latent Layer H), which has a size of 48 nodes.

d) MNIST: We use the LeNet CNN to compute our features on MNIST. This is a network architecture developed by LeCun [48] that was especially designed for recognizing handwritten digits, reading zip codes, etc. It is an 8-layer network with 1 input layer, 2 convolutional layers, 2 non-linear down-sampling layers, 2 fully connected layers and a Gaussian connected layer with 10 output classes. We used a modified version of LeNet [47], obtaining features from the first fully connected layer.

We perform search with a non-exhaustive approach on both the CIFAR-10 and MNIST datasets. For each image we extract a 48-dimensional feature vector for CIFAR-10, and a 500-dimensional feature vector for MNIST, from the respective networks, and then we generate a 48 bit binary hash code using the proposed methods of Sect. III. The Hamming distance is used to select the nearest hash codes for each query, and the similarity measure given by

\[
\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\| A \|\, \| B \|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\ \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (6)
\]

where A_i and B_i are the components of the original feature vectors A and B, is used to re-rank the nearest visual features.
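The cosine re-ranking of Eq. (6) amounts to the following (our sketch, reusing the shortlist from the previous stage):

```python
import numpy as np

def cosine_rerank(query_vec, db_vecs, candidate_ids, R):
    """Re-rank shortlisted original features by cosine similarity
    (Eq. 6); the small epsilon guards against zero-norm vectors."""
    cand = db_vecs[candidate_ids]
    sims = cand @ query_vec / (np.linalg.norm(cand, axis=1)
                               * np.linalg.norm(query_vec) + 1e-12)
    return candidate_ids[np.argsort(-sims)][:R]
```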


e) Holidays, Oxford 5K, Paris 6K: We used CNN features extracted using the 1024d average pooling layer of GoogLeNet [40], which in initial experiments has proven to be more effective than the FC7 layer of VGG [49] used in [50]. When testing on a dataset, training is performed using the other two datasets. Features have been hashed to 64 bit binary codes, a length that has proved to be the best compromise between compactness and representativeness, and allows us to compare with a number of competing approaches.

D. Results on BIGANN: SIFT1M, GIST1M

In this set of experiments the proposed approach and its variants are compared on the SIFT1M (Fig. 6) and GIST1M (Fig. 7) datasets against several methods presented in Section II: Product Quantization (ADC and IVFADC) [1], PQ-RO [19], PQ-RR [19], Cartesian k-means [15], OPQ-P [16], [51], OPQ-NP [16], [51], LOPQ [18], a non-exhaustive adaptation of OPQ [16] called I-OPQ [18], RVQ [52], RVQ-P [19] and RVQ-NP [19].

ADC (Asymmetric Distance Computation) is characterized by the number of sub-vectors m and the number of quantizers per sub-vector k*, and produces a code of length m × log2 k*.

IVFADC (Inverted File with Asymmetric Distance Computation) is characterized by the codebook size k' (number of centroids associated to each quantizer), the number of neighbouring cells w visited during the multiple assignment, the number of sub-vectors m and the number of quantizers per sub-vector k*, which is in this case fixed to k* = 256. The length of the final code is given by m × log2 k*.

PQ-RO [19] is the Product Quantization approach with data projection obtained by randomly ordering the dimensions.

PQ-RR [19] is the Product Quantization approach with data projection by PCA followed by random rotation.

Cartesian k-means (ck-means) [15] models region centers as additive combinations of subcenters. Let m be the number of subcenters, each with h elements; then the total number of model centers is k = h^m, the total number of subcenters is h × m, and the number of bits of the signature is m × log2 h.

OPQ-P [16], [51] is the parametric version of Optimized Product Quantization (OPQ), which assumes a parametric Gaussian distribution of the features and performs space decomposition using an orthonormal matrix computed from the covariance matrix of the data.

OPQ-NP [16], [51] is the non-parametric version of OPQ, which does not assume any data distribution and alternately optimizes sub-codebooks and space decomposition.

LOPQ (Locally Optimized Product Quantization) [18] is a vector quantizer that combines low distortion with fast search, applying a local optimization over rotation and space decomposition.

I-OPQ [18] is a non-exhaustive adaptation of OPQ (Optimized Product Quantization [16]) which uses either OPQ-P or OPQ-NP global optimization.

RVQ [52] approximates the quantization error by another quantizer instead of discarding it. In this method several stage-quantizers, each one with its corresponding stage-codebook, are connected sequentially; each stage-quantizer approximates the residual vector of the preceding stage by one of the centroids of its stage-codebook and generates a new residual vector for the next stage.

RVQ-P [19] is a parametric version of RVQ, where the stage-codebooks and space decomposition of RVQ are optimized using SVD.

RVQ-NP [19] is a non-parametric version of RVQ, using the same techniques as RVQ-P but optimizing a single space decomposition for all the stages.

The parameters of the proposed methods are set as follows: for m-k-means-t1 we use as threshold the arithmetic mean of the distances between feature vectors and centroids; m-k-means-n1 creates the hash code by setting to 1 the positions corresponding to the first 32 (SIFT1M) or first 48 (GIST1M) nearest centroids of each feature; m-k-means-t2 and m-k-means-n2 create two different sub hash codes for each feature by splitting the training phase into two parts, and combine these two sub-codes into one single code to create the final signature. Since the training data is split randomly, these experiments are averaged over a set of 10 runs.

Fig. 6. Recall@R on SIFT1M - Comparison between our method (m-k-means-t1, m-k-means-n1 with n=32, m-k-means-t2 and m-k-means-n2 with n=32), the Product Quantization method (PQ ADC and PQ IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ), a Locally optimized product quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

Fig. 7. Recall@R on GIST1M - Comparison between our method (m-k-means-t1, m-k-means-n1 with n=48, m-k-means-t2 and m-k-means-n2 with n=48), the Product Quantization method (ADC and IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ) [18], a Locally optimized product quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

The proposed method, in all its variants, obtains the best results when considering the more challenging values of recall@R, i.e. with a small number of nearest neighbors, like 1, 10 and 100. When R goes to 1000 and 10,000 it still obtains the best results, and in the case of SIFT1M it is on par with ck-means [15]. On GIST1M the method consistently outperforms all the other methods for all the values of R, except for R=1 where RVQ-P [19] is better.


E. Results on BIGANN: SIFT1B

In this experiment we compare our method on the large scale SIFT1B dataset (Fig. 8) against LOPQ and its sub-optimal variant LOR+PQ [18], the single-index PQ approaches IVFADC+R [36] and ADC+R [36], I-OPQ [16] and ck-means [15], and a multi-index method, Multi-D-ADC [20]. The methods of [36] differ from the standard IVFADC [1] and ADC [1] in using short quantization codes to re-rank the NN candidates. m-k-means-t1 uses the same setup as the previous experiment; m-k-means-n1 uses the first 24 nearest centroids for each feature.

Also in this experiment the proposed method obtains the best results, in particular when considering the more challenging small value of R for the recall@R measure (R = 1), with an improvement between 10% and 20% with respect to the best results of the compared methods.

Fig. 8. Recall@R on SIFT1B - Comparison between our method (m-k-means-t1, m-k-means-n1 with n=24), the Product Quantization method [36], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ) [18], a multi-index method (Multi-D-ADC), and a Locally optimized product quantization method (LOPQ) with a sub-optimal variant (LOR+PQ) [18].

F. Results on DEEP1B

Experiments on DEEP1B [22] are shown in Fig. 9. We use a configuration with a hash code length of 64 bits for the m-k-means-t1 and m-k-means-n1 variants. The comparison is made against IMI [22], NO-IMI [22] and GNO-IMI [22], for which we report the results obtained by the authors using a re-rank approach for codes of 64 bits. Following the experimental setup used in [22], we considered R = 1, R = 5 and R = 10 for the recall@R measure.

The proposed method obtains the best results in both configurations (m-k-means-t1 and m-k-means-n1); considering R = 1 it obtains a result approximately three times greater than the other methods, and for the other values of R the improvement is between 2× and 1.5×.

Fig. 9. Recall@R on DEEP1B - Comparison between our method (m-k-means-t1, m-k-means-n1 with n=24), the Inverted Multi-Index (IMI) [22], the Non-Orthogonal Inverted Multi-Index (NO-IMI) [22] and the Generalized Non-Orthogonal Inverted Multi-Index (GNO-IMI) [22].

G. Results on CIFAR-10, MNIST

In the experiments on the CIFAR-10 [42] and MNIST [43] image datasets we use the following configurations for the proposed method: hash code length of 48 bits (the same length used by the compared methods), arithmetic mean for the m-k-means-t1 variant, and n = 24 for m-k-means-n1.

Queries are performed using a random selection of 1,000 query images (100 images for each class), considering a category-label ground truth relevance Rel(i) between a query q and the i-th ranked image: Rel(i) ∈ {0, 1}, with 1 when the query and the i-th image have the same label and 0 otherwise. This setup has been used in [24], [29]. Since we select queries in a random way, the results of these experiments are averaged over a set of 10 runs.

TABLE IV
MAP RESULTS ON CIFAR-10 AND MNIST. COMPARISON BETWEEN OUR METHOD (m-k-means-t1, m-k-means-n1 WITH n=24, m-k-means-t2 AND m-k-means-n2 WITH n=24) AND KSH [53], ITQ-CCA [54], MLH [55], BRE [56], CNNH [29], CNNH+ [29], KEVINNET [24], LSH [57], SH [2], ITQ [54]. RESULTS FROM [7], [24], [29].

Method           CIFAR-10 (MAP)   MNIST (MAP)
LSH [57]         0.120            0.243
SH [2]           0.130            0.250
ITQ [54]         0.175            0.429
BRE [56]         0.196            0.634
MLH [55]         0.211            0.654
ITQ-CCA [54]     0.295            0.726
KSH [53]         0.356            0.900
CNNH [29]        0.522            0.960
CNNH+ [29]       0.532            0.975
ACH [7]          0.600            -
KevinNet [24]    0.894            0.985
m-k-means-t1     0.953            0.972
m-k-means-t2     0.849            0.964
m-k-means-n1     0.972            0.969
m-k-means-n2     0.901            0.959


We compared the proposed approach with several state-of-the-art hashing methods, including supervised (KSH [53], ITQ-CCA [54], MLH [55], BRE [56], CNNH [29], CNNH+ [29], KevinNet [24]) and unsupervised methods (LSH [57], SH [2], ITQ [54]).

The proposed method obtains the best results on the more challenging of the two datasets, i.e. CIFAR-10. The comparison with the KevinNet [24] method is interesting since we use the same features, but we obtain better results for all the variants of the proposed method except one. On the MNIST dataset the best results of our approach are comparable with the second best method [29], and in any case are not far from the best approach [24].

H. Results on INRIA Holidays, Oxford 5K and Paris 6K

In this experiment we evaluate the effects of the method parameters, i.e. the number of nearest centroids n for the m-k-means-n1 method and the Hamming distance threshold (H). The proposed approaches are compared with several state-of-the-art methods, among which the recent UTH method [31]; while some of these methods were originally proposed for engineered features, they have been evaluated on CNN features (results reported from [31]). Retrieval is performed using the coarse-to-fine approach used in [24], where the hash is used to select a candidate list of images and CNN descriptors are used to re-rank this list. For the sake of brevity only some combinations of the parameters of the proposed methods are reported. The m-k-means-n1 method, with a Hamming distance H ≥ 6 and n = 6, greatly outperforms any state-of-the-art hashing method. Also m-k-means-t1 outperforms the competing methods, although with a smaller improvement. In general both methods are robust with respect to the parameters.

TABLE V
MAP RESULTS ON HOLIDAYS, OXFORD 5K AND PARIS 6K DATASETS. THE PROPOSED METHODS OUTPERFORM ALL THE CURRENT STATE-OF-THE-ART METHODS. ALL HASHES ARE 64 BIT LONG.

Method                        Holidays   Oxford 5K   Paris 6K
ITQ [54]                      0.537      0.230       -
BPBC [60]                     0.381      0.225       -
PCAHash [54]                  0.528      0.239       -
LSH [61]                      0.431      0.239       -
SKLSH [62]                    0.241      0.134       -
SH [2]                        0.522      0.232       -
SRBM [63]                     0.516      0.212       -
UTH [31]                      0.571      0.240       -
m-k-means-n1 (n=6, H=10)      0.756      0.460       0.676
m-k-means-n1 (n=6, H=16)      0.756      0.461       0.678
m-k-means-n1 (n=10, H=10)     0.743      0.456       0.608
m-k-means-n1 (n=10, H=16)     0.756      0.460       0.677
m-k-means-t1 (n=10)           0.683      0.362       0.480

Of course, using methods that employ the full CNN descriptors without hashing it is possible to obtain better results, for example with the schemes tested in the work by Wan et al. [58], or with the global CNN descriptor proposed by Gordo et al. [59], but this comes at the expense of a much larger memory occupation and higher computational costs for retrieval (e.g. the descriptor proposed by Gordo is 512 floats long). In this case the benefit of using a hashing schema is to allow better scaling to larger datasets and improved computational costs.

I. Results varying hash code length and n

In this experiment we compare the behavior of the retrieval performance for different lengths of the hash code (for m-k-means-t1) and for different values of the n nearest neighbors (for m-k-means-n1). Experiments were made on SIFT1M for different values of recall@R; Figure 10 reports the results. We can observe that the performance is good for each signature length and converges to 1 from recall@10 onward; differences in performance are already small for R = 5. This means that our binary coding method maintains a good representation across different signature lengths and, for the m-k-means-n1 variant, also for different values of n.

Fig. 10. Experiments on SIFT1M with different signature lengths and values of n. We use k = 32, k = 64 and k = 128 for the m-k-means-t1 variant; for m-k-means-n1 we use k = 64 with n = 8, n = 16 and n = 32.

V. CONCLUSION

We have proposed a new version of the k-means based hashing schema called multi-k-means – with 4 variants: m-k-means-t1, m-k-means-t2, m-k-means-n1 and m-k-means-n2 – which uses a small number of centroids, guarantees a low computational cost and results in a compact quantizer. These characteristics are achieved thanks to the association of the centroids to the bits of the hash code, which greatly reduces the need for a large number of centroids to produce a code of the needed length. Another advantage of the method is that it has no parameters in its m-k-means-t1 and m-k-means-t2 variants, and only one parameter for the other two variants; in any case, as shown by the experiments, it is quite robust to variations of this parameter, as well as of the hash code length.

Our compact hash signature is able to represent high dimensional visual features with a very high efficiency in approximate nearest neighbor (ANN) retrieval, both on local and on global visual features. This characteristic stems from the multiple-assignment strategy, which reduces the need of a multi probe strategy to retrieve hash codes that differ by a few bits, typically due to quantization errors, and results in a better approximated nearest neighbour estimation using the Hamming distance.


typically due to quantization errors, and results in better [17] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving
approximated nearest neighbour estimation using Hamming quantization method for learning binary compact codes,” in Proc. of
CVPR, 2013.
distance. The method has been tested on large scale datasets [18] Y. Kalantidis and Y. Avrithis, “Locally optimized product quantization
of engineered (SIFT and GIST) and learned (deep CNN) for approximate nearest neighbor search,” in Proc. of CVPR, 2014.
features, obtaining results that outperform or are comparable to [19] D. Guo, C. Li, and L. Wu, “Parametric and nonparametric residual vector
quantization optimizations for ANN search,” Neurocomputing, 2016.
more complex state-of-the-art approaches. The m-k-means-n1 [20] A. Babenko and V. Lempitsky, “The inverted multi-index,” in Proc. of
variant typically performs better than m-k-means-t1 , especially CVPR, 2012.
when dealing with modern CNN features. [21] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Packing and padding: Coupled
multi-index for accurate image retrieval,” in Proc. of CVPR, 2014.
[22] A. Babenko and V. Lempitsky, “Efficient indexing of billion-scale
ACKNOWLEDGMENT

This work is partially supported by the "Social Museum and Smart Tourism" project (CTN01 00034 231545).
This research is based upon work supported [in part] by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA contract number 2014-14071600011. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
REFERENCES

[1] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.
[2] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. of NIPS, 2009.
[3] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.
[4] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon, "Spherical hashing: Binary code embedding with hyperspheres," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2304–2316, 2015.
[5] Z. Jin, C. Li, Y. Lin, and D. Cai, "Density sensitive hashing," IEEE Transactions on Cybernetics, vol. 44, no. 8, pp. 1362–1371, 2014.
[6] S. Du, W. Zhang, S. Chen, and Y. Wen, "Learning flexible binary code for linear projection based hashing with random forest," in Proc. of ICPR, 2014.
[7] Y. Lv, W. W. Y. Ng, Z. Zeng, D. S. Yeung, and P. P. K. Chan, "Asymmetric cyclical hashing for large scale image retrieval," IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1225–1235, 2015.
[8] L. Paulevé, H. Jégou, and L. Amsaleg, "Locality sensitive hashing: A comparison of hash function types and querying mechanisms," Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
[9] W. Zhou, Y. Lu, H. Li, and Q. Tian, "Scalar quantization for large scale image search," in Proc. of ACM MM, 2012.
[10] G. Ren, J. Cai, S. Li, N. Yu, and Q. Tian, "Scalable image search with reliable binary code," in Proc. of ACM MM, 2014.
[11] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, "Towards codebook-free: Scalable cascaded hashing for mobile image search," IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 601–611, 2014.
[12] C.-C. Chen and S.-L. Hsieh, "Using binarization and hashing for efficient SIFT matching," Journal of Visual Communication and Image Representation, vol. 30, pp. 86–93, 2015.
[13] M. Jain, H. Jégou, and P. Gros, "Asymmetric Hamming embedding: Taking the best of our bits for large scale image search," in Proc. of ACM MM, 2011.
[14] V. Chandrasekhar, M. Makar, G. Takacs, D. Chen, S. S. Tsai, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, and B. Girod, "Survey of SIFT compression schemes," in Proc. of WMPP, 2010.
[15] M. Norouzi and D. Fleet, "Cartesian k-means," in Proc. of CVPR, 2013.
[16] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization for approximate nearest neighbor search," in Proc. of CVPR, 2013.
[17] K. He, F. Wen, and J. Sun, "K-means hashing: An affinity-preserving quantization method for learning binary compact codes," in Proc. of CVPR, 2013.
[18] Y. Kalantidis and Y. Avrithis, "Locally optimized product quantization for approximate nearest neighbor search," in Proc. of CVPR, 2014.
[19] D. Guo, C. Li, and L. Wu, "Parametric and nonparametric residual vector quantization optimizations for ANN search," Neurocomputing, 2016.
[20] A. Babenko and V. Lempitsky, "The inverted multi-index," in Proc. of CVPR, 2012.
[21] L. Zheng, S. Wang, Z. Liu, and Q. Tian, "Packing and padding: Coupled multi-index for accurate image retrieval," in Proc. of CVPR, 2014.
[22] A. Babenko and V. Lempitsky, "Efficient indexing of billion-scale datasets of deep descriptors," in Proc. of CVPR, 2016.
[23] L. Zheng, S. Wang, J. Wang, and Q. Tian, "Accurate image search with multi-scale contextual evidences," International Journal of Computer Vision, vol. 120, no. 1, pp. 1–13, Oct. 2016.
[24] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen, "Deep learning of binary hash codes for fast image retrieval," in Proc. of CVPR, 2015.
[25] T.-T. Do, A.-Z. Doan, and N.-M. Cheung, "Discrete hashing with deep neural network," arXiv preprint arXiv:1508.07148, 2015.
[26] J. Guo and J. Li, "CNN based hashing for image retrieval," arXiv preprint arXiv:1509.01354, 2015.
[27] Z. Zhang, Y. Chen, and V. Saligrama, "Supervised hashing with deep neural networks," arXiv preprint arXiv:1511.04524, 2015.
[28] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[29] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, "Supervised hashing for image retrieval via image representation learning," in Proc. of AAAI, 2014.
[30] D. Wang, P. Cui, M. Ou, and W. Zhu, "Learning compact hash codes for multimodal representations using orthogonal deep structure," IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1404–1416, 2015.
[31] J. Lin, O. Morère, J. Petta, V. Chandrasekhar, and A. Veillard, "Tiny descriptors for image retrieval with unsupervised triplet hashing," in Proc. of DCC, 2016.
[32] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. of ACM-SIAM SODA, 2007.
[33] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in Proc. of BMVC, 2014.
[34] J. van Gemert, C. Veenman, A. Smeulders, and J.-M. Geusebroek, "Visual word ambiguity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–1283, 2010.
[35] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proc. of VLDB, 2007.
[36] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, "Searching in one billion vectors: re-rank with source coding," in Proc. of ICASSP, 2011.
[37] M. Norouzi, A. Punjani, and D. Fleet, "Fast exact search in Hamming space with multi-index hashing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1107–1119, 2014.
[38] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. of ECCV, 2008.
[39] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. of CVPR, 2015.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. of CVPR, 2009.
[42] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[43] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[44] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. of CVPR, 2007.
[45] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. of CVPR, 2008.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. of NIPS, 2012.
[47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. of ACM MM, 2014.
[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[49] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. of ICLR, 2015.
[50] T. Uricchio, M. Bertini, L. Seidenari, and A. Del Bimbo, "Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging," in Proc. of ICCV Workshops, 2015.
[51] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 744–755, 2014.
[52] Y. Chen, T. Guan, and C. Wang, "Approximate nearest neighbor search by residual vector quantization," Sensors, vol. 10, no. 12, pp. 11259–11273, 2010.
[53] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. of CVPR, 2012.
[54] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.
[55] M. Norouzi and D. J. Fleet, "Minimal loss hashing for compact binary codes," in Proc. of ICML, 2011.
[56] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. of NIPS, 2009.
[57] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. of VLDB, 1999.
[58] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, "Deep learning for content-based image retrieval: A comprehensive study," in Proc. of ACM MM, 2014.
[59] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Proc. of ECCV, 2016.
[60] Y. Gong, S. Kumar, H. Rowley, and S. Lazebnik, "Learning binary codes for high-dimensional data using bilinear projections," in Proc. of CVPR, 2013.
[61] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proc. of SoCG, 2004.
[62] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Proc. of NIPS, 2009.
[63] V. Chandrasekhar, J. Lin, O. Morere, A. Veillard, and H. Goh, "Compact global descriptors for visual search," in Proc. of DCC, 2015.
APPENDIX
EXPERIMENTAL RESULTS ON BIGANN AND DEEP1B REPORTED AS TABLES

TABLE VI
Recall@R on SIFT1M. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=32, m-k-means-t2 and m-k-means-n2 with n=32), the Product Quantization method (PQ ADC and PQ IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ), a Locally Optimized Product Quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

Method             R@1    R@10   R@100  R@1000  R@10000
PQ (ADC) [1]       0.224  0.600  0.927  0.996   0.999
PQ (IVFADC) [1]    0.320  0.739  0.953  0.972   0.972
PQ-RO [19]         0.177  0.501  0.854  N/A     N/A
PQ-RR [19]         0.107  0.331  0.695  N/A     N/A
ck-means [15]      0.231  0.635  0.930  1       1
OPQ-P [16], [51]   0.219  0.563  0.917  N/A     N/A
OPQ-NP [16], [51]  0.242  0.627  0.938  N/A     N/A
I-OPQ [18]         0.299  0.691  0.875  0.888   0.888
LOPQ [18]          0.380  0.780  0.886  0.888   0.888
RVQ [52]           0.264  0.659  0.949  1       1
RVQ-P [19]         0.397  0.821  0.983  N/A     N/A
RVQ-NP [19]        0.271  0.686  0.958  N/A     N/A
m-k-means-t1       0.501  0.988  1      1       1
m-k-means-t2       0.590  0.989  1      1       1
m-k-means-n1       0.436  0.986  1      1       1
m-k-means-n2       0.561  0.986  1      1       1

TABLE VII
Recall@R on GIST1M. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=48, m-k-means-t2 and m-k-means-n2 with n=48), the Product Quantization method (ADC and IVFADC) [1], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ) [18], a Locally Optimized Product Quantization method (LOPQ) [18], OPQ-P and OPQ-NP [16], [51], and PQ-RO, PQ-RR, RVQ-P and RVQ-NP [19].

Method             R@1    R@10   R@100  R@1000  R@10000
PQ (ADC) [1]       0.145  0.315  0.650  0.932   0.997
PQ (IVFADC) [1]    0.180  0.435  0.740  0.966   0.992
PQ-RO [19]         0.034  0.056  0.136  N/A     N/A
PQ-RR [19]         0.033  0.062  0.124  N/A     N/A
ck-means [15]      0.135  0.335  0.728  0.952   0.985
OPQ-P [16], [51]   0.095  0.297  0.629  N/A     N/A
OPQ-NP [16], [51]  0.089  0.277  0.642  N/A     N/A
I-OPQ [18]         0.146  0.410  0.729  0.862   0.866
LOPQ [18]          0.160  0.461  0.756  0.860   0.866
RVQ [52]           0.095  0.276  0.656  0.936   1
RVQ-P [19]         0.309  0.700  0.950  N/A     N/A
RVQ-NP [19]        0.107  0.314  0.678  N/A     N/A
m-k-means-t1       0.111  0.906  1      1       1
m-k-means-t2       0.123  0.890  1      1       1
m-k-means-n1       0.231  0.940  1      1       1
m-k-means-n2       0.265  0.905  1      0.999   0.999

TABLE VIII
Recall@R on SIFT1B. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=24), the Product Quantization method [36], the Cartesian k-means method (ck-means) [15], a non-exhaustive adaptation of the Optimized Product Quantization method (I-OPQ) [18], a multi-index method (Multi-D-ADC) [20], and a Locally Optimized Product Quantization method (LOPQ) with a sub-optimal variant (LOR+PQ) [18].

Method              R@1    R@10   R@100
PQ (ADC+R) [36]     0.656  0.970  0.985
PQ (IVFADC+R) [36]  0.630  0.977  0.983
ck-means [15]       0.084  0.288  0.637
I-OPQ [18]          0.114  0.399  0.777
Multi-D-ADC [20]    0.165  0.517  0.860
LOR+PQ [18]         0.183  0.565  0.889
LOPQ [18]           0.199  0.586  0.909
m-k-means-t1        0.775  0.917  0.928
m-k-means-n1        0.787  0.990  1

TABLE IX
Recall@R on DEEP1B. Comparison between our method (m-k-means-t1, m-k-means-n1 with n=24), the Inverted Multi-Index (IMI) [22], the Non-Orthogonal Inverted Multi-Index (NO-IMI) [22] and the Generalized Non-Orthogonal Inverted Multi-Index (GNO-IMI) [22].

Method         R@1    R@5    R@10
NO-IMI [22]    0.272  0.492  0.593
IMI [22]       0.241  0.450  0.545
GNO-IMI [22]   0.276  0.508  0.613
m-k-means-t1   0.694  0.892  0.912
m-k-means-n1   0.768  0.988  0.999
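All the comparisons in Tables VI–IX rank database signatures by Hamming distance to the query signature. A minimal linear-scan sketch under the same assumptions as the earlier snippet (packed uint8 codes; the names are ours):

```python
import numpy as np

# Per-byte popcount lookup table, built once: unpack each of the 256
# byte values into 8 bits and count the ones.
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                         axis=1).sum(axis=1)

def pack_codes(bit_codes):
    """Pack (num_items, num_bits) arrays of 0/1 into compact bytes."""
    return np.packbits(bit_codes, axis=1)

def hamming_rank(query_packed, db_packed):
    """Return database indices sorted by Hamming distance to the query:
    XOR the packed bytes, then popcount and sum per item."""
    dists = POPCOUNT[np.bitwise_xor(db_packed, query_packed)].sum(axis=1)
    return np.argsort(dists)
```

With 64-bit signatures this costs a handful of XOR operations and table lookups per database item, which is what keeps exhaustive Hamming scans practical even at the billion-vector scale reported above.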