Deep Learning of Binary Hash Codes for Fast Image Retrieval

Kevin Lin†, Huei-Fang Yang†, Jen-Hao Hsiao‡, Chu-Song Chen†



† Academia Sinica, Taiwan   ‡ Yahoo! Taiwan
{kevinlin311.tw,song}@iis.sinica.edu.tw, hfyang@citi.sinica.edu.tw, jenhaoh@yahoo-inc.com
https://github.com/kevinlin311tw/caffe-cvprw15

Abstract

Approximate nearest neighbor search is an efficient strategy for large-scale image retrieval. Encouraged by the recent advances in convolutional neural networks (CNNs), we propose an effective deep learning framework to generate binary hash codes for fast image retrieval. Our idea is that when the data labels are available, binary codes can be learned by employing a hidden layer for representing the latent concepts that dominate the class labels. The utilization of the CNN also allows for learning image representations. Unlike other supervised methods that require pairwise inputs for binary code learning, our method learns hash codes and image representations in a pointwise manner, making it suitable for large-scale datasets. Experimental results show that our method outperforms several state-of-the-art hashing algorithms on the CIFAR-10 and MNIST datasets. We further demonstrate its scalability and efficacy on a large-scale dataset of 1 million clothing images.

1. Introduction

Content-based image retrieval (CBIR) aims at searching for similar images through the analysis of image content; hence image representations and similarity measures become critical to such a task. Along this research track, one of the most challenging issues is associating pixel-level information with the semantics of human perception [25, 27]. Although several hand-crafted features have been proposed to represent images [19, 2, 22], the performance of these visual descriptors was limited until the recent breakthrough of deep learning. Recent studies [14, 7, 21, 23] have shown that deep CNNs significantly improve performance on various vision tasks, such as object detection, image classification, and segmentation. These accomplishments are attributed to the ability of deep CNNs to learn rich mid-level image representations.

As deep CNNs learn rich mid-level image descriptors, Krizhevsky et al. [14] used the feature vectors from the 7th layer for image retrieval and demonstrated outstanding performance on ImageNet. However, because the CNN features are high-dimensional and directly computing the similarity between two 4096-dimensional vectors is inefficient, Babenko et al. [1] proposed to compress the CNN features using PCA and discriminative dimensionality reduction, and obtained good performance.

In CBIR, both image representations and computational cost play an essential role. Due to the recent growth of visual content, rapid search in a large database has become an emerging need. Many studies aim at answering the question of how to efficiently retrieve relevant data from a large-scale database. Due to its high computational cost, traditional linear (exhaustive) search is not appropriate for searching in a large corpus. Instead, a practical strategy is to use approximate nearest neighbor (ANN) or hashing-based methods [6, 29, 18, 20, 15, 30] for speedup. These methods project the high-dimensional features to a lower-dimensional space and then generate compact binary codes. Benefiting from the produced binary codes, fast image search can be carried out via binary pattern matching or Hamming distance measurement, which dramatically reduces the computational cost and further improves search efficiency. Some of these methods are pairwise approaches that use a similarity matrix (containing the pairwise similarity of data) to describe the relationships between image or data pairs, and employ this similarity information to learn hash functions. However, it is demanding to construct the matrix and generate the codes when dealing with a large-scale dataset.

Inspired by the advancement of deep learning, we raise the question: can we take advantage of deep CNNs to achieve hashing? Instead of using pairwise learning methods, can we generate compact binary codes directly from a deep CNN? To address these questions, we propose a deep CNN model that can simultaneously learn image representations and binary codes, under

the assumption that the data are labeled. That is, our method is designed particularly for supervised learning. Furthermore, we argue that when a powerful learning model such as a deep CNN is used and the data labels are available, binary codes can be learned by employing a hidden layer in the architecture for representing the latent concepts (with binary-like activation functions such as the sigmoid) that dominate the class labels. This is different from other supervised methods (such as [30]) that take the data labels into consideration but require pairwise inputs to the learning process. In other words, our approach learns binary hash codes in a pointwise manner, taking advantage of the incremental learning nature (via stochastic gradient descent) of deep CNNs. The employment of a deep architecture also allows for learning features suited to efficient retrieval. Our method is thus more suitable for large datasets than conventional approaches.

Our method has the following characteristics:

• We introduce a simple yet effective supervised learning framework for rapid image retrieval.

• With small modifications to the network model, our deep CNN simultaneously learns domain-specific image representations and a set of hash-like functions for rapid image retrieval.

• The proposed method outperforms the state-of-the-art works on the public MNIST and CIFAR-10 datasets. Our model improves the previous best retrieval performance by 30% precision on CIFAR-10 and by 1% precision on MNIST.

• Our approach learns binary hash codes in a pointwise manner and is easily scalable to the data size, in contrast to conventional pairwise approaches.

This paper is organized as follows: We briefly review related work on hashing algorithms and image retrieval with deep learning in Section 2. We elaborate on the details of our method in Section 3. Finally, experimental results are provided in Section 4, followed by conclusions in Section 5.

2. Related Work

Several hashing algorithms [6, 29, 18, 20, 28, 10] have been proposed to approximately identify data relevant to a query. These approaches can be classified into two main categories, unsupervised and supervised methods.

Unsupervised hashing methods use unlabeled data to learn a set of hash functions [6, 29, 8]. The most representative one is Locality-Sensitive Hashing (LSH) [6], which aims at maximizing the probability that similar data are mapped to similar binary codes. LSH generates the binary codes by projecting the data points onto random hyperplanes with random thresholds. Spectral hashing (SH) [29] is another representative approach, which produces compact binary codes via thresholding with non-linear functions along the PCA directions of the given data.

Recent studies have shown that using supervised information can boost binary hash code learning performance. Supervised approaches [18, 20, 15] incorporate label information during learning. These supervised hashing methods usually use pairwise labels for generating effective hash functions. However, these algorithms generally require a large sparse matrix to describe the similarity between data points in the training set.

Besides the research track of hashing, image representations also play an essential role in CBIR. CNN-based visual descriptors have recently been applied to the task of image retrieval. Krizhevsky et al. [14] first used the features extracted from the seventh layer to retrieve images, and achieved impressive performance on ImageNet. Babenko et al. [1] focused on dimensionality reduction of the CNN features, and improved the retrieval performance with compressed CNN features. Though these recent works [14, 1] present good results on the task of image retrieval, the learned CNN features are employed for retrieval by directly performing pattern matching in the Euclidean space, which is inefficient.

Deep architectures have also been used for hash learning. However, most of them are unsupervised; deep autoencoders are used for learning the representations [24, 13]. Xia et al. [30] proposed a supervised hashing approach that learns binary hash codes for fast image retrieval through deep learning and demonstrated state-of-the-art retrieval performance on public datasets. However, in their pre-processing stage, a matrix-decomposition algorithm is used for learning the representation codes of the data. It thus requires a pairwise similarity matrix of the data as input and is unfavorable when the data size is large (e.g., 1M in our experiment) because it consumes considerable storage and computational time.

In contrast, we present a simple but efficient deep learning approach to learn a set of effective hash-like functions, and it achieves more favorable results on the publicly available datasets. We further apply our method to a large-scale dataset of 1 million clothing images to demonstrate the scalability of our approach. We describe the proposed method in the next section.

3. Method

Figure 1 shows the proposed framework. Our method includes three main components. The first component is supervised pre-training on the large-scale ImageNet dataset [14]. The second component is fine-tuning the network with a latent layer to simultaneously learn a domain-specific feature representation and a set of hash-like functions. The third retrieves images similar to the query via the proposed hierarchical deep search.
[Figure 1 diagram: Module 1, supervised pre-training on ImageNet (~1.2M images; CNN with F7, 4096 nodes, and F8, 1000 nodes); Module 2, fine-tuning on the target-domain dataset with parameter transfer and a latent layer H (h nodes) between F7 (4096 nodes) and F8 (n nodes); Module 3, image retrieval via hierarchical deep search (query image → coarse-level search over binary codes → candidate pool → fine-level search via similarity computation → results).]

Figure 1: The proposed image retrieval framework via hierarchical deep search. Our method consists of three main components. The first is the supervised pre-training of a convolutional neural network on ImageNet to learn rich mid-level image representations. In the second component, we add a latent layer to the network and have neurons in this layer learn hash-like representations while fine-tuning the network on the target domain dataset. The final stage is to retrieve similar images using a coarse-to-fine strategy that utilizes the learned hash-like binary codes and the F7 features.

trained CNN model proposed by Krizhevsky et al. [14] functions simultaneously. We assume that the final outputs
from the Caffe CNN library [11], which is trained on the of the classification layer F8 rely on a set of h hidden at-
large-scale ImageNet dataset which contains more than 1.2 tributes with each attribute on or off. In other points of view,
million images categorized into 1000 object classes. Our images inducing similar binary activations would have the
method for learning binary codes is described in detail as same label. To fulfill this idea, we embed the latent layer H
follows. between F7 and F8 as shown in the middle row of Figure 1.
The latent layer H is a fully connected layer, and its neuron
3.1. Learning Hash-like Binary Codes activities are regulated by the succeeding layer F8 that en-
Recent studies [14, 7, 5, 1] have shown that the fea- codes semantics and achieves classification. The proposed
ture activations of layers F6−8 induced by the input im- latent layer H not only provides an abstraction of the rich
age can serve as the visual signatures. The use of these features from F7 , but also bridges the mid-level features and
mid-level image representations demonstrates impressive the high-level semantics. In our design, the neurons in the
improvement on the task of image classification, retrieval, latent layer H are activated by sigmoid functions so the ac-
and others. However, these signatures are high-dimensional tivations are approximated to {0, 1}.
vectors that are inefficient for image retrieval in a large cor- To achieve domain adaptation, we fine-tune the proposed
pus. To facilitate efficient image retrieval, a practical way network on the target-domain dataset via back propagation.
to reduce the computational cost is to convert the feature The initial weights of the deep CNN are set as the weights
vectors to binary codes. Such binary compact codes can be trained from ImageNet dataset. The weights of the latent
quickly compared using hashing or Hamming distance. layer H and the final classification layer F8 are randomly
In this work, we propose to learn the domain specific im- initialized. The initial random weights of latent layer H
age representations and a set of hash-like (or binary coded) acts like LSH [6] which uses random projections for con-
structing the hashing bits. The codes are then adapted from
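To make the design concrete, the sketch below (ours, not the authors' released Caffe implementation, which is available at the repository linked above) shows one way to embed a sigmoid-activated latent layer H between F7 and F8 of an AlexNet-style network. The use of torchvision's AlexNet and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class LatentHashNet(nn.Module):
    """Sketch of Sec. 3.1: an AlexNet-style CNN with a latent layer H
    (h sigmoid units) inserted between F7 and a new classifier F8.
    Inputs are expected as 3x224x224 images, as for AlexNet."""

    def __init__(self, h=48, num_classes=10):
        super().__init__()
        base = models.alexnet(weights="IMAGENET1K_V1")  # ImageNet pre-training
        self.features = base.features
        self.avgpool = base.avgpool
        # Keep the pre-trained layers up to F7 (4096-d); drop the old F8.
        self.f6_f7 = nn.Sequential(*list(base.classifier.children())[:-1])
        # Latent layer H: randomly initialized; the sigmoid keeps its
        # activations in (0, 1), so they approximate binary codes.
        self.latent = nn.Sequential(nn.Linear(4096, h), nn.Sigmoid())
        # New classification layer F8 on top of H, randomly initialized.
        self.f8 = nn.Linear(h, num_classes)

    def forward(self, x):
        x = torch.flatten(self.avgpool(self.features(x)), 1)
        f7 = self.f6_f7(x)             # mid-level feature (F7)
        h_act = self.latent(f7)        # hash-like activations Out(H)
        return self.f8(h_act), h_act   # class scores and latent codes
```

Fine-tuning such a network with the usual softmax classification loss realizes the domain adaptation described next.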

To achieve domain adaptation, we fine-tune the proposed network on the target-domain dataset via back-propagation. The initial weights of the deep CNN are set to the weights trained on the ImageNet dataset. The weights of the latent layer H and the final classification layer F8 are randomly initialized. The initial random weights of the latent layer H act like LSH [6], which uses random projections for constructing hashing bits. The codes are then adapted from LSH to codes that suit the data better through supervised deep-network learning. Without dramatic modifications to a deep CNN model, the proposed model learns domain-specific visual descriptors and a set of hash-like functions simultaneously for efficient image retrieval.

3.2. Image Retrieval via Hierarchical Deep Search

Zeiler and Fergus [32] analyzed deep CNNs and showed that the shallow layers learn local visual descriptors while the deeper layers capture semantic information suitable for recognition. We adopt a coarse-to-fine search strategy for rapid and accurate image retrieval. We first retrieve a set of candidates with similar high-level semantics, that is, with similar hidden binary activations from the latent layer. Then, to further filter the images with similar appearance, similarity ranking is performed based on the deepest mid-level image representations.

Coarse-level Search. Given an image I, we first extract the outputs of the latent layer as the image signature, denoted by Out(H). The binary codes are then obtained by binarizing the activations with a threshold. For each bit j = 1, ..., h (where h is the number of nodes in the latent layer), we output the binary code of H by

H^j = 1 if Out_j(H) ≥ 0.5, and H^j = 0 otherwise.    (1)

Let Γ = {I_1, I_2, ..., I_n} denote the dataset consisting of n images for retrieval. The corresponding binary codes of the images are denoted as Γ_H = {H_1, H_2, ..., H_n} with H_i ∈ {0, 1}^h. Given a query image I_q and its binary code H_q, we identify a pool of m candidates, P = {I_1^c, I_2^c, ..., I_m^c}, whose codes H_i ∈ Γ_H have a Hamming distance to H_q lower than a threshold.

Fine-level Search. Given the query image I_q and the candidate pool P, we use the features extracted from layer F7 to identify the top k ranked images within P. Let V_q and V_i^P denote the feature vectors of the query image I_q and of the image I_i^c from the pool, respectively. We define the similarity between I_q and the i-th image of P via the Euclidean distance between their corresponding feature vectors,

s_i = ‖V_q − V_i^P‖.    (2)

The smaller the Euclidean distance, the higher the similarity of the two images. Each candidate I_i^c is ranked in ascending order of this distance; the top k ranked images are then identified.
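Putting Eqs. (1) and (2) together, the following NumPy sketch (our illustration; the array names, the Hamming-radius value, and k are assumptions rather than values from the paper) shows the full coarse-to-fine retrieval step:

```python
import numpy as np

def hierarchical_search(out_h_query, f7_query, out_h_db, f7_db,
                        hamming_radius=5, k=10):
    """Coarse-to-fine retrieval sketch of Sec. 3.2.

    out_h_query: (h,) latent activations Out(H) of the query image
    f7_query:    (4096,) F7 feature vector V_q of the query image
    out_h_db:    (n, h) latent activations of the n database images
    f7_db:       (n, 4096) F7 feature vectors of the database images
    """
    # Eq. (1): binarize the sigmoid activations with threshold 0.5.
    q_code = out_h_query >= 0.5
    db_codes = out_h_db >= 0.5

    # Coarse-level search: keep candidates within a Hamming radius.
    hamming = np.count_nonzero(db_codes != q_code, axis=1)
    candidates = np.flatnonzero(hamming <= hamming_radius)

    # Fine-level search, Eq. (2): Euclidean distances s_i between the
    # query's F7 feature and each candidate's, ranked in ascending order.
    s = np.linalg.norm(f7_db[candidates] - f7_query, axis=1)
    return candidates[np.argsort(s)[:k]]  # indices of the top-k images
```

The coarse Hamming filter involves only cheap bit comparisons over all n images, while the expensive Euclidean ranking runs only on the small candidate pool; this split is what makes the hierarchy fast.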
4. Experimental Results

In this section, we demonstrate the benefits of our approach. We start by introducing the datasets and then present our experimental results with performance comparisons to several state-of-the-art methods on the public MNIST and CIFAR-10 datasets. Finally, we verify the scalability and efficacy of our approach on the large-scale Yahoo-1M dataset.

4.1. Datasets

MNIST Dataset [16] consists of 10 categories of handwritten digits from 0 to 9. There are 60,000 training images and 10,000 test images. All the digits are normalized to gray-scale images of size 28 × 28.

CIFAR-10 Dataset [12] contains 10 object categories, and each class consists of 6,000 images, resulting in a total of 60,000 images. The dataset is split into training and test sets with 50,000 and 10,000 images, respectively.

Yahoo-1M Dataset contains a total of 1,124,087 shopping product images categorized into 116 clothing-specific classes. The dataset was collected by crawling images from the Yahoo shopping sites. All the images are labeled with a category, such as Top, Dress, Skirt, and so on. Figure 2 shows some examples from the dataset.

[Figure 2: Sample images from the Yahoo-1M Shopping Dataset (categories such as Top, Skirt, Heels, and Bag). The heterogeneous product images exhibit high variation and are challenging for image classification and retrieval.]

In the experiments on MNIST and CIFAR-10, we retrieve the relevant images using the learned binary codes in order to compare fairly with other hashing algorithms. In the experiments on the Yahoo-1M dataset, we retrieve similar images from the entire dataset via the hierarchical search.

4.2. Evaluation Metrics

We use a ranking-based criterion [4] for evaluation. Given a query image q and a similarity measure, a rank can be assigned to each dataset image. We evaluate the ranking of the top k images with respect to a query image q by the precision:

Precision@k = (Σ_{i=1}^{k} Rel(i)) / k,    (3)

where Rel(i) denotes the ground-truth relevance between the query q and the i-th ranked image. Here, we consider only the category label in measuring relevance, so Rel(i) ∈ {0, 1}, with 1 when the query and the i-th image have the same label and 0 otherwise.
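In code, Eq. (3) simply averages the binary relevance over the top k ranked images; a minimal sketch (ours), assuming category labels define Rel(i):

```python
import numpy as np

def precision_at_k(query_label, ranked_labels, k):
    """Eq. (3): fraction of the top-k ranked images whose category
    label matches the query's, i.e., Rel(i) in {0, 1}."""
    rel = np.asarray(ranked_labels[:k]) == query_label
    return rel.sum() / k
```

For example, if 9 of the top 10 retrieved images share the query's label, Precision@10 = 0.9.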
4.3. Results on MNIST Dataset

Performance of Image Classification. To adapt our deep CNN to the new domain, we modify layer F8 to a 10-way softmax to predict the 10 digit classes. To measure the effect of the latent layer embedded in the deep CNN, we set the number of neurons h in the latent layer to 48 and 128, respectively. We then apply stochastic gradient descent (SGD) to train the CNN on the MNIST dataset. The network is trained for 50,000 iterations with a learning rate of 0.001.

Table 1: Comparison of classification error rates (%) on the MNIST dataset.

Methods                        | Test Error (%)
2-Layer CNN + 2-Layer NN [31]  | 0.53
Stochastic Pooling [31]        | 0.47
NIN + Dropout [17]             | 0.47
Conv. maxout + Dropout [9]     | 0.45
Ours w/ 48-node latent layer   | 0.47
Ours w/ 128-node latent layer  | 0.50

We compare our results with several state-of-the-art methods [31, 17, 9] in Table 1. Our approach with 48 latent nodes attains a 0.47% error rate and performs favorably against most of the alternatives. It is worth noting that our model is designed particularly for image retrieval, whereas the others are optimized for a classification task through modifications of a network. For example, the work of [9] proposed the maxout activation function, which improves the accuracy of dropout's approximate model-averaging technique. Another representative work is Network in Network (NIN) [17], which enhances the discriminability of local patches via multilayer perceptrons and avoids overfitting by using global average pooling instead of fully connected layers. Also note that our method with 48 latent nodes yields a lower error rate than the model with 128 nodes. This may be because a few latent nodes suffice to represent the latent concepts for classification, and adding more neurons can cause overfitting.
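The fine-tuning recipe above (SGD, 50,000 iterations, learning rate 0.001) corresponds to a standard training loop. A minimal hedged sketch, reusing the LatentHashNet sketch from Section 3.1; the momentum value and the data loader are our assumptions, not reported settings:

```python
import torch
import torch.nn as nn

def finetune(model, loader, iterations=50_000, lr=0.001, device="cuda"):
    """Sketch of the fine-tuning step: plain SGD with a softmax
    (cross-entropy) loss over the class labels."""
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    it = 0
    while it < iterations:
        for images, labels in loader:
            if it >= iterations:
                break
            images, labels = images.to(device), labels.to(device)
            scores, _ = model(images)   # LatentHashNet returns (scores, codes)
            loss = loss_fn(scores, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
```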
Performance of Image Retrieval. In this experiment, we unify the retrieval evaluation by retrieving the relevant images using 48-bit binary codes and the Hamming distance measure. The retrieval is performed by randomly selecting 1,000 query images from the test set, for which the system retrieves relevant images from the training set.

To evaluate the retrieval performance, we compare the proposed method with several state-of-the-art hashing approaches, including supervised methods (KSH [18], MLH [20], BRE [15], CNNH [30], and CNNH+ [30]) and unsupervised methods (LSH [6], SH [29], and ITQ [8]). Figure 4 shows the retrieval precision of the different methods with respect to the number of retrieved images. As can be seen, our method demonstrates stable performance (98.2 ± 0.3% retrieval precision) regardless of the number of images retrieved. Furthermore, our approach improves the precision to 98.5% from the 97.5% achieved by CNNH+ [30], which learns the hashing functions via decomposition of the pairwise similarity information. This improvement indicates that our pointwise method, which requires only class labels, is effective.

[Figure 4: Image retrieval precision on the MNIST dataset with 48-bit codes, comparing Ours, CNNH+, CNNH, KSH, ITQ-CCA, MLH, BRE, ITQ, SH, and LSH over the number of top images retrieved (200–1000).]

We further analyze the quality of the learned hash-like codes for h = 48 and h = 128, respectively, as shown in Figure 3. As can be seen, both settings learn informative binary codes for image retrieval.

[Figure 3: Top 10 images retrieved from the MNIST dataset with 48-bit and 128-bit latent binary codes (example queries: "Two" and "Six"). More relevant images with similar appearance are retrieved as the number of bits increases.]

4.4. Results on CIFAR-10 Dataset

Performance of Image Classification. To transfer the deep CNN to the domain of CIFAR-10, we modify F8 to a 10-way softmax to predict the 10 object categories, and h is again set to 48 and 128. We then fine-tune our network model on the CIFAR-10 dataset and finally achieve around 89% test accuracy after 50,000 training iterations. As shown in Table 2, the proposed method compares favorably with most approaches [31, 26, 3, 14, 17], which indicates that embedding the binary latent layer in the deep CNN does not severely alter the performance.

Table 2: Comparison of classification accuracy (%) on the CIFAR-10 dataset.

Methods                             | Accuracy (%)
Stochastic Pooling [31]             | 84.87
CNN + Spearmint [26]                | 85.02
MCDNN [3]                           | 88.79
AlexNet + Fine-tuning [14]          | 89
NIN + Dropout [17]                  | 89.59
NIN + Dropout + Augmentation [17]   | 91.2
Ours w/ 48-node latent layer        | 89.4
Ours w/ 128-node latent layer       | 89.6
Performance of Image Retrieval. For a fair comparison with other hashing algorithms, we unify the evaluation method by retrieving the relevant images with 48-bit binary codes and the Hamming distance. Figure 6 shows the precision curves with respect to the number of top retrieved samples. Our approach achieves better performance than the other unsupervised and supervised methods. Moreover, it attains a precision of around 89% across varying numbers of retrieved images, improving the performance by a margin of 30% compared to CNNH+ [30]. These results suggest that using a latent layer for representing the hidden concepts is a practical approach to learning efficient binary codes.

[Figure 6: Image retrieval precision on the CIFAR-10 dataset with 48-bit codes, comparing Ours, CNNH+, CNNH, KSH, ITQ-CCA, MLH, BRE, ITQ, SH, and LSH over the number of top images retrieved (200–1000).]

Figure 5 shows our retrieval results. The proposed latent binary codes successfully retrieve images of the relevant category, of similar appearance, or both. Increasing the number of bits from h = 48 to h = 128 retrieves more appearance-relevant images according to our empirical visual inspection. For example, in Figure 5, the 128-bit binary codes tend to retrieve more relevant horse-head images (instead of entire horses) than the 48-bit codes.

[Figure 5: Top 10 images retrieved from CIFAR-10 with 48-bit and 128-bit latent binary codes (example queries: "Airplane" and "Horse"). More relevant images with similar appearance are retrieved as the number of bits increases.]

4.5. Results on Yahoo-1M Dataset

Performance of Image Classification. To show the scalability and efficacy of our method, we further test it on the large-scale Yahoo-1M dataset. This dataset consists of plentiful product images that are heterogeneous and vary in human pose, often with noisy backgrounds.

We set the number of neurons in the classification layer to 116, and h in the latent layer to 128. We then fine-tune our network on the entire Yahoo-1M dataset. After 750,000 training iterations, our proposed approach achieves 83.75% accuracy (obtained from the final layer) on the 116-category clothing classification task. As shown in Figure 7, whether the clothing images have clean or noisy backgrounds, and are shown with or without a human model, the proposed method demonstrates good classification performance. Note that some of the images are mispredicted because the products can be ambiguous between specific categories. For example, it may be difficult to distinguish between Mary Janes and Flats, as shown in Figure 7. However, our method can still retrieve images similar to the query image.

[Figure 7: Image classification results on the Yahoo-1M dataset (example ground-truth labels: Dress, Camis, Top, Coat, Mary Janes, Top; top prediction scores include 56% Dress, 76% Camis, 72% Top, 98% Coat, 34% Flats, 99% Top). The first row indicates the ground-truth label; the bars below depict the prediction scores sorted in ascending order. Red and blue bars represent the correct and incorrect predictions, respectively.]

Performance of Image Retrieval. In this experiment, we demonstrate that our method can learn efficient deep binary codes for a dataset of a million images. This is demanding to achieve with previous pairwise approaches due to their large time and storage complexity.

Because image representations are critical to image retrieval, we compare the retrieval results obtained with features from different network modes: (1) AlexNet: F7 features from the pre-trained CNN [14]; (2) Ours-ES: F7 features from our network; (3) Ours-BCS: latent binary codes from our network; and (4) Ours-HDS: F7 features and latent binary codes from our network.

[Figure 8: Image retrieval precision on the Yahoo-1M dataset, comparing Ours-HDS, Ours-BCS, Ours-ES, and the AlexNet feature over the number of top images retrieved (5–40).]

We conduct exhaustive search (i.e., linear search) based on the L2-norm distance when the F7 features are used in
retrieval; hashing is performed based on the Hamming distance when the binary codes of the latent layer are used; and the coarse-to-fine hierarchical search is performed to retrieve relevant images using both the latent layer codes and F7. We randomly select 1,000 images from the Yahoo-1M dataset and retrieve the relevant images from the same dataset.

Figure 8 shows the precision with respect to the number of top images retrieved using the different CNN features. The proposed methods perform more favorably than the original AlexNet feature. Apparently, the fine-tuning procedure successfully transfers the deep CNN to the new domain (clothing images). Among the fine-tuned models, Ours-ES and Ours-HDS show good retrieval precision at first. However, Ours-BCS outperforms Ours-ES with higher and more stable retrieval precision when more than 12 images are retrieved. This indicates that the learned binary codes are informative and highly discriminative. Ours-HDS complements both Ours-BCS and Ours-ES and achieves the best overall retrieval precision.

[Figure 9: Top 5 images retrieved from the Yahoo-1M dataset by different features (Ours-HDS, Ours-BCS, Ours-ES, and AlexNet; example queries: Denim Jacket and Mary Janes). The blue check marks indicate that the query and retrieved images share the same label; the black crosses indicate otherwise.]

Figure 9 shows the top 5 images retrieved by the different features. As can be seen, AlexNet retrieves images with great diversity. The fine-tuned models retrieve more images with the same label as the query than AlexNet. Ours-HDS, Ours-BCS, and Ours-ES demonstrate good performance and successfully retrieve similar products. Moreover, benefiting from the binary codes, Ours-BCS achieves the fastest search among the compared approaches. Extracting CNN features takes around 60 milliseconds (ms) on a machine with a GeForce GTX 780 GPU and 3 GB of memory. The search is carried out in CPU mode with a C/C++ implementation. Computing a Euclidean distance between two 4096-dimensional vectors takes 109.767 ms, whereas computing the Hamming distance between two 128-bit binary codes takes 0.113 ms. Thus, Ours-BCS is 971.3× faster than traditional exhaustive search with 4096-dimensional features.
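The gap above (109.767 ms vs. 0.113 ms, i.e., the reported 971.3× speedup) reflects that a Hamming distance over packed bits needs only XOR and popcount operations rather than thousands of floating-point multiplications. The paper's search code is C/C++; the NumPy sketch below is our illustration of the same trick, with all names our own:

```python
import numpy as np

# 8-bit popcount lookup table: POPCOUNT[b] = number of set bits in byte b.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)

def pack_codes(codes01):
    """Pack an (n, 128) array of {0, 1} bits into (n, 16) uint8 codes."""
    return np.packbits(codes01.astype(np.uint8), axis=1)

def hamming_packed(query, database):
    """Hamming distances between one packed query code and n packed
    database codes: XOR exposes the differing bits, popcount counts them."""
    diff = np.bitwise_xor(database, query)
    return POPCOUNT[diff].sum(axis=1).astype(np.int32)
```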
5. Conclusions

We present a simple yet effective deep learning framework for creating hash-like binary codes for fast image retrieval. We add a latent-attribute layer to the deep CNN to simultaneously learn domain-specific image representations and a set of hash-like functions. Our method does not rely on pairwise similarities of data and is highly scalable to the dataset size. Experimental results show that, with only a simple modification of the deep CNN, our method improves the previous best retrieval results by 1% and 30% retrieval precision on the MNIST and CIFAR-10 datasets, respectively. We further demonstrate the scalability and efficacy of the proposed approach on the large-scale dataset of 1 million shopping images.

Acknowledgement: This work was supported in part by the Ministry of Science and Technology of Taiwan under Contract MOST 103-2221-E-001-010.

References

[1] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In Proc. ECCV, pages 584–599. Springer, 2014.
[2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Proc. ECCV, pages 404–417. Springer, 2006.
[3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, pages 3642–3649. IEEE, 2012.
[4] J. Deng, A. C. Berg, and F.-F. Li. Hierarchical semantic indexing for large scale image retrieval. In Proc. CVPR, 2011.
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proc. ICML, 2014.
[6] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[8] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. CVPR, pages 817–824, 2011.
[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
[10] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proc. CVPR, pages 1–8, 2008.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] A. Krizhevsky. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Report, 2009.
[13] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[15] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, pages 1042–1050, 2009.
[16] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In Proc. ICLR, 2014.
[18] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. CVPR, pages 2074–2081, 2012.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[20] M. Norouzi and D. M. Blei. Minimal loss hashing for compact binary codes. In Proc. ICML, pages 353–360, 2011.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. CVPR, 2014.
[22] G. Qiu. Indexing chromatic and achromatic patterns for content-based colour image retrieval. PR, 35(8):1675–1686, 2002.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proc. CVPRW, pages 512–519. IEEE, 2014.
[24] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
[25] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349–1380, 2000.
[26] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS, pages 2951–2959, 2012.
[27] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proc. ACM MM, pages 157–166, 2014.
[28] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In Proc. CVPR, pages 3424–3431, 2010.
[29] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. NIPS, pages 1753–1760, 2009.
[30] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proc. AAAI, 2014.
[31] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
[32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, pages 818–833. Springer, 2014.
