Article cvprw15
Abstract

Approximate nearest neighbor search is an efficient strategy for large-scale image
retrieval. Encouraged by the recent advances in convolutional neural networks (CNNs),
we propose an effective deep learning framework to generate binary hash codes for fast
image retrieval. Our idea is that when the data labels are available, binary codes can be
learned by employing a hidden layer for representing the latent concepts that dominate
the class labels. The utilization of the CNN also allows for learning image
representations. Unlike other supervised methods that require pairwise inputs for binary
code learning, our method learns hash codes and image representations in a pointwise
manner, making it suitable for large-scale datasets. Experimental results show that our
method outperforms several state-of-the-art hashing algorithms on the CIFAR-10 and MNIST
datasets. We further demonstrate its scalability and efficacy on a large-scale dataset of
1 million clothing images.

1. Introduction

Content-based image retrieval (CBIR) aims at searching for similar images through the
analysis of image content; hence image representations and similarity measures become
critical to such a task. Along this research track, one of the most challenging issues is
associating the pixel-level information with the semantics from human perception [25, 27].
Although several hand-crafted features have been proposed to represent images
[19, 2, 22], the performance of these visual descriptors remained limited until the recent
breakthrough of deep learning. Recent studies [14, 7, 21, 23] have shown that deep CNNs
significantly improve the performance on various vision tasks, such as object detection,
image classification, and segmentation. These accomplishments are attributed to the
ability of deep CNNs to learn rich mid-level image representations.

As deep CNNs learn rich mid-level image descriptors, Krizhevsky et al. [14] used the
feature vectors from the 7th layer in image retrieval and demonstrated outstanding
performance on ImageNet. However, because the CNN features are high-dimensional and
directly computing the similarity between two 4096-dimensional vectors is inefficient,
Babenko et al. [1] proposed to compress the CNN features using PCA and discriminative
dimensionality reduction, and obtained good performance.

In CBIR, both image representations and computational cost play an essential role. Due to
the recent growth of visual content, rapid search in a large database has become an
emerging need. Many studies aim at answering the question of how to efficiently retrieve
the relevant data from a large-scale database. Due to its high computational cost,
traditional linear search (or exhaustive search) is not appropriate for searching in a
large corpus. Instead of linear search, a practical strategy is to use Approximate Nearest
Neighbor (ANN) or hashing-based methods [6, 29, 18, 20, 15, 30] for speedup. These methods
project the high-dimensional features to a lower-dimensional space, and then generate
compact binary codes. Benefiting from the produced binary codes, fast image search can be
carried out via binary pattern matching or Hamming distance measurement, which
dramatically reduces the computational cost and further optimizes the efficiency of the
search. Some of these methods are pairwise methods that use a similarity matrix
(containing the pairwise similarity of data) to describe the relationship of the image
pairs or data pairs, and employ this similarity information to learn hash functions.
However, it is demanding to construct the matrix and generate the codes when dealing with
a large-scale dataset.

Inspired by the advancement of deep learning, we raise two questions: can we take
advantage of deep CNNs to achieve hashing? Instead of using a pairwise learning method,
can we generate compact binary codes directly from the deep CNN? To address these
questions, we propose a deep CNN model that can simultaneously learn image representations
and binary codes, under
the assumption that the data are labeled. That is, our method is designed particularly for
supervised learning. Furthermore, we argue that when a powerful learning model such as a
deep CNN is used and the data labels are available, the binary codes can be learned by
employing some hidden layer for representing the latent concepts (with binary activation
functions such as sigmoid) that dominate the class labels in the architecture. This is
different from other supervised methods (such as [30]) that take the data labels into
consideration but require pairwise inputs to the prepared learning process. In other
words, our approach learns binary hashing codes in a pointwise manner, taking advantage of
the incremental learning nature (via stochastic gradient descent) of deep CNNs. The
employment of a deep architecture also allows for efficient-retrieval feature learning.
Our method is suitable for large datasets in comparison with conventional approaches.

Our method has the following characteristics:

• We introduce a simple yet effective supervised learning framework for rapid image
  retrieval.
• With small modifications to the network model, our deep CNN simultaneously learns
  domain-specific image representations and a set of hashing-like functions for rapid
  image retrieval.
• The proposed method outperforms all of the state-of-the-art works on the public
  datasets MNIST and CIFAR-10. Our model improves the previous best retrieval performance
  on the CIFAR-10 dataset by 30% precision, and on the MNIST dataset by 1% precision.
• Our approach learns binary hashing codes in a pointwise manner and is easily scalable
  to the data size in comparison with conventional pairwise approaches.

This paper is organized as follows: We briefly review the related work on hashing
algorithms and image retrieval with deep learning in Section 2. We elaborate on the
details of our method in Section 3. Finally, experimental results are provided in
Section 4, followed by conclusions in Section 5.

2. Related Work

Several hashing algorithms [6, 29, 18, 20, 28, 10] have been proposed to approximately
identify data relevant to the query. These approaches can be classified into two main
categories, unsupervised and supervised methods.

Unsupervised hashing methods use unlabeled data to learn a set of hash functions
[6, 29, 8]. The most representative one is Locality-Sensitive Hashing (LSH) [6], which
aims at maximizing the probability that similar data are mapped to similar binary codes.
LSH generates the binary codes by projecting the data points to a random hyperplane with
a random threshold. Spectral hashing (SH) [29] is another representative approach, which
produces compact binary codes via thresholding with non-linear functions along the PCA
direction of the given data.

Recent studies have shown that using supervised information can boost the performance of
binary hash code learning. Supervised approaches [18, 20, 15] incorporate label
information during learning. These supervised hashing methods usually use pairwise labels
for generating effective hash functions. However, these algorithms generally require a
large sparse matrix to describe the similarity between data points in the training set.

Besides the research track of hashing, image representations also play an essential role
in CBIR. CNN-based visual descriptors have recently been applied to the task of image
retrieval. Krizhevsky et al. [14] first use the features extracted from the seventh layer
to retrieve images, and achieve impressive performance on ImageNet. Babenko et al. [1]
focus on dimensionality reduction of the CNN features, and improve the retrieval
performance with compressed CNN features. Though these recent works [14, 1] present good
results on the task of image retrieval, the learned CNN features are employed for
retrieval by directly performing pattern matching in the Euclidean space, which is
inefficient.

Deep architectures have been used for hash learning. However, most of them are
unsupervised, where deep auto-encoders are used for learning the representations [24, 13].
Xia et al. [30] propose a supervised hashing approach to learn binary hashing codes for
fast image retrieval through deep learning and demonstrate state-of-the-art retrieval
performance on public datasets. However, in their pre-processing stage, a
matrix-decomposition algorithm is used for learning the representation codes for data. It
thus requires the input of a pairwise similarity matrix of the data and is unfavorable
when the data size is large (e.g., 1M in our experiment) because it consumes considerable
storage and computational time.

In contrast, we present a simple but efficient deep learning approach to learn a set of
effective hash-like functions, and it achieves more favorable results on the publicly
available datasets. We further apply our method to a large-scale dataset of 1 million
clothing images to demonstrate the scalability of our approach. We describe the proposed
method in the next section.

3. Method

Figure 1 shows the proposed framework. Our method includes three main components. The
first component is the supervised pre-training on the large-scale ImageNet dataset [14].
The second component is fine-tuning the network with the latent layer to simultaneously
learn domain-specific feature representations and a set of hash-like functions. The third
retrieves images similar to the query one via the proposed hierarchical deep search. We
use the pre-
[Figure 1 diagram labels: Module 1: Supervised Pre-Training on ImageNet (~1.2M images); CNN with layers F7 (4096 nodes) and F8 (1000 nodes); Query with example binary code 101010.]
Figure 1: The proposed image retrieval framework via hierarchical deep search. Our method consists of three main
components. The first is the supervised pre-training of a convolutional neural network on ImageNet to learn rich mid-level
image representations. In the second component, we add a latent layer to the network and have neurons in this layer learn
hash-like representations while fine-tuning it on the target domain dataset. The final stage is to retrieve similar images
using a coarse-to-fine strategy that utilizes the learned hash-like binary codes and F7 features.
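As an illustration of the second and third components named in the caption, the following minimal Python sketch binarizes latent-layer activations at 0.5, builds a coarse candidate pool by Hamming distance, and re-ranks that pool by Euclidean distance on F7-style features. It is a sketch under stated assumptions, not the authors' implementation: the function names, the toy 4-bit codes, the 2-dimensional stand-ins for the 4096-dimensional F7 vectors, and the values of `hamming_threshold` and `k` are all hypothetical.

```python
import math


def binarize(latent_activations, threshold=0.5):
    """Turn sigmoid activations of the latent layer into a 0/1 code."""
    return [1 if a >= threshold else 0 for a in latent_activations]


def hamming_distance(code_a, code_b):
    """Count the bit positions in which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(code_a, code_b))


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hierarchical_search(query_code, query_feat, db_codes, db_feats,
                        hamming_threshold, k):
    """Coarse-to-fine search; returns database indices of the top-k images."""
    # Coarse level: keep images whose binary code is close to the query code.
    pool = [i for i, code in enumerate(db_codes)
            if hamming_distance(query_code, code) < hamming_threshold]
    # Fine level: rank the pool in ascending order of feature distance.
    pool.sort(key=lambda i: euclidean_distance(query_feat, db_feats[i]))
    return pool[:k]


# Toy data (hypothetical): 4 latent nodes, 2-D stand-ins for F7 features.
db_activations = [[0.9, 0.1, 0.8, 0.2],
                  [0.7, 0.4, 0.6, 0.1],
                  [0.2, 0.9, 0.1, 0.8],
                  [0.6, 0.2, 0.3, 0.1]]
db_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.2]]
db_codes = [binarize(a) for a in db_activations]

query_code = binarize([0.8, 0.3, 0.7, 0.4])
top = hierarchical_search(query_code, [1.0, 0.1], db_codes, db_feats,
                          hamming_threshold=2, k=2)
print(top)
```

In this toy run the third database image is discarded at the coarse level because its code differs from the query code in all four bits; only the surviving candidates pay the cost of the floating-point distance computation.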
trained CNN model proposed by Krizhevsky et al. [14] from the Caffe CNN library [11],
which is trained on the large-scale ImageNet dataset containing more than 1.2 million
images categorized into 1000 object classes. Our method for learning binary codes is
described in detail as follows.

3.1. Learning Hash-like Binary Codes

Recent studies [14, 7, 5, 1] have shown that the feature activations of layers F6 to F8
induced by the input image can serve as visual signatures. The use of these mid-level
image representations demonstrates impressive improvement on the tasks of image
classification, retrieval, and others. However, these signatures are high-dimensional
vectors that are inefficient for image retrieval in a large corpus. To facilitate
efficient image retrieval, a practical way to reduce the computational cost is to convert
the feature vectors to binary codes. Such compact binary codes can be quickly compared
using hashing or Hamming distance.

In this work, we propose to learn the domain-specific image representations and a set of
hash-like (or binary coded) functions simultaneously. We assume that the final outputs of
the classification layer F8 rely on a set of h hidden attributes, each of which is either
on or off. From another point of view, images inducing similar binary activations would
have the same label. To fulfill this idea, we embed the latent layer H between F7 and F8
as shown in the middle row of Figure 1. The latent layer H is a fully connected layer,
and its neuron activities are regulated by the succeeding layer F8 that encodes semantics
and achieves classification. The proposed latent layer H not only provides an abstraction
of the rich features from F7, but also bridges the mid-level features and the high-level
semantics. In our design, the neurons in the latent layer H are activated by sigmoid
functions so the activations are approximated to {0, 1}.

To achieve domain adaptation, we fine-tune the proposed network on the target-domain
dataset via back-propagation. The initial weights of the deep CNN are set to the weights
trained on the ImageNet dataset. The weights of the latent layer H and the final
classification layer F8 are randomly initialized. The initial random weights of the
latent layer H act like LSH [6], which uses random projections for constructing the
hashing bits. The codes are then adapted from
LSH to those that suit the data better from supervised deep-network learning. Without
dramatic modifications to a deep CNN model, the proposed model learns domain-specific
visual descriptors and a set of hashing-like functions simultaneously for efficient image
retrieval.

3.2. Image Retrieval via Hierarchical Deep Search

Zeiler and Fergus [32] analyzed the deep CNN and showed that the shallow layers learn
local visual descriptors while the deeper layers of the CNN capture the semantic
information suitable for recognition. We adopt a coarse-to-fine search strategy for rapid
and accurate image retrieval. We first retrieve a set of candidates with similar
high-level semantics, that is, with similar hidden binary activations from the latent
layer. Then, to further filter the images with similar appearance, similarity ranking is
performed based on the deepest mid-level image representations.

Coarse-level Search. Given an image I, we first extract the outputs of the latent layer
as the image signature, which is denoted by Out(H). The binary codes are then obtained by
binarizing the activations with a threshold. For each bit j = 1, ..., h (where h is the
number of nodes in the latent layer), we output the binary codes of H by

$$H^j = \begin{cases} 1 & \text{if } Out^j(H) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Let $\Gamma = \{I_1, I_2, \ldots, I_n\}$ denote the dataset consisting of n images for
retrieval. The corresponding binary codes of the images are denoted as
$\Gamma_H = \{H_1, H_2, \ldots, H_n\}$ with $H_i \in \{0, 1\}^h$. Given a query image
$I_q$ and its binary codes $H_q$, we identify a pool of m candidates,
$P = \{I_1^c, I_2^c, \ldots, I_m^c\}$, if the Hamming distance between $H_q$ and
$H_i \in \Gamma_H$ is lower than a threshold.

Fine-level Search. Given the query image $I_q$ and the candidate pool P, we use the
features extracted from the layer F7 to identify the top k ranked images from the
candidate pool P. Let $V_q$ and $V_i^P$ denote the feature vectors of the query image q
and of the image $I_i^c$ from the pool, respectively. We define the similarity level
between $I_q$ and the i-th image of P as the Euclidean distance between their
corresponding feature vectors,

$$s_i = \lVert V_q - V_i^P \rVert. \tag{2}$$

The smaller the Euclidean distance is, the higher the similarity of the two images is.
Each candidate $I_i^c$ is ranked in ascending order of this distance; hence, the top k
ranked images are identified.

[Figure 2 image panels labeled: Top, Skirt, Heels, Bag.]
Figure 2: Sample images from the Yahoo-1M Shopping Dataset. The heterogeneous product
images exhibit high variation, and are challenging for image classification and retrieval.

4. Experimental Results

In this section, we demonstrate the benefits of our approach. We start by introducing the
datasets and then present our experimental results with performance comparison to several
state-of-the-art methods on the public MNIST and CIFAR-10 datasets. Finally, we verify
the scalability and the efficacy of our approach on the large-scale Yahoo-1M dataset.

4.1. Datasets

MNIST Dataset [16] consists of 10 categories of handwritten digits from 0 to 9. There are
60,000 training images and 10,000 test images. All the digits are normalized to
gray-scale images of size 28 × 28.

CIFAR-10 Dataset [12] contains 10 object categories and each class consists of 6,000
images, resulting in a total of 60,000 images. The dataset is split into training and
test sets, with 50,000 and 10,000 images respectively.

Yahoo-1M Dataset contains a total of 1,124,087 shopping product images, categorized into
116 clothing-specific classes. The dataset is collected by crawling the images from the
Yahoo shopping sites. All the images are labeled with a category, such as Top, Dress,
Skirt and so on. Figure 2 shows some examples of the dataset.

In the experiments on MNIST and CIFAR-10, we retrieve the relevant images using the
learned binary codes in order to compare fairly with other hashing algorithms. In the
experiments on the Yahoo-1M dataset, we retrieve similar images from the entire dataset
via the hierarchical search.

4.2. Evaluation Metrics

We use a ranking-based criterion [4] for evaluation. Given a query image q and a
similarity measure, a rank can
Figure 3: Top 10 retrieved images from the MNIST dataset when varying the number of bits of the latent binary codes (panel
columns: Query Image, Top 10 Retrieved Images; query shown: Six). Relevant images with similar appearance are retrieved as
the number of bits increases.
Figure 5: Top 10 retrieved images from CIFAR-10 when varying the number of bits of the latent binary codes (query shown:
Horse). Relevant images with similar appearance are retrieved as the number of bits increases.
Figure 7: Image classification results on the Yahoo-1M dataset. The first row indicates the ground truth label. The bars
below depict the prediction scores sorted in ascending order. Red and blue bars represent the correct and incorrect
predictions, respectively.
Performance of Image Retrieval. For a fair comparison with other hashing algorithms, we
unify the evaluation method so that the relevant images are retrieved using 48-bit binary
codes and Hamming distance. Figure 6 shows the precision curves with respect to the
number of top retrieved samples. Our approach achieves better performance than the other
unsupervised and supervised methods.

[Plot: Image Retrieval Precision of Yahoo-1M Dataset. Vertical axis: Precision (ticks 0.7 to 1). Curves: Ours-HDS, Ours-BCS, Ours-ES, AlexNet feature.]
[Figure 9 image rows labeled: Ours-HDS, Ours-BCS, Ours-ES, AlexNet.]
Figure 9: Top 5 retrieved images from Yahoo-1M dataset by different features. The blue check marks indicate the query and
retrieved images share the same label; the black crosses indicate otherwise.
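The per-image marks described in the caption can be summarized as a single number. The short Python sketch below assumes the common definition of precision at k, namely the fraction of the top k retrieved images whose label equals the query label; this definition, the function name `precision_at_k`, and the toy label list are assumptions made for illustration and are not results from the experiments.

```python
def precision_at_k(query_label, retrieved_labels, k):
    """Fraction of the top-k retrieved items that share the query's label."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_labels[:k]
    hits = sum(1 for label in top_k if label == query_label)
    return hits / k


# Hypothetical ranked labels for one query with label "Top".
ranked = ["Top", "Top", "Skirt", "Top", "Top"]
print(precision_at_k("Top", ranked, 5))  # 4 of 5 match -> 0.8
```

Averaging this value over all queries for each k would give one point of a precision curve of the kind plotted against the number of top retrieved images.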
retrieval; the hashing is performed based on Hamming distance when the binary codes of
the latent layer are used; the coarse-to-fine hierarchical search is performed to
retrieve relevant images by using both the latent layer codes and F7. We randomly select
1000 images from the Yahoo-1M dataset, and retrieve the relevant images from the same
dataset.

Figure 8 shows the precision with respect to the number of top images retrieved using
different CNN features. The proposed methods perform more favorably than the original
AlexNet feature. Apparently, the procedure of fine-tuning successfully transfers the deep
CNN to the new domain (clothing images). Among the fine-tuned models, Ours-ES and
Ours-HDS show good retrieval precision at first. However, Ours-BCS outperforms Ours-ES
with higher and more stable retrieval precision when more than 12 images are retrieved.
This indicates that the learned binary codes are informative and have high discriminative
power. Ours-HDS complements both Ours-BCS and Ours-ES and achieves the best retrieval
precision overall.

Figure 9 shows the top 5 images retrieved by different features. As can be seen, AlexNet
retrieves images with great diversity. The fine-tuned models retrieve more images with
the same label as the query than AlexNet. Ours-HDS, Ours-BCS, and Ours-ES demonstrate
good performance, and successfully retrieve similar products. Nevertheless, benefiting
from the binary codes, Ours-BCS achieves the fastest search among the approaches
compared. Extracting CNN features takes around 60 milliseconds (ms) on a machine with a
GeForce GTX 780 GPU and 3 GB memory. The search is carried out in CPU mode with a C/C++
implementation. Performing a Euclidean distance measure between two 4096-dimensional
vectors takes 109.767 ms. In contrast, computing the Hamming distance between two 128-bit
binary codes takes 0.113 ms. Thus, Ours-BCS is 971.3x faster than traditional exhaustive
search with 4096-dimensional features.

5. Conclusions

We present a simple yet effective deep learning framework to create hash-like binary
codes for fast image retrieval. We add a latent-attribute layer in the deep CNN to
simultaneously learn domain-specific image representations and a set of hash-like
functions. Our method does not rely on pairwise similarities of data and is highly
scalable to the dataset size. Experimental results show that, with only a simple
modification of the deep CNN, our method improves the previous best retrieval results by
1% and 30% retrieval precision on the MNIST and CIFAR-10 datasets, respectively. We
further demonstrate the scalability and efficacy of the proposed approach on the
large-scale dataset of 1 million shopping images.

Acknowledgement: This work was supported in part by the Ministry of Science and
Technology of Taiwan under Contract MOST 103-2221-E-001-010.

References

[1] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In Proc. ECCV, pages 584–599. Springer, 2014.
[2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Proc. ECCV, pages 404–417. Springer, 2006.
[3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, pages 3642–3649. IEEE, 2012.
[4] J. Deng, A. C. Berg, and F.-F. Li. Hierarchical semantic indexing for large scale image retrieval. In Proc. CVPR, 2011.
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proc. ICML, 2014.
[6] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[8] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. CVPR, pages 817–824, 2011.
[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
[10] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proc. CVPR, pages 1–8, 2008.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] A. Krizhevsky. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Report, 2009.
[13] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[15] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, pages 1042–1050, 2009.
[16] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In Proc. ICLR, 2014.
[18] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. CVPR, pages 2074–2081, 2012.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[20] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proc. ICML, pages 353–360, 2011.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. CVPR, 2014.
[22] G. Qiu. Indexing chromatic and achromatic patterns for content-based colour image retrieval. PR, 35(8):1675–1686, 2002.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proc. CVPRW, pages 512–519. IEEE, 2014.
[24] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 500(3):500, 2007.
[25] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349–1380, 2000.
[26] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS, pages 2951–2959, 2012.
[27] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proc. ACM MM, pages 157–166, 2014.
[28] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In Proc. CVPR, pages 3424–3431, 2010.
[29] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. NIPS, pages 1753–1760, 2009.
[30] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proc. AAAI, 2014.
[31] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
[32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, pages 818–833. Springer, 2014.