Pinterest Visual Search
Yushi Jing¹, David Liu¹, Dmitry Kislyuk¹, Andrew Zhai¹, Jiajing Xu¹, Jeff Donahue¹,², Sarah Tavel¹
¹Visual Discovery, Pinterest
²University of California, Berkeley
{jing, dliu, dkislyuk, andrew, jiajing, jdonahue, sarah}@pinterest.com
ABSTRACT
Keywords
visual search, visual shopping, open source

1. INTRODUCTION
Visual search, or content-based image retrieval [5], is an active research area driven in part by the explosive growth of online photos and the popularity of search engines. Google Goggles, Google Similar Images and Amazon Flow are several examples of commercial visual search systems. Although significant progress has been made in building Web-scale visual search systems, there are few publications describing end-to-end architectures deployed on commercial applications. This is in part due to the complexity of real-world visual search systems, and in part due to business considerations to keep core search technology proprietary.

We faced two main challenges in deploying a commercial visual search system at Pinterest. First, as a startup we needed to control the development cost in the form of both human and computational resources. For example, feature computation can become expensive with a large and continuously growing image collection, and with engineers constantly experimenting with new features to deploy, it is vital for our system to be both scalable and cost effective. Second, the success of a commercial application is measured by the benefit it brings to users (e.g. improved user engagement) relative to the cost of development and maintenance. As a result, our development progress needs to be frequently validated through A/B experiments with live user traffic.
We evaluate the effectiveness of each component of our system in isolation. After deploying the end-to-end system, we use A/B tests to measure user engagement on live traffic.
Related Pins (Figure 2) is a feature that recommends Pins
based on the Pin the user is currently viewing. These rec-
ommendations are primarily generated from the “curation
graph” of users, boards, and Pins. However, there is a long
tail of less popular Pins without recommendations. Using
visual search, we generate recommendations for almost all
Pins on Pinterest. Our second application, Similar Looks
(Figure 1) is a discovery experience we tested specifically for fashion Pins. It allowed users to select a visual query from regions of interest (e.g. a bag or a pair of shoes) and identified visually similar Pins for users to explore or purchase. Instead of using the whole image, visual similarity is computed between the localized objects in the query and database images. To our knowledge, this is the first published work on object detection/localization in a commercially deployed visual search system.
Our experiments demonstrate that 1) one can achieve a very low false positive rate (less than 1%) with a good detection rate by combining object detection/localization methods with metadata, 2) using feature representations from the VGG [21][3] model significantly improves visual search accuracy on our Pinterest benchmark datasets, and 3) we observe significant gains in user engagement when visual search is used to power the Related Pins and Similar Looks applications.
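To make the retrieval step concrete, the following is a minimal, illustrative sketch of nearest-neighbor visual search over CNN embeddings (such as the VGG fc-layer activations referenced above). The function names and the brute-force index are stand-ins, not the production system, which uses fine-tuned models and a sharded, approximate index.

import numpy as np

def normalize(feat):
    # L2-normalize a CNN embedding (e.g. a VGG fully-connected activation).
    return feat / (np.linalg.norm(feat) + 1e-8)

def build_index(feature_list):
    # Stack normalized database embeddings into one matrix. A brute-force
    # matrix is enough to illustrate retrieval; a deployed system would use
    # a sharded or approximate nearest-neighbor index instead.
    return np.vstack([normalize(f) for f in feature_list])

def search(index, query_feat, k=10):
    # Rank database images by cosine similarity (dot product of normalized
    # embeddings) and return the indices and scores of the top-k neighbors.
    scores = index @ normalize(query_feat)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Usage with random stand-ins for real 4096-dimensional VGG features:
db = build_index([np.random.rand(4096) for _ in range(1000)])
ids, sims = search(db, np.random.rand(4096), k=5)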
As previously described, the text-based approach applies manually crafted rules (e.g. regular expressions) to the Pinterest metadata associated with images (which we treat as weak labels). For example, an image annotated with “spring fashion, tote with flowers” will be classified as “bag,” and is considered a positive sample if the image contains a “bag” object box label. For image-based evaluation, we compute the intersection between the predicted object bounding box and the labeled object bounding box of the same type, and count an intersection-over-union (IoU) ratio of 0.3 or greater as a positive match.
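As a concrete illustration of this matching criterion, the following sketch (with illustrative function names, not code from our system) computes the intersection-over-union of two axis-aligned boxes and applies the 0.3 threshold.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_positive_match(pred_box, gt_box, threshold=0.3):
    # A predicted box counts as a true positive when it overlaps a labeled
    # box of the same object type with IoU of at least 0.3.
    return iou(pred_box, gt_box) >= threshold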
Table 2: Object detection/classification accuracy (%)

                     Text           Img            Both
Objects      #       TP     FP      TP     FP      TP     FP
shoe         873     79.8   6.0     41.8   3.1     34.4   1.0
dress        383     75.5   6.2     58.8   12.3    47.0   2.0
glasses      238     75.2   18.8    63.0   0.4     50.0   0.2
bag          468     66.2   5.3     59.8   2.9     43.6   0.5
watch        36      55.6   6.0     66.7   0.5     41.7   0.0
pants        253     75.9   2.0     60.9   2.2     48.2   0.1
shorts       89      73.0   10.1    44.9   1.2     31.5   0.2
bikini       32      71.9   1.0     31.3   0.2     28.1   0.0
earrings     27      81.5   4.7     18.5   0.0     18.5   0.0
Average              72.7   6.7     49.5   2.5     38.1   0.5
Table 2 demonstrates that neither text annotation filters nor object localization alone were sufficient for our detection task due to their relatively high false positive rates of 6.7% and 2.5%, respectively. Not surprisingly, combining the two approaches significantly decreased our false positive rate to less than 1%.

Specifically, we saw that for classes like “glasses,” text annotations were insufficient and image-based classification excelled (due to the distinctive visual shape of glasses). For other classes, such as “dress,” this situation was reversed (the false positive rate for our dress detector was high, 12.3%, due to occlusion and high variance in style for that class, and the text annotation filter was needed to reach an acceptable false positive rate).
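In other words, the combined approach only surfaces an object when both the metadata filter and the visual detector agree. The sketch below shows that conjunction; the regular expressions and the confidence threshold are illustrative placeholders rather than the hand-crafted production rules.

import re

# Illustrative keyword rules mapping Pin text to object classes; the
# production rules are hand-crafted per category and far more extensive.
TEXT_RULES = {
    "bag":   re.compile(r"\b(bag|tote|handbag|purse)\b", re.IGNORECASE),
    "shoe":  re.compile(r"\b(shoe|heel|sneaker|boot)\b", re.IGNORECASE),
    "dress": re.compile(r"\b(dress|gown)\b", re.IGNORECASE),
}

def text_filter(annotation, object_class):
    # Weak label from Pin metadata: does the text mention the object class?
    rule = TEXT_RULES.get(object_class)
    return bool(rule and rule.search(annotation))

def accept_detection(annotation, object_class, detector_confidence,
                     confidence_threshold=0.8):
    # Keep a detected object only when the visual detector is confident AND
    # the text-based weak label agrees, which lowers the false positive rate.
    return (detector_confidence >= confidence_threshold
            and text_filter(annotation, object_class))

# Example: a detected "bag" is kept because the Pin text mentions a tote.
print(accept_detection("spring fashion, tote with flowers", "bag", 0.92))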
Live Experiments
Our system identified over 80 million “clickable” objects from a subset of Pinterest images. A clickable red dot is placed upon the detected object. Once the user clicks on the dot, our visual search system retrieves a collection of Pins most visually similar to the object. We launched the system to a small percentage of Pinterest live traffic and collected user engagement metrics such as CTR for a period of one month. Specifically, we looked at the clickthrough rate of the dot, the clickthrough rate on our visual search results, and also compared engagement on Similar Looks results with the existing Related Pins recommendations.

Figure 9: Once a user clicks on the red dot, the system shows products that have a similar appearance to the query object.

As shown in Figure 10, an average of 12% of users who viewed a pin with a dot clicked on a dot in a given day. Those users went on to click on an average of 0.55 Similar Looks results. Although this data was encouraging, when we compared engagement with all related content on the pin close-up (summing engagement with both Related Pins and Similar Looks results for the treatment group, and just Related Pins engagement for the control), Similar Looks actually hurt overall engagement on the pin close-up by 4%. After the novelty effect wore off, we saw a gradual decrease in CTR on the red dots, which stabilized at around 10%.
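For completeness, the sketch below shows how such engagement rates could be computed from raw event logs; the event schema (user_id and event type names) is hypothetical and not taken from our logging system.

from collections import defaultdict

def daily_dot_engagement(events):
    # Compute (a) the fraction of users who clicked a dot among users who
    # viewed a Pin with a dot, and (b) the average number of Similar Looks
    # result clicks per clicking user. `events` is an iterable of dicts with
    # a hypothetical schema:
    #   {"user_id": ..., "type": "dot_view" | "dot_click" | "result_click"}
    viewers, clickers = set(), set()
    result_clicks = defaultdict(int)
    for e in events:
        if e["type"] == "dot_view":
            viewers.add(e["user_id"])
        elif e["type"] == "dot_click":
            clickers.add(e["user_id"])
        elif e["type"] == "result_click":
            result_clicks[e["user_id"]] += 1
    dot_ctr = len(clickers) / len(viewers) if viewers else 0.0
    clicks_per_clicker = (sum(result_clicks[u] for u in clickers) / len(clickers)
                          if clickers else 0.0)
    return dot_ctr, clicks_per_clicker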
Figure 10: Engagement rates for the Similar Looks experiment.

To test the relevance of our Similar Looks results independently of the bias resulting from the introduction of a new user behavior (learning to click on the “object dots”), we designed an experiment to blend Similar Looks results directly into the existing Related Pins product (for Pins containing detected objects). This gave us a way to directly measure whether users found our visually similar recommendations relevant, compared to our non-visual recommendations. On Pins where we detected an object, this experiment increased overall engagement (repins and close-ups) in Related Pins by 5%. Although we set an initial static blending ratio for this experiment (one visually similar result to three production results), this ratio adjusts in response to user click data.
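A sketch of the blending step is shown below, assuming a hypothetical helper that interleaves one visually similar result for every three production Related Pins results; in the live system the ratio is a tunable value adjusted from observed click data rather than the fixed default used here.

def blend_results(production_results, visual_results, visual_ratio=0.25):
    # Interleave visually similar results into the production Related Pins
    # ranking. With visual_ratio=0.25 (one visual result for every three
    # production results, the initial static setting described above),
    # every fourth slot is filled from the visual search results.
    blended, vi, pi = [], 0, 0
    period = max(1, round(1 / visual_ratio))
    for slot in range(len(production_results) + len(visual_results)):
        take_visual = vi < len(visual_results) and (
            pi >= len(production_results) or (slot + 1) % period == 0)
        if take_visual:
            blended.append(visual_results[vi])
            vi += 1
        else:
            blended.append(production_results[pi])
            pi += 1
    return blended

# Example: six production results and two visually similar results.
print(blend_results(["p1", "p2", "p3", "p4", "p5", "p6"], ["v1", "v2"]))
# -> ['p1', 'p2', 'p3', 'v1', 'p4', 'p5', 'p6', 'v2']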
Figure 11: Examples of object search results for shoes. Boundaries of detected objects are automatically highlighted. The top image is the query image.

Figure 12: Samples of object detection and localization results for bags. [Green: ground truth, blue: detected objects.]

Figure 13: Samples of object detection and localization results for shoes.

5. CONCLUSION AND FUTURE WORK
We demonstrate that, with the availability of distributed computational platforms such as Amazon Web Services and open-source tools, it is possible for a handful of engineers or an academic lab to build a large-scale visual search system using a combination of non-proprietary tools. This paper presented our end-to-end visual search pipeline, including incremental feature updating and a two-step object detection and localization method that improves search accuracy and reduces development and deployment costs. Our live product experiments demonstrate that visual search features can increase user engagement.

We plan to further improve our system in the following areas. First, we are interested in investigating the performance and efficiency of CNN-based object detection methods in the context of live visual search systems. Second, we are interested in leveraging the Pinterest “curation graph” to enhance visual search relevance.
6. REFERENCES
[1] S. Bengio, J. Dean, D. Erhan, E. Ie, Q. V. Le,
A. Rabinovich, J. Shlens, and Y. Singer. Using web
co-occurrence statistics for improving image
categorization. CoRR, abs/1312.5697, 2013.
[2] T. L. Berg, A. C. Berg, J. Edwards, M. Maire,
R. White, Y.-W. Teh, E. Learned-Miller, and D. A.
Forsyth. Names and faces in the news. In Proceedings
of the Conference on Computer Vision and Pattern
Recognition (CVPR), pages 848–854, 2004.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, and
A. Zisserman. Return of the devil in the details:
Delving deep into convolutional nets. In British
Machine Vision Conference, 2014.
[4] M. Cheng, N. Mitra, X. Huang, P. H. S. Torr, and
S. Hu. Global contrast based salient region detection.
Transactions on Pattern Analysis and Machine
Intelligence (T-PAMI), 2014.
[5] R. Datta, D. Joshi, J. Li, and J. Wang. Image
retrieval: Ideas, influences, and trends of the new age.
ACM Computing Surveys, 40(2):5:1–5:60, May 2008.
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov.
Scalable object detection using deep neural networks.
In 2014 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2014, Columbus, OH,
USA, June 23-28, 2014, pages 2155–2162, 2014.
[7] P. F. Felzenszwalb, R. B. Girshick, and D. A.
McAllester. Cascade object detection with deformable
part models. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages
2241–2248, 2010.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik.
Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid
pooling in deep convolutional networks for visual
recognition. In Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), pages 346–361. Springer, 2014.
[10] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and
N. Sundaresan. Large scale visual recommendations
from street fashion images. In Proceedings of the
International Conference on Knowledge Discovery and
Data Mining (SIGKDD), 14, pages 1925–1934, 2014.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev,
J. Long, R. Girshick, S. Guadarrama, and T. Darrell.
Caffe: Convolutional architecture for fast feature
embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] Y. Jing and S. Baluja. Visualrank: Applying pagerank
to large-scale image search. IEEE Transactions on
Pattern Analysis and Machine Intelligence (T-PAMI),
30(11):1877–1890, 2008.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung,
R. Sukthankar, and L. Fei-Fei. Large-scale video
classification with convolutional neural networks. In
2014 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1725–1732, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet
classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems
(NIPS), pages 1097–1105. 2012.
[15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson,
R. E. Howard, W. Hubbard, and L. D. Jackel.
Backpropagation applied to handwritten zip code
recognition. Neural Comput., 1(4):541–551, Dec. 1989.
[16] S. Liu, Z. Song, M. Wang, C. Xu, H. Lu, and S. Yan.
Street-to-shop: Cross-scenario clothing retrieval via
parts alignment and auxiliary set. In Proceedings of
the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2012.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully
convolutional networks for semantic segmentation.
arXiv preprint arXiv:1411.4038, 2014.
[18] M. Muja and D. G. Lowe. Fast matching of binary
features. In Proceedings of the Conference on
Computer and Robot Vision (CRV), 12, pages
404–410, Washington, DC, USA, 2012. IEEE
Computer Society.
[19] H. Müller, W. Müller, D. M. Squire,
S. Marchand-Maillet, and T. Pun. Performance
evaluation in content-based image retrieval: Overview
and proposals. Pattern Recognition Letters,
22(5):593–601, 2001.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause,
S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. ImageNet large scale
visual recognition challenge. arXiv preprint
arXiv:1409.0575, 2014.
[21] K. Simonyan and A. Zisserman. Very deep
convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. arXiv
preprint arXiv:1409.4842, 2014.
[23] K. Yamaguchi, M. H. Kiapour, L. Ortiz, and T. Berg.
Retrieving similar styles to parse clothing.
Transactions on Pattern Analysis and Machine
Intelligence (T-PAMI), 2014.
[24] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 13, pages 1155–1162, Washington, DC, USA, 2013.