Image-Based Product Recommendation System With Convolutional Neural Networks
Abstract

Most on-line shopping search engines still depend largely on a knowledge base and use keyword matching as their search strategy to find the product that consumers most likely want to buy. This is inefficient because the description of a product can vary a lot from the seller's side to the buyer's side.

In this paper, we present a smart search engine for on-line shopping. It takes images as its input and tries to understand the information about products from these images. We first use a neural network to classify the input image as one of the product categories. We then use another neural network to model the similarity score between pairs of images, which is used to select the closest product in our e-item database. We use Jaccard similarity to calculate the similarity score for the training data. We collect product information (including image, class label, etc.) from Amazon to learn these models. Specifically, our dataset contains information about 3.5 million products with images, and there are 20 categories in total. Our method achieves a classification accuracy of 0.5. Finally, we are able to recommend products with similarity higher than 0.5 and offer fast and accurate on-line shopping support.

1. Introduction

The on-line retail ecosystem is evolving fast, and on-line shopping is unavoidably growing around the world. The digital analytics firm eMarketer projects that on-line retail sales will continue to double and account for more than 12% of global sales by 2019. As reported in the Nielsen Global Connected Commerce Survey (2015)¹, 63% of respondents who shopped for or purchased in the travel products or services category, for example, in the past six months say they looked up the product on-line.

¹ The Nielsen Global Connected Commerce Survey was conducted between August and October 2015 and polled more than 13,000 consumers in 26 countries throughout Asia-Pacific, Europe, Latin America, the Middle East, Africa and North America.

However, the explosive growth in the amount of available digital information has created the challenge of information overload for on-line shoppers, which inhibits timely access to items of interest on the Internet. This has increased the demand for recommendation systems. Although almost every e-commerce company nowadays has its own recommendation system that can provide all sorts of suggestions, these systems are mostly text-based and usually rely on a knowledge base and keyword matching. This requires on-line shoppers to provide descriptions of products, which can vary a lot from the sellers' side to the buyers' side.

With the rapid development of neural networks in recent years, we can now change the traditional search paradigm from text description to visual discovery. A snapshot of a product tells a detailed story of its appearance, usage, brand and so on. While a few pioneering works on image-based search have been applied, the application of image matching using artificial intelligence in the on-line shopping field remains largely unexplored. Based on this idea, we build a smart recommendation system that takes images of objects, instead of description text, as its input.

The input to our algorithm is an image of any object that the customer wants to buy. We then use a Convolutional Neural Network (CNN) model to classify the category that this object probably belongs to, and use the input vector of the last fully connected layer as a feature vector to feed into a similarity calculation CNN model to find the closest products in our database. More concretely, the two functionalities that we want to achieve in the recommendation system are:

1. Classification: given a photo of the product taken by the customer, find the category that this product most likely belongs to. We have 20 categories in our data set in total. The details of these categories are shown
in the Datasets and Features section. For example, an image of iPhone(s) will be classified as "Cell Phones & Accessories".

2. Recommendation: given the features of the photo and the category that this product belongs to, calculate similarity scores and find the most similar products in our database. Ideally, people looking for iPhones should be recommended iPhones.

2. Related Work

Paper [6] presented an idea of combining image recommendation and image recommendation decades ago. In this project, we use the Amazon product dataset, which is used to build typical recommender systems using collaborative filtering in [4] and [8]. In the field of image recommendation, [5] recommends images using tuned perceptual retrieval (PR), complementary nearest neighbor consensus (CNNC), Gaussian mixture models (GMM), Markov chain (MCL), and texture agnostic retrieval (TAR), etc. CNNC, GMM, TAR, and PR are easy to train, but CNNC and GMM are hard to test, while PR, GMM, and TAR are hard to generalize. Also, since the data consists of images, a neural network is a method worth trying.

Paper [7] presented the AlexNet model, which can classify images into 1000 different categories. In addition, paper [9] presented the VGG neural network, which classifies images in the ImageNet Challenge 2014. In the first part of our project, we use both models to classify the categories of the products. However, neither paper presented a method for image recommendation.

Although there are papers that study image similarity, such as [12] and [11], most of them are based on category similarity, i.e. products are regarded as similar if they are in the same category. However, products that come from the same category can still vary a lot. Thus, one reliable strategy is to first classify the target image into a certain category and then recommend images from this classified category.

In paper [13], the authors considered using neural networks to calculate the similarities within a category. However, the paper only considers ConvNet, DeepRanking, etc. Since we have a larger dataset, deeper convolutional neural networks such as AlexNet and VGG should outperform naive ConvNets. The idea can also be found in [3] and [10].

Paper [1] also focuses on learning similarity using CNNs. However, it concentrates on the case where multiple products are contained in a single image. In our project, we assume that users are looking for a single product, so an image contains only one product.

Before we recommend, we need to answer: what is the measurement of similarity? The most natural answer is either cosine similarity or L2 norm similarity. Another way to measure the similarity is by introducing semantic information. Paper [2] indicates that visual similarity and semantic similarity are correlated. Thus, we introduce a new model to calculate similarities between images based on semantic information. Papers [15] and [14] share the same idea as we do here.

3. Approach

There are two major problems that we want to solve in our project: first, determine the category that a given image belongs to; second, find and recommend the most similar products according to the given image. Since our project is mainly based on convolutional neural networks, we first introduce the commonly used convolutional neural network layers.

3.1. CNN Layers

The most important building block of a CNN is the convolutional (conv) layer. As we can see from Figure 1, a conv layer translates a small rectangle of the input layer into a number in the output layer using matrix multiplication.

Figure 1. conv layer

Pooling layers are similar to convolutional layers, except that they use a non-parametric method to transform a small rectangle into a number. Max pooling is commonly used in CNNs; it outputs the maximum number in the rectangle of the input layer.

Figure 2. pooling layer

3.2. Classification

In this step, we would like to classify an input image into one of the 20 categories. We construct AlexNet and VGG models for the classification task and compare them with an SVM model as a baseline.

• Support Vector Machine: a linear classification model, used as a baseline model here. This model is basically a fully connected layer. We use the multi-class Support Vector Machine (SVM) loss plus an L2 norm term as the loss function. For an image i, we use the RGB pixels as input features x_i ∈ R^d, where d = 224 × 224. We calculate the class scores for n = 20 classes through a linear transformation

s = W x_i + b    (1)
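To make the SVM baseline concrete, the following is a minimal NumPy sketch of the linear scores in Eq. (1) together with a multi-class hinge loss and an L2 penalty. The margin of 1, the form of the penalty (sum of squared weights), and the function names are assumptions for illustration; the exact loss used in the experiments is not spelled out here.

```python
import numpy as np

def svm_scores(W, b, x):
    """Linear class scores s = W x + b, as in Eq. (1)."""
    return W @ x + b

def multiclass_svm_loss(W, b, x, y, margin=1.0, reg=0.01):
    """Multi-class hinge loss for one example plus an L2 penalty on W.

    `margin` and `reg` are illustrative defaults (assumptions).
    """
    s = svm_scores(W, b, x)
    margins = np.maximum(0.0, s - s[y] + margin)
    margins[y] = 0.0  # the correct class contributes no loss
    return margins.sum() + reg * np.sum(W * W)

# Toy usage: 3 classes, 2 input features, correct class is 0.
W = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
b = np.zeros(3)
x = np.array([3.0, 0.0])
loss = multiclass_svm_loss(W, b, x, y=0)
```

In this toy case the correct class beats the others by more than the margin, so only the L2 term contributes to the loss.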
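As an illustration of the layers described in Section 3.1, the sketch below implements a single-filter convolution and a max pooling step in plain NumPy. It assumes a 2-D single-channel input, stride 1 and no padding for the convolution, and non-overlapping windows for pooling; the function names are hypothetical, and real models would use a deep learning framework instead.

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0):
    """Stride-1, no-padding convolution of a 2-D image with one filter.

    As in most deep learning libraries, the filter is not flipped
    (strictly speaking this is cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Each small rectangle becomes one number via a dot product.
            out[i, j] = patch.ravel() @ kernel.ravel() + bias
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
feature_map = conv2d_single(image, np.ones((3, 3)))  # shape (2, 2)
pooled = max_pool2d(image, size=2)                   # shape (2, 2)
```

The convolution has learnable parameters (the filter and bias), while the pooling step has none, which is the distinction drawn in Section 3.1.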
For the recommendation step, we use the last fully connected layer in our classification model as feature vectors

The larger the score s_cosine is, the more similar the two images are.
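A cosine score of this kind can be sketched as follows. This is the standard cosine similarity applied to feature vectors, with a simple top-k lookup on top of it; the function names, the `eps` guard against zero vectors, and the toy data are assumptions for illustration, not the exact implementation used in the experiments.

```python
import numpy as np

def cosine_similarity(query, candidates, eps=1e-12):
    """Cosine similarity between one feature vector and each row of `candidates`."""
    q = query / max(np.linalg.norm(query), eps)
    norms = np.maximum(np.linalg.norm(candidates, axis=1, keepdims=True), eps)
    return (candidates / norms) @ q

def top_k_similar(query, candidates, k=4):
    """Indices and scores of the k most similar candidates, best first."""
    scores = cosine_similarity(query, candidates)
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy usage: four database items with 2-D features.
database = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]])
order, scores = top_k_similar(np.array([1.0, 0.0]), database, k=2)
```

In a full system the candidate rows would be restricted to products in the predicted category before ranking.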
The data-driven approach to calculate the similarity score is to train the following 3-layer neural network:

h_1 = f(v \cdot W_1 + b_1)
h_2 = f(h_1 \cdot W_2 + b_2)    (5)
s_{model} = \mathrm{sigmoid}(h_2 \cdot W_3 + b_3)

• salesRank - sales rank information
• brand - brand name
• categories - list of categories the product belongs to
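The forward pass of the 3-layer network in Eq. (5) can be sketched in NumPy as below. Several details are assumptions, since they are not defined in this part of the text: `v` is taken to be the concatenation of the two images' feature vectors, `f` is taken to be ReLU, and the layer sizes are arbitrary. The `jaccard` helper shows the Jaccard similarity mentioned in the abstract as the source of training scores; which sets it is computed over is not stated here, so the tag lists in the test are purely hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_score(v, params, f=relu):
    """Forward pass of Eq. (5): two hidden layers, then a sigmoid output in (0, 1)."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = f(v @ W1 + b1)
    h2 = f(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

rng = np.random.default_rng(0)
feat_a, feat_b = rng.normal(size=8), rng.normal(size=8)
v = np.concatenate([feat_a, feat_b])  # assumed encoding of an image pair
params = (rng.normal(size=(16, 6)) * 0.1, np.zeros(6),
          rng.normal(size=(6, 4)) * 0.1, np.zeros(4),
          rng.normal(size=(4, 1)) * 0.1, np.zeros(1))
s_model = similarity_score(v, params)  # array of shape (1,)
```

Because the output passes through a sigmoid, it lies in (0, 1) and can be regressed against a target similarity in the same range.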
100, regularization coefficient 0.01, learning rate 0.0007 and dropout 0.5. We can see from the results that our models suffer from over-fitting. The training accuracy of both the AlexNet model and the VGG model is almost 1.5 times the accuracy on their validation and test sets. That is why we need relatively high regularization coefficients (0.01). But as we increase the regularization coefficient, the test accuracy does not go up any more. For the same reason, the dropout fractions we choose for these two models are also relatively large (0.6 and 0.5 respectively). However, we still suffer from over-fitting to some extent.

Figure 7. Preprocessing of the input image. Left: the original input image (960 × 1280 pixels). Middle: the resized image (224 × 224 pixels). Right: the demeaned image.
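The two preprocessing steps in Figure 7 (resize, then subtract the mean) can be sketched as follows. The nearest-neighbour resize is a dependency-free stand-in for a real image library, and the per-channel mean is computed from the toy image itself; both are assumptions for illustration, since in practice the mean would typically come from the training set and the interpolation method is not stated here.

```python
import numpy as np

def resize_nearest(image, out_h=224, out_w=224):
    """Nearest-neighbour resize of an (H, W, C) array."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols[None, :]]

def preprocess(image, mean_rgb):
    """Resize to 224 x 224, then subtract a per-channel mean."""
    resized = resize_nearest(image).astype(np.float32)
    return resized - np.asarray(mean_rgb, dtype=np.float32)

# Toy usage: a random 960 x 1280 (width x height) RGB image.
raw = np.random.default_rng(0).integers(0, 256, size=(1280, 960, 3), dtype=np.uint8)
mean_rgb = raw.reshape(-1, 3).mean(axis=0)  # assumption: stand-in for a training-set mean
x = preprocess(raw, mean_rgb)
```

The result is a float array of shape (224, 224, 3), matching the input size expected by AlexNet and VGG style models.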
category and extract features. Our recommendation is based on the cosine similarity score, since we find that it outperforms the L2 similarity. The left column shows the input images that the user took. The middle column shows the top four similar products that our system recommends. The right column shows the top four similar products that the Amazon App recommends.

Currently we only use 20 categories when doing classification. However, products within a category vary a lot, which explains our low classification accuracy. We would try to find more specific category information and train our model on it.

Besides, we would also like to try deeper neural networks such as ResNet.
7. Appendices
7.1. UI
We build an interactive app for our recommendation system. In this UI, we can upload images or take photos and get recommendations from our models. Figure 9 shows the interface of the UI for an example watch image.
on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee, 2016.
[5] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan. Large scale visual recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1925–1934. ACM, 2014.
[6] I. Kanellopoulos and G. Wilkinson. Strategies and best practice for neural network image classification. International Journal of Remote Sensing, 18(4):711–725, 1997.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[8] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM, 2015.
[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[10] M. Tan, S.-P. Yuan, and Y.-X. Su. A learning-based approach to text image retrieval: using cnn features and improved similarity metrics. arXiv preprint arXiv:1703.08013, 2017.
[11] G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus. Learning invariance through imitation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2729–2736. IEEE, 2011.
[12] G. Wang, D. Hoiem, and D. Forsyth. Learning image similarity from flickr groups using stochastic intersection kernel machines. In Computer Vision, 2009 IEEE 12th International Conference on, pages 428–435. IEEE, 2009.
[13] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
[14] J. Yang, J. Fan, D. Hubball, Y. Gao, H. Luo, W. Ribarsky, and M. Ward. Semantic image browser: Bridging information visualization with automated intelligent image analysis. In Visual Analytics Science And Technology, 2006 IEEE Symposium On, pages 191–198. IEEE, 2006.
[15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.