
Image-based Product Recommendation System with Convolutional Neural Networks

Luyang Chen, Fan Yang, Heqing Yang
Stanford University
450 Serra Mall, Stanford, CA
lych@stanford.edu, fanfyang@stanford.edu, heqing@stanford.edu

Abstract

Most on-line shopping search engines still largely depend on a knowledge base and use keyword matching as their search strategy to find the product that a consumer most likely wants to buy. This is inefficient because the description of a product can vary considerably between the seller's side and the buyer's side.

In this paper, we present a smart search engine for on-line shopping. It takes images as input and tries to understand product information from them. We first use a neural network to classify the input image into one of the product categories, and then use another neural network to model the similarity score between pairs of images, which is used to select the closest products in our e-item database. We use the Jaccard similarity of product titles to compute similarity scores for the training data. We collect product information (including images, class labels, etc.) from Amazon to learn these models. Specifically, our dataset contains information about 3.5 million products with images, covering 20 categories in total. Our method achieves a classification accuracy of 0.5. Finally, we are able to recommend products with similarity scores higher than 0.5 and offer fast and accurate on-line shopping support.

1. Introduction

The on-line retail ecosystem is evolving fast, and on-line shopping is growing inexorably around the world. The digital analytics firm eMarketer projects that on-line retail sales will continue to double and will account for more than 12% of global sales by 2019. As reported in the Nielsen Global Connected Commerce Survey (2015)^1, 63% of respondents who shopped or purchased in the travel products or services category in the past six months, for example, say they looked up the product on-line.

^1 The Nielsen Global Connected Commerce Survey was conducted between August and October 2015 and polled more than 13,000 consumers in 26 countries throughout Asia-Pacific, Europe, Latin America, the Middle East, Africa and North America.

However, the explosive growth in the amount of available digital information has created the challenge of information overload for on-line shoppers, which inhibits timely access to items of interest on the Internet. This has increased the demand for recommendation systems. Though almost every e-commerce company nowadays has its own recommendation system that can provide all sorts of suggestions, these systems are mostly text-based and usually rely on a knowledge base with keyword matching. They therefore require on-line shoppers to provide descriptions of products, which can vary a lot between the sellers' side and the buyers' side.

With the rapid development of neural networks in recent years, we can now shift the traditional search paradigm from text description to visual discovery. A snapshot of a product tells a detailed story about its appearance, usage, brand, and so on. While a few pioneering works on image-based search have appeared, the application of image matching with artificial intelligence to on-line shopping remains largely unexplored. Based on this idea, we build a smart recommendation system that takes images of objects, instead of description text, as its input.

The input to our algorithm is an image of any object that the customer wants to buy. We use a Convolutional Neural Network (CNN) model to classify the category that this object most probably belongs to, and we use the input vector of the last fully connected layer as a feature vector to feed into a similarity-calculation CNN model that finds the closest products in our database. More concretely, the two functionalities that we want to achieve in the recommendation system are:

1. Classification: given a photo of the product taken by the customer, find the category that this product most likely belongs to. We have 20 categories in our dataset in total. The details of these categories are shown in the Dataset and Features section. For example, an image of an iPhone will be classified as "Cell Phones & Accessories".

2. Recommendation: given the features of the photo and the category that this product belongs to, calculate similarity scores and find the most similar products in our database. Ideally, people looking for iPhones should be recommended iPhones.

2. Related Work

Paper [6] presented the idea of applying neural networks to image classification decades ago. In this project, we use the Amazon product dataset, which was used to build typical recommender systems based on collaborative filtering in [4] and [8]. In the field of image recommendation, [5] recommends images using tuned perceptual retrieval (PR), complementary nearest neighbor consensus (CNNC), Gaussian mixture models (GMM), Markov chains (MCL), and texture-agnostic retrieval (TAR). CNNC, GMM, TAR, and PR are easy to train, but CNNC and GMM are hard to test, while PR, GMM, and TAR are hard to generalize. Also, since the data consists of images, neural networks are a method worth trying.

Paper [7] presented the AlexNet model, which can classify images into 1000 different categories. In addition, paper [9] presented the VGG neural network, which classified images in the ImageNet Challenge 2014. In the first part of our project, we use both models to classify the categories of the products. However, neither paper presents a method for image recommendation.

Although there are papers that study image similarity, such as [12] and [11], most of them are based on category similarity, i.e., products are regarded as similar if they are in the same category. However, products from the same category can still vary a lot. Thus, one reliable strategy is to first classify the target image into a certain category and then recommend images from this category.

In paper [13], the authors considered using neural networks to calculate similarities within a category. However, that paper only considers ConvNet, DeepRanking, etc. Since we have a larger dataset, deeper convolutional neural networks such as AlexNet and VGG should outperform naive ConvNets. The same idea can also be found in [3] and [10].

Paper [1] also focuses on learning similarity with CNNs. However, it considers more the case where multiple products are contained in a single image. In our project, we assume that users are looking for one product, so each image contains only one product.

Before we recommend, we need to decide what the measurement of similarity is. The most natural answers are cosine similarity and L2-norm similarity. Another way to measure similarity is by introducing semantic information. Paper [2] indicates that visual similarity and semantic similarity are correlated. Thus, we introduce a new model to calculate similarities between images based on semantic information. Papers [15] and [14] share the same idea as ours.

3. Approach

There are two major problems that we want to solve in our project: first, determine the category that a given image belongs to; second, find and recommend the most similar products given the image. Since our project is mainly based on convolutional neural networks, we first introduce the commonly used convolutional neural network layers.

3.1. CNN Layers

The most important building block of a CNN is the convolutional (conv) layer. As we can see from Figure 1, a conv layer translates a small rectangle of the input layer into a single number of the output layer using matrix multiplication.

Figure 1. conv layer

Pooling layers are similar to convolutional layers except that they use a non-parametric method to transform a small rectangle into a number. Max pooling is commonly used in CNNs; it outputs the maximum number in the rectangle of the input layer.

Figure 2. pooling layer
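
As a concrete illustration of these two layer types, the following minimal NumPy sketch (our illustration, not code from the original system) computes a single-filter convolution and a 2 x 2 max pooling:

    import numpy as np

    def conv2d_single(x, w, b, stride=1):
        # Slide a k x k filter w over the input x; each small rectangle is
        # mapped to one number via an elementwise product and sum, plus a bias.
        k = w.shape[0]
        out_h = (x.shape[0] - k) // stride + 1
        out_w = (x.shape[1] - k) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
                out[i, j] = np.sum(patch * w) + b
        return out

    def max_pool(x, size=2):
        # Non-parametric: keep only the maximum of each size x size rectangle.
        out = np.zeros((x.shape[0] // size, x.shape[1] // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
        return out

    x = np.random.randn(6, 6)
    y = conv2d_single(x, w=np.random.randn(3, 3), b=0.1)  # 6x6 -> 4x4
    z = max_pool(y)                                       # 4x4 -> 2x2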

3.2. Classification

In this step, we would like to classify an input image into one of the 20 categories. We construct AlexNet and VGG models for the classification task and compare them with an SVM model as a baseline.

• Support Vector Machine: a linear classification model, used as the baseline here. This model is basically a single fully connected layer. We use the multi-class Support Vector Machine (SVM) loss plus an L2-norm term as the loss function. For an image i, we use the RGB pixels as input features x_i ∈ R^d, where d = 224 × 224. We calculate the class scores for the n = 20 classes through a linear transformation

s = W x_i + b    (1)

where W ∈ R^{n×d} is the weight matrix and b ∈ R^n is the bias term. The SVM loss is given by

L_{SVM}(W, b; x_i) = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)    (2)

where y_i is the label of the true class.

• AlexNet: a deep convolutional neural network classification model proposed by [7]. As we can see (Figure 3), the AlexNet model first contains 2 convolutional layers with max pooling and batch normalization; then there are 3 convolutional layers with separated features, and one max pooling before three fully connected layers.

Figure 3. AlexNet model

The original model was trained to classify images in the ImageNet LSVRC-2010 contest, which had 1000 categories. Since our problem only contains 20 categories, we change the last fully connected layer to 4096 × 20. To save time, we use the pre-trained weights of the first five convolutional layers and train only the last three fully connected layers.

• VGG: a deep convolutional neural network classification model proposed by [9]. As shown below (Figure 4), VGG contains 13 convolutional layers, with max pooling after every 2 or 3 convolutional layers, followed by 3 fully connected layers and a softmax as the final layer.

Figure 4. VGG model

The original model was trained to classify images in the ImageNet ILSVRC-2014 contest, which had 1000 categories. We change the last fully connected layer to 4096 × 20. We also utilize the pre-trained weights as the initialization of the parameters and train only the last three fully connected layers. In addition, we add batch normalization layers after the activation functions in the first two fully connected layers. (A sketch of this adaptation follows the list.)
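
The report does not name the training framework, so the following PyTorch/torchvision sketch is only one possible realization of the adaptation described above for VGG16; the added batch normalization layers are omitted for brevity, and the learning rate and regularization echo the VGG settings reported in Section 5.3:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    NUM_CLASSES = 20  # the 20 Amazon product categories

    model = models.vgg16(pretrained=True)

    # Freeze the pre-trained convolutional layers; only the head is trained.
    for p in model.features.parameters():
        p.requires_grad = False

    # Swap the last fully connected layer (4096 x 1000) for a 4096 x 20 layer.
    model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

    # Train only the parameters that still require gradients, i.e. the three
    # fully connected layers; lr and weight decay follow Section 5.3.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=7e-4, weight_decay=0.01)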

3.3. Recommendation

For the recommendation step, we use the input of the last fully connected layer in our classification model as the feature vector of an image. For every image in the dataset there is one corresponding feature vector, and this feature vector is the input to our recommendation model. The workflow of this step is described in the following bullets.

• Feature extraction: the classification model is used to identify which category the target image belongs to. We then extract the input of the last fully connected layer of the classification model as the features.

• Input of the model: the feature vector of the target image, extracted as above.

• Similarity calculation: use different measures to calculate similarity scores between the feature vector of the target image and the feature vectors of all images in the target category. We have tried L2 distance, cosine distance, and neural network models to compute the similarity scores (see the sketches following this list). For two different images i and j, the L2 distance score is defined as

s_{L2} = ||v_i - v_j||_2    (3)

where v_i, v_j ∈ R^l are the two corresponding feature vectors, and l = 4096 is the length of the feature vectors. The smaller the score s_{L2} is, the more similar the two images are.

The cosine distance score is defined as

s_{cosine} = v_i^T v_j / (||v_i|| ||v_j||)    (4)

The larger the score s_{cosine} is, the more similar the two images are.

The data-driven approach to calculating the similarity score is to train the following 3-layer neural network:

h_1 = f(v · W_1 + b_1)
h_2 = f(h_1 · W_2 + b_2)    (5)
s_{model} = sigmoid(h_2 · W_3 + b_3)

where v = [v_1, v_2] ∈ R^{l×2} is obtained by concatenating the two feature vectors, and f(x) = max(0.01x, x) is the leaky ReLU function. The first layer can be treated as a 1-d convolutional layer with leaky ReLU as the activation function and W_1 ∈ R^2 and b_1 ∈ R as parameters. The second layer is a fully connected layer with leaky ReLU as the activation function and W_2 ∈ R^l and b_2 ∈ R as parameters. The output layer is a linear transformation with the sigmoid function as the activation function. The larger the score s_{model} is, the more similar the two images are.

There is no easy way to define a ground-truth similarity score purely from the image pixels. Fortunately, each input image has a corresponding title describing the product. To characterize how similar two images are, we use the Jaccard similarity of the two sets of tokens in the titles of the two images as their similarity. The Jaccard similarity of two sets A and B is defined as

s_{Jaccard} = |A ∩ B| / |A ∪ B|    (6)

which is a number between 0 and 1. This is also the reason that we use the sigmoid function as the activation function of the last layer. We train this model by minimizing the L2 loss ||s_{model} - s_{Jaccard}||_2^2.

• Output: the top k images (products) that are most similar to the target image.
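
For concreteness, the three similarity measures can be sketched as follows (a minimal illustration; tokenizing titles into lower-cased, whitespace-separated words is our assumption):

    import numpy as np

    def l2_score(vi, vj):
        # Equation (3): smaller means more similar.
        return np.linalg.norm(vi - vj)

    def cosine_score(vi, vj):
        # Equation (4): larger means more similar.
        return vi.dot(vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

    def jaccard_score(title_i, title_j):
        # Equation (6): Jaccard similarity of the sets of title tokens.
        a, b = set(title_i.lower().split()), set(title_j.lower().split())
        return len(a & b) / len(a | b)

    vi, vj = np.random.randn(4096), np.random.randn(4096)  # 4096-d features
    print(l2_score(vi, vj), cosine_score(vi, vj))
    print(jaccard_score("apple iphone 6 16gb gold", "apple iphone 6 64gb silver"))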

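A PyTorch sketch of the 3-layer scoring network of equation (5), under our reading that the first layer mixes the two stacked feature vectors position-wise (a 1-d convolution with a length-2 filter):

    import torch
    import torch.nn as nn

    class SimilarityNet(nn.Module):
        # The 3-layer scorer of equation (5); l = 4096 is the feature length.
        def __init__(self, l=4096):
            super().__init__()
            self.w1 = nn.Linear(2, 1)      # layer 1: mixes the pair position-wise
            self.w2 = nn.Linear(l, 1)      # layer 2: fully connected, R^l -> R
            self.w3 = nn.Linear(1, 1)      # layer 3: linear map before the sigmoid
            self.act = nn.LeakyReLU(0.01)  # f(x) = max(0.01x, x)

        def forward(self, vi, vj):
            v = torch.stack([vi, vj], dim=-1)       # shape (batch, l, 2)
            h1 = self.act(self.w1(v)).squeeze(-1)   # shape (batch, l)
            h2 = self.act(self.w2(h1))              # shape (batch, 1)
            return torch.sigmoid(self.w3(h2)).squeeze(-1)  # score in (0, 1)

    net = SimilarityNet()
    vi, vj = torch.randn(8, 4096), torch.randn(8, 4096)
    targets = torch.rand(8)                       # Jaccard similarities in practice
    loss = ((net(vi, vj) - targets) ** 2).mean()  # the L2 training loss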

4. Dataset and Features

To build the recommendation system, we use Amazon product image data spanning May 1996 to July 2014, which covers 9.4 million products. Excluding the products that lack images, we collected a dataset of 3.5 million products with 20 categories in total. Figure 5 shows the distribution of all the labels in the dataset. The detailed information for each image contains:

• asin - ID of the product, e.g. 0000031852

• title - name of the product

• price - price in US dollars (at time of crawl)

• imUrl - url of the product image

• related - related products (also bought, also viewed, bought together, buy after viewing)

• salesRank - sales rank information

• brand - brand name

• categories - list of categories the product belongs to

Figure 5. Label distribution of the dataset.

Considering the imbalance across the different classes, and due to the limitation of machine memory, we randomly sample 500 images from each class and so collect 10,000 images for the classification task (a sketch of this step follows Figure 6). We then split the dataset 7:2:1 into training, validation, and testing sets respectively. Each image in the dataset has 300 × 300 pixels. We use the raw pixels of the images as the input for our classification neural network model. Examples of the data are shown in Figure 6. For the convenience of tuning hyper-parameters, we resize images to 224 × 224 × 3 using "scipy.misc" for VGG, and to 227 × 227 for AlexNet.

Figure 6. Examples of the data. These are three products from category "Computers & Accessories".
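
The sampling and split can be sketched as follows (illustrative only; by_class is a hypothetical mapping from a category name to the list of its images):

    import random

    def build_classification_set(by_class, per_class=500, seed=0):
        # Draw 500 images per category: 20 classes -> 10,000 images in total,
        # then split 7:2:1 into training, validation and test sets.
        rng = random.Random(seed)
        data = []
        for label, images in by_class.items():
            data.extend((img, label) for img in rng.sample(images, per_class))
        rng.shuffle(data)
        n = len(data)
        return (data[:int(0.7 * n)],
                data[int(0.7 * n):int(0.9 * n)],
                data[int(0.9 * n):])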

5. Experiment

5.1. Data preprocessing

The raw images need pre-processing before being used as inputs to the classification models. An original image is first resized to the standard input size of either the VGG model (224 × 224) or the AlexNet model (227 × 227). Then it is demeaned in each channel (Figure 7).

Figure 7. Preprocessing of the input image. The left is the original input image (960 × 1280 pixels). The middle is the resized image (224 × 224 pixels). The right is the demeaned image.
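
A sketch of this preprocessing step (the report resizes with "scipy.misc"; we use Pillow here as an equivalent, and take the per-image channel mean as one reading of "demeaned in each channel"):

    import numpy as np
    from PIL import Image

    def preprocess(path, size=224):
        # Resize to the model input size: 224 for VGG, 227 for AlexNet.
        img = Image.open(path).convert("RGB").resize((size, size))
        x = np.asarray(img, dtype=np.float32)      # shape (size, size, 3)
        x -= x.mean(axis=(0, 1), keepdims=True)    # subtract each channel's mean
        return x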

For the recommendation model, the ground-truth similarity is defined using the Jaccard similarity (6) of the sets of tokens in the titles of two images (title information is attached to each image in the dataset). However, we care more about the pairs of images that are similar: if we used all the data, the majority of pairs would have similarity scores close to 0. Therefore, instead of using all pairs, we only consider pairs within the same category. Moreover, we particularly want to find image pairs that have a relatively large Jaccard similarity. We keep the pairs with similarity scores above 0.5, of which there are around 800, and we also sample 1000 pairs with similarity scores equal to 0. We use these pairs as training examples.
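
The pair-mining step can be sketched as follows (illustrative; items is a hypothetical list of (features, title) records from one category, and jaccard_score is the title-based helper sketched in Section 3.3):

    import random
    from itertools import combinations

    def mine_pairs(items, jaccard_score, threshold=0.5, n_neg=1000):
        # Positive pairs: title Jaccard similarity above 0.5 (about 800 pairs).
        # Negative pairs: a random sample of 1000 pairs with similarity 0.
        positives, zeros = [], []
        for (fi, ti), (fj, tj) in combinations(items, 2):
            s = jaccard_score(ti, tj)
            if s > threshold:
                positives.append((fi, fj, s))
            elif s == 0.0:
                zeros.append((fi, fj, 0.0))
        return positives + random.sample(zeros, min(n_neg, len(zeros)))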

5.2. Evaluation

We split the dataset 7:2:1 into training, validation, and test sets respectively. To evaluate the models, we run them on the test dataset and compare the output with the ground truth.

For the classification problem, we evaluate the model by calculating the classification accuracy:

Accuracy = (#correctly classified images) / (#images in the evaluation dataset)    (7)

For the recommendation task, we evaluate the model using the root mean square error (RMSE):

RMSE = \sqrt{(1/N) \sum_{i=1}^{N} (s_{model,i} - s_{Jaccard,i})^2}    (8)
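
Both evaluation metrics are straightforward to compute; a minimal sketch:

    import numpy as np

    def accuracy(pred, truth):
        # Equation (7): fraction of correctly classified images.
        return float(np.mean(np.asarray(pred) == np.asarray(truth)))

    def rmse(s_model, s_jaccard):
        # Equation (8): root mean square error of predicted similarity scores.
        d = np.asarray(s_model) - np.asarray(s_jaccard)
        return float(np.sqrt(np.mean(d ** 2)))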

5.3. Classification

For the classification task, we trained two convolutional neural networks (VGG16 and AlexNet) to classify the categories of product images against our baseline model, a linear classification model (the SVM model).

Table 1 shows our best accuracies on the training, validation, and test data for these three models. For the SVM model, we use learning rate 0.0005 and regularization coefficient 0.001. For AlexNet, we use mini-batch size 128, regularization coefficient 0.01, learning rate 0.001, and dropout 0.6^2. For the VGG model, we use mini-batch size 100, regularization coefficient 0.01, learning rate 0.0007, and dropout 0.5. We can see from the results that our models suffer from over-fitting: the training accuracies of both the AlexNet model and the VGG model are almost 1.5 times their validation and test accuracies. That is why we need relatively high regularization coefficients (0.01); but as we increase the regularization coefficient further, the test accuracy does not go up any more. For the same reason, the dropout coefficients we choose for these two models are also relatively large (0.6 and 0.5 respectively). However, we still suffer from over-fitting to some extent.

^2 A dropout coefficient of 0.6 in our model means that at each layer, a 0.4 fraction of the neurons' values are set to 0. We denote 0.4 in this example as the dropout fraction. This applies to all the other dropout coefficients in our report.

Model            Training accuracy   Validation accuracy   Test accuracy
SVM (baseline)   0.2616              0.1807                0.2679
AlexNet          0.6484              0.4064                0.3946
VGG              0.8769              0.5110                0.5010

Table 1. Model accuracy results of AlexNet and VGG compared with the baseline.

5.4. Recommendation

For the recommendation task, we trained the neural network model described in Section 3.3 and evaluate it using RMSE. Table 2 shows our best errors on the training, validation, and test data. The model is trained with regularization coefficient 0.02, learning rate 0.001, and dropout coefficient 0.9. To avoid over-fitting, we add the regularization term and choose a relatively large coefficient. Here the dropout fraction is relatively small (0.1) because the number of images within one category is limited, unlike in the classification task, where we have image data from all 20 categories.

Training error   Validation error   Test error
0.1318           0.1448             0.1524

Table 2. RMSE of the neural network model for recommendation.

There are no baseline similarity scores for the two metrics L2 distance and cosine distance, so we cannot provide an RMSE for them.

Not many e-commerce platforms offer the feature of searching for items by image. So far, we have found that the Amazon mobile App has such a feature, so we compared our recommendation results with theirs. Figure 8 shows two examples of the outputs of our recommendation system and of the Amazon App. We use the VGG model to predict the category and extract features. Our recommendation is based on the cosine similarity score, since we found that it outperforms the L2 similarity. The left column shows the input images that the user took. The middle column shows the top four similar products that our system recommends. The right column shows the top four similar products that the Amazon App recommends.

Figure 8. Examples of our recommendation system results, compared with Amazon mobile App search results.

As we can see from the example results, one of the input images is a mug. Our model recommends similar mugs or a mug-shaped pot, while the Amazon App recommends either mugs or bottles. Results like this example show similar recommendations between our model and Amazon's. The other input image is a laptop. Our model suggests that we might be searching for a Mac laptop (shown in the first two images), while the Amazon App mostly recommends keyboards and keyboard protectors. Results like this example imply that our model may understand the content of images better.

6. Conclusion

In this project we build a smart shopping recommender for image search. We tried out different neural network models for image classification and different ways to quantify the similarity between two images. We are able to achieve a classification accuracy of 0.5 and to recommend products with similarity scores higher than 0.5. There is an over-fitting issue in our models, which can be one of the things to address in future work.

As shown in the Dataset and Features section, though we have a huge dataset, due to the limitations on time and machine memory we only used 10,000 out of 3.5 million images. As a next step, we can try to train our model on a larger amount of data using batches. This can potentially increase the accuracy of the model.

Currently we only use 20 categories for classification. However, products within a category vary a lot, which partly explains our low classification accuracy. We would try to find more specific category information and train our model on it. Besides, we would also like to try deeper neural networks such as ResNet.

7. Appendices

7.1. UI

We built a user-interactive App for our recommendation system. In this UI, we can upload images or take photos and get the recommendations from our models. Figure 9 shows the interface of the UI given a watch image as an example.

Figure 9. The interface of the UI, given a watch image as an example.

References

[1] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4):98, 2015.

[2] T. Deselaers and V. Ferrari. Visual and semantic similarity in ImageNet. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1777–1784. IEEE, 2011.

[3] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.

[4] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517. International World Wide Web Conferences Steering Committee, 2016.

[5] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan. Large scale visual recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1925–1934. ACM, 2014.

[6] I. Kanellopoulos and G. Wilkinson. Strategies and best practice for neural network image classification. International Journal of Remote Sensing, 18(4):711–725, 1997.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[8] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM, 2015.

[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[10] M. Tan, S.-P. Yuan, and Y.-X. Su. A learning-based approach to text image retrieval: using CNN features and improved similarity metrics. arXiv preprint arXiv:1703.08013, 2017.

[11] G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus. Learning invariance through imitation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2729–2736. IEEE, 2011.

[12] G. Wang, D. Hoiem, and D. Forsyth. Learning image similarity from Flickr groups using stochastic intersection kernel machines. In Computer Vision, 2009 IEEE 12th International Conference on, pages 428–435. IEEE, 2009.

[13] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.

[14] J. Yang, J. Fan, D. Hubball, Y. Gao, H. Luo, W. Ribarsky, and M. Ward. Semantic image browser: Bridging information visualization with automated intelligent image analysis. In Visual Analytics Science and Technology, 2006 IEEE Symposium on, pages 191–198. IEEE, 2006.

[15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
