Product Recommendations in E-Commerce Systems Using Content-Based Clustering and Collaborative Filtering
Linda Hansson
atp10lh1@student.lu.se
September 7, 2015
The algorithm is tested on real product and purchase data from two differ-
ent companies - a big online book store and a smaller online clothing store.
It is evaluated both for functionality as a backfiller to other algorithms and as
a strong individual algorithm. The evaluation mainly looks at the number of
purchases as metric but also uses accuracy and recall as evaluation metrics.
The algorithm shows some promise for using it as an individual algorithm.
I would like to thank the following people who have contributed to this work in one way
or another:
Mikael Hammar at Apptus for his advice, feedback and many good discussions.
Edward Blurock and Bengt Nilsson from MAH for participation in the above discussions.
I would also like to thank Edward Blurock for helping me become better at generalising.
Björn Brodén at Apptus for his help with understanding Apptus’ system, with code for extracting information from Apptus’ system, and with making the baselines. I would also like to thank the rest of the people at Apptus for their help and for making me feel welcome.
Erik Lindström at LTH for taking on the supervisor role for this thesis even though this is
not his field of study and for providing stable support in the background.
I would also like to note that this research is part of the project “20130185 Förbättrade sök och rekommendationer för e-Handel” (Improved Search and Recommendations for E-Commerce), funded by a grant from the KK-foundation; see http://www.kks.se.
Contents
1 Introduction
1.1 Introduction to Recommender Systems
1.2 Purpose
1.3 Problem Definition
1.3.1 Aim of thesis
1.3.2 Limitations and Important Contributions
1.4 Definitions
2 Background
2.1 Collaborative Filters
2.1.1 Common Methods
2.1.2 Groupings
2.2 Content-based Filters
2.3 Hybrid Filters
2.4 K-Means++ Clustering
2.4.1 Advantages and Disadvantages
2.5 Related Work
3 Approach
3.1 Clustering on specific attributes
3.1.1 The Apache Commons Math Machine Learning Library
3.1.2 The Overall Algorithm
3.2 General Design
3.2.1 First Design
3.2.2 Working Design
4 Evaluation
4.1 Evaluation Metrics
4.2 Data sets
4.3 Apptus' Test Framework
4.4 Baselines
4.5 Test Methodology
4.6 Results
4.6.1 Interpreting Test Names
4.6.2 Results from Attribute Specific Clustering
4.6.3 Results for Company A
4.6.4 Results for Company B
5 Discussion
5.1 Discussion of Results
5.2 Future Work
6 Conclusions
Bibliography
Chapter 1
Introduction
Over the last few years, e-commerce has been growing. In 2013, the total turnover for e-commerce in Europe increased by 17% compared to the year before¹, and large companies can have hundreds of thousands of products or more to choose from on their websites. Both the customer and the company want the customer to easily find relevant products, both during search and when simply browsing, and this is where recommender systems come into the picture.
Some of the recommenders use only behavioural data, e.g. ratings, purchases, clicks and
so on, to predict which items a customer would like. When making predictions with little or no data about the users or items, these kinds of recommenders usually cannot give good, if any, recommendations. This is known as the cold-start problem [Meyer, 2012].
There are different kinds of systems and not everyone agrees on what classification should
be used, but the most common classification divides all systems into one of three cate-
gories: Collaborative filters (CF), Content-based filters (CBF) or Hybrid filters
[Meyer, 2012][Leander, 2014]. In [Meyer, 2012] a new classification is suggested that is
based on the utility rather than the kind of information used by the system. The utilities
are the following:
¹ Emota’s "E-Commerce and Distance Selling in Europe Report 2014/15", p. 6.
• Help to Decide
• Help to Explore
• Help to Compare
• Help to Discover
Amazon is one e-commerce site that provides personal product recommendations. Ac-
cording to [Linden et al., 2003], they use item-item collaborative filtering in their recom-
mendation system, i.e., they use behavioural data to make recommendations in the form
of, for example, "people who bought this also bought that". [Nageswara and Talwar, 2008]
mention that Amazon uses a hybrid recommendation system, i.e. a system that uses both
content data and behavioural data. These will be discussed in section 2.3 Hybrid Filters.
1.2 Purpose
Most sites today use heuristic methods, or algorithms that use behavioural data in other ways, in their recommender systems [Aldrich, 2015]. However, it could be of interest to ex-
plore other approaches where other methods, such as hybrid systems using both product
data and behavioural data, are used to recommend products, as methods only using be-
havioural data usually have trouble recommending products when they don’t have enough
data [Meyer, 2012].
Also, much of the research done in the field of hybrid recommender systems, such as
[Su and Khoshgoftaar, 2009][Kim et al., 2006][Tso and Schmidt-Thieme, 2006], has fo-
cused on ratings (either as the end goal or as a step to make recommendations) instead
of direct top-n recommendations, which is what this project focuses on. The project also uses real data, which means problems will be encountered that would not arise with manufactured data, and it has the advantage of using Apptus' test framework (this will be
presented in the Evaluation chapter). Finally, one of the project goals is to be able to gen-
eralise the procedure created during the project enough to work on different sites without
much configuration.
from the best clusters, given a product that belongs to a specific cluster.
The model should be general enough to work on a diverse set of sites, e.g. sites for clothing and for books, and sites with either large or small amounts of traffic.
This leads to the following research question: Given a product described by a set of attribute-value pairs, is it possible that using a clustering algorithm (with the attribute-value pairs to compute distances) together with behavioural data (to find the best related clusters and the best items to pick from those clusters) will lead to good product recommendations? By good recommendations we mean that the algorithm recommends the products that the customer eventually buys. In our case this will be tested in an offline setting, i.e., it will be tested on data generated from online stores, but it will not be able to actually influence the customers as it would if the algorithm were tested online.
While the work on the original algorithm, specific for one of the companies and testing
clustering for two specific attributes, is my own work, the generalised version that extends
this was designed jointly with the assistant supervisor for this thesis, Edward Blurock. He
is also part of the bigger project this thesis is part of.
1.4 Definitions
Some of the vocabulary used in this report can be seen as ambiguous, so this section explains what these words mean in this report.
An attribute has an attribute name, such as colour or title, and one or several attribute val-
ues, for example blue, “The Wind in the Willows”, etc. An item or a product is assumed
to have several attributes, including an identifier. The identifier may also be referred to as
a key, with or without the word product or item in front of it, in this report.
The product catalogue contains a set of products. Depending on the settings, it may be
only products that have been bought before or the whole set. If nothing is said about it,
it is assumed to be the whole set. Items and products will be used interchangeably even
though item is a much broader concept.
eSales and Apptus' test framework will be used to refer to Apptus' system, which contains all sorts of useful functions and classes. For example, if a line says that eSales was used to get some information, that means one or several of the functions in Apptus' system were used.
The concept of a wrapper is used in this report. It is mainly used because Apache Commons Math™ uses one in the example for its KMeans++ implementation. A wrapper is essentially an object holding together related information in a packaged format, very much like a class in Java.
When talking about clusters, the word bucket will sometimes be used. A bucket is ba-
sically a wrapper for a cluster but sometimes it contains a little extra information. For this
report however, they can be treated as the same thing.
In the second step of the algorithm, we will use behavioural data to find connections,
or associations, between items. Those words will be used interchangeably in this report.
An item is connected to another item if they have been bought by the same person in the
same session. The more times they have been bought together, the stronger the connection
is.
Chapter 2
Background
This chapter will introduce the reader to some of the more theoretical background, includ-
ing the different kinds of systems, related work and the clustering techniques we will be
using in this project.
K-Nearest-Neighbour (kNN) finds the k nearest neighbours to an item or a user and uses information from the neighbours in some useful way; it can, for example, use the ratings of the neighbouring items to predict the rating of another item [Töscher et al., 2008].
In the simplest application, there are two classes an item can belong to, and the k nearest neighbours (which are already classified) decide which class a new item belongs to. The item is assigned to the class that most of its neighbours belong to; see the example in figure 2.1, where k is set to 3.
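As an illustration of this majority vote, a minimal sketch is given below. The point representation, label type and Euclidean distance are assumptions made for the example; this is not the implementation used later in the thesis.

```java
import java.util.*;

/** Toy two-class k-nearest-neighbour vote, as in the figure described above. */
class KnnSketch {
    static class LabelledPoint {
        final double[] features;
        final int label;
        LabelledPoint(double[] features, int label) { this.features = features; this.label = label; }
    }

    /** Classify 'query' by majority vote among its k nearest labelled neighbours. */
    static int classify(List<LabelledPoint> labelled, double[] query, int k) {
        List<LabelledPoint> sorted = new ArrayList<>(labelled);
        sorted.sort(Comparator.comparingDouble(p -> euclidean(p.features, query)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (LabelledPoint neighbour : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(neighbour.label, 1, Integer::sum);
        }
        // The class with the most votes among the k nearest neighbours wins.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```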
Matrix Factorisation is mainly used for predicting ratings of items and uses a matrix to
represent the ratings of items by users. The problem can be seen as fitting a linear model
to the user-item matrix Y so that Y ≈ UV where U and V are matrices that need to be
found [Wu, 2007]; see figure 2.2.
Matrix Factorisation is a latent factor model. For the user-item rating case this means
it uses the ratings to find latent factors for the users and items and builds the vectors of
matrices U and V on that. Latent factors can be easily interpretable or more abstract. Easily interpretable latent factors could, for example, be serious films vs. comedies, a measure of attributes such as how many action scenes or how much and what kind of music is in the film, or independent films vs. large-budget films [Bell et al., 2010].
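One common way to make the fit Y ≈ UV precise (a standard regularised least-squares formulation; the cited works may use variants of it) is to minimise, over U and V,

Σ over observed (u, i) of ( y_ui − u_u · v_i )² + λ ( ‖U‖² + ‖V‖² )

where u_u is the latent factor vector of user u (a row of U), v_i is the latent factor vector of item i (a column of V), and λ controls how strongly the factors are regularised towards zero to avoid overfitting the observed ratings.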
Matrix Factorisation is not adapted to item-item recommendations but in the case of rat-
ings it is usually slightly better than kNN [Meyer, 2012]. kNN methods used by themselves
have problems with scalability, as their computational time is O(n²), where n is the number
of items or users to be clustered. There are however ways to improve the scalability, e.g.
by using another clustering technique, such as kMeans, to do the clustering part of the
algorithm offline [Meyer, 2012].
2.1.2 Groupings
Collaborative filters are usually divided into two sub groups: memory-based and model-
based filters. The memory-based ones use the data directly while the model-based ones use
the data to build a model that can make predictions by exploiting the user-item interactions
[Jiang et al., 2013]. Model-based techniques include clustering models, Bayesian belief
nets and SVD-based models (including MF) [Su and Khoshgoftaar, 2009][Jiang et al., 2013].
Memory-based techniques include the neighbourhood-based collaborative filtering
algorithm [Su and Khoshgoftaar, 2009].
Collaborative filtering can also be divided into user-based and item-based filters
[Tso and Schmidt-Thieme, 2006]. The user-based techniques generally look at similarities between users, e.g. by computing similarity measures based on how the users have rated different items [Rendle et al., 2009]. The item-based techniques look at similarities between items [Sarwar et al., 2001].
For kNN-based techniques, item-item similarities usually give better results than user-user ones [Rendle et al., 2009] [Meyer, 2012][Hu et al., 2008].
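One widely used item-item similarity in this family is plain cosine similarity over the items' rating or purchase-count vectors. The sketch below is illustrative only; the cited systems may use adjusted cosine or other variants.

```java
/** Plain cosine similarity between two items, each represented as a vector of ratings
 *  (or purchase counts) indexed by user, with zero meaning "no interaction". */
class ItemSimilaritySketch {

    static double cosine(double[] itemA, double[] itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (int u = 0; u < itemA.length; u++) {
            dot   += itemA[u] * itemB[u];
            normA += itemA[u] * itemA[u];
            normB += itemB[u] * itemB[u];
        }
        if (normA == 0 || normB == 0) return 0;   // one of the items has no interactions at all
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```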
2.2 Content-based Filters
According to [Sarwar et al., 2001], content-based methods can increase the coverage of
a recommender, i.e. increase the number of different products that are recommended and
cover more of the product catalogue. However, since the similarity is based on the items’
similarity to the reference item, recommenders based on content-based filters tend to rec-
ommend items that are very similar to the reference one [Meyer, 2012] and miss connec-
tions such as if you are buying a mouse, you might want to buy a mouse pad or a keyboard
as well. The tendency to recommend very similar items is also called overspecialisation
[Meyer, 2012].
Given that content-based filters work on similarity between items, there is an assumption that customers are either interested in consuming items similar to those they have already consumed or placed in their cart (upselling), or interested in similar alternatives to the items they are currently considering (alternative recommendations).
Unlike collaborative filters, content-based filters do not suffer from the cold-start problem caused by lack of behavioural data, since they only use content data. However, they are very much dependent on the quality and structure of the content data, since that is all they use. If the content data has many missing or corrupted values, this will affect the filter.
• Feature augmentation – one model takes part of its input from the output (could be
several generated features) of another model
Meyer has reviewed Burke's classification and, building on it, instead proposes a classification with four types (or families, as he calls them); see [Meyer, 2012].
By using hybrid filters, one can decrease the impact of each individual filter's shortcomings [Adomavicius and Tuzhilin, 2005]; e.g., combining a collaborative filter and a content-based filter helps with the cold-start problem the CF suffers from and the overspecialisation the CBF suffers from. Several papers have also shown that the performance of hybrid filters is better than that of their individual counterparts [Adomavicius and Tuzhilin, 2005].
2.4 K-Means++ Clustering
K-Means is a non-hierarchical clustering method that given n points and an integer k will
group the n points into k clusters [Huang, 1998]. It can only handle numerical values (or
categorical values transformed into numerical ones) [Huang, 1998] and works by minimis-
ing the average squared distance between points in the same cluster [Arthur and Vassilvitskii, 2007],
which can also be seen as minimising the within groups sum of squared errors (WGSS)
[Huang, 1998]. We will from now on mostly use WGSS when referring to it.
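Written out for a clustering with clusters C_1, ..., C_k and centroids μ_1, ..., μ_k, the quantity being minimised is

WGSS = Σ_{j=1..k} Σ_{x ∈ C_j} ‖x − μ_j‖²

i.e. the sum, over all clusters, of the squared distances from each point to its cluster centroid.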
1. Assign initial centers, usually chosen uniformly at random from the points.
2. For each point, calculate distance to each center and assign it to the nearest one.
3. Recalculate the centers’ positions so they are the center of mass of all the points they
have been assigned
Repeat steps 2 and 3 until the positions of the centers no longer change. [Arthur and Vassilvitskii, 2007]
Figure 2.3 illustrates the algorithm for k = 3.
The algorithm's computational complexity is O(Tkn), where T is the number of iterations (of steps 2 and 3), k is the number of clusters and n is the number of points to cluster [Huang, 1998]. It has been proven that in each iteration the WGSS either decreases or stays the same [Arthur and Vassilvitskii, 2007].
[Arthur and Vassilvitskii, 2007] came up with an algorithm they decided to call k-Means++.
k-Means++ is like k-Means except it uses another technique to place the initial centers. The
first point is still chosen uniformly at random, but for the other k-1 centers, the following
defines the selection:
Let X be the set of points to cluster and D(x) be the distance from point x to the closest of the centers already chosen. Then, each round a center is selected, each point x ∈ X has probability D(x)² / Σ_{x′∈X} D(x′)² of being chosen [Arthur and Vassilvitskii, 2007].
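In this project the seeding is handled by the Apache Commons Math implementation, but as an illustration of the selection rule just described, a small sketch follows. Plain double[] points and a squared Euclidean distance are assumed for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** k-Means++ seeding: the first centre is chosen uniformly at random; every further centre
 *  is chosen with probability proportional to D(x)^2, the squared distance from point x to
 *  the nearest centre picked so far. */
class KMeansPlusPlusSeeding {

    static List<double[]> seedCentres(List<double[]> points, int k, Random rnd) {
        List<double[]> centres = new ArrayList<>();
        centres.add(points.get(rnd.nextInt(points.size())));
        while (centres.size() < k) {
            // D(x)^2 for every point, given the centres chosen so far.
            double[] d2 = new double[points.size()];
            double total = 0;
            for (int i = 0; i < points.size(); i++) {
                double best = Double.MAX_VALUE;
                for (double[] c : centres) best = Math.min(best, squaredDistance(points.get(i), c));
                d2[i] = best;
                total += best;
            }
            // Cumulative sampling: pick index i with probability d2[i] / total.
            double r = rnd.nextDouble() * total;
            int chosen = 0;
            double acc = d2[0];
            while (acc < r && chosen < points.size() - 1) {
                chosen++;
                acc += d2[chosen];
            }
            centres.add(points.get(chosen));
        }
        return centres;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }
}
```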
The disadvantages are that the minimum it finds is only a local minimum [Huang, 1998]
and it can generate bad clusterings, i.e. its accuracy is not that good compared to other
algorithms for creating clusters [Arthur and Vassilvitskii, 2007]. Another problem is that
the algorithm needs to be given the number of clusters it should divide the points into; it cannot compute the best possible value for k by itself.
2.5 Related Work
[Gantner et al., 2010] implemented a hybrid using implicit feedback and content data, where
the content data was used to alleviate the cold-start problem. The implicit feedback was
used in their matrix factorisation based on Bayesian Personalised Ranking and then the
content data was mapped to latent factors for the items/users that didn’t have implicit feed-
back data. The existing latent factors were used to find the mappings needed to compute
the factors from the content data.
SSLIM [Ning and Karypis, 2012], a "set of sparse linear methods with side information", utilizes purchase information and item information to recommend top-N items. This is an extension
of their earlier work on a method called SLIM. SLIM uses a customer’s previous purchase
history to get recommendation scores for all items that have not been purchased by that
customer, using an estimated aggregation coefficient matrix. The new set of methods uses
the side information in different ways when estimating the aggregation coefficient matrix
and showed improvements compared to SLIM.
SPrank (Semantic Path-based ranking) was the first hybrid system to use implicit feed-
back and content-based filtering utilizing the Web of Data to recommend top-N items
[Ostuni et al., 2013]. In SPrank, both the Linked Open Data and the data model for the CF
algorithm are seen as graphs and they are merged to create one hybrid algorithm utilizing
both the links between content data and the implicit feedback [Ostuni et al., 2013].
[Peska and Vojtas, 2014] did offline experiments on a travel agency data set. They implemented Forbes & Zhu's content-boosted factorization method and used implicit feedback (among other things, visited pages) and attribute information as input to the matrices.
[Hu et al., 2008] looked at using implicit feedback in item-item kNN and SVD to rec-
ommend TV shows, but did not use any content-based information, thus making it a pure
CF recommender system.
In [Leander, 2014], implicit feedback in the form of clicks and purchases is used to find
links between product properties in an online clothing store. The work focused on getting
both high accuracy and high coverage and the results indicated that it was possible to get
good coverage without affecting the accuracy too much.
In Fab, a hybrid web page recommender system described by [Balabanović and Shoham, 1997],
content-based filtering is used to collect web pages that are related to one or several topics
chosen by users. Collaborative filtering is then used on the individual user level to filter
out pages already seen and limit the number of pages from the same site in the same rec-
ommendation batch, but also to forward pages rated highly by similar users.
[Oard and Kim, 1998] were one of the first to write about a general approach for how im-
plicit feedback could be used for recommendations and built on Nichols' work on implicit
sources. Others before them who looked at implicit feedback in recommender systems
using specific sources include Karlgren, Morita & Shinoda, Konstan et al and Rucker &
Polanco. [Oard and Kim, 1998].
Later contributions to the field using implicit feedback include [Rendle et al., 2009] where
they use Bayesian Personalized Ranking, using clicks, to determine the ranking of items.
Rendle and Freudenthaler continued this work and in [Rendle and Freudenthaler, 2014]
proposed using non-uniform samplers to speed up the convergence of the Bayesian Per-
sonalized Ranker. Something to keep in mind, however, is that there is a big difference
between using clicks and payments to determine rankings as payments are expensive for a
user while clicks are not. One could make an analogy to how players' behaviour changes when playing poker with real money instead of play money.
Chapter 3
Approach
In this chapter we will present the approach we took to implement the hybrid system.
In the first section, we will describe how we made a solution specific to one of Apptus’
customers using only two specific attributes to do the clustering. Next we will describe
how we decided to design our general system and the bumps in the road on the way there.
There are two different types of hierarchical clustering (agglomerative and divisive). The
type determines in which order the clusters are made but what they have in common is that
they use a distance function to determine how to form the next cluster and the end result is
a hierarchy of clusters of different sizes. Hierarchical clustering was ruled out because we
wanted to be able to use different attributes on different levels and not just combine them
in one metric.
DBSCAN creates clusters by picking points at random and checking whether their neighbourhood (whose radius is specified by the user) contains enough points (also specified by the user) to create a cluster. If so, the neighbourhoods of those neighbouring points are also included in the cluster, provided they in turn contain enough points. Otherwise, the point is temporarily labelled as an outlier (or noise) and another point is picked. The temporary outliers that have not been put in a cluster once all points have been visited become permanent outliers.
DBSCAN would have been a good alternative if it weren’t for the fact that it leaves outliers
and we wanted the algorithm to cluster all the points we sent into it. One big advantage to
using DBSCAN would have been that we wouldn’t have to specify the number of clusters.
In the end our choice fell on k-Means (or rather k-Means++, a more efficient version of it),
partly because it is well known and partly because it would be easy to explain the work-
ings of it to Apptus and their customers. Another relevant point for using k-Means was
that other people in the research project wanted to do fuzzy clustering at a later point in
their project. The advantages and disadvantages of the algorithm are explained in section 2.4.
We looked at a couple of different options for working with the algorithm, among them R packages that implement k-Means, e.g. flexclust, but decided against using R, primarily because hardly anyone in the research project besides me had worked in R before. In the end we decided to use the Apache Commons Math Machine Learning Library since it uses
Java and that was the common language shared by the members of the research project
and the language most of Apptus’ system is written in.
We will spare you the trials and tribulations we went through to parse the big product
catalogue (at this point we weren’t working directly in Apptus’ system yet) and build up
an item representation that was much more memory efficient than our first attempts. Suf-
fice it to say that once we were done parsing and storing the items we had a much better
knowledge of how to read and write XML files and how a generic, memory-efficient system for storing items could be built. The system was built around a central object containing two ArrayList objects, one for the attribute values and one for the attribute names; each item object then referred to its values by storing the integers corresponding to the values' indices in the relevant list.
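A minimal sketch of such a dictionary-encoded representation is given below; the class and method names here are hypothetical, not the ones used in the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Central dictionary: attribute names and attribute values are stored once,
 *  and every item only keeps integer indices into these lists. */
class AttributeDictionary {
    final List<String> attributeNames = new ArrayList<>();
    final List<String> attributeValues = new ArrayList<>();

    int indexOfName(String name)   { return internIndex(attributeNames, name); }
    int indexOfValue(String value) { return internIndex(attributeValues, value); }

    private static int internIndex(List<String> list, String s) {
        int i = list.indexOf(s);          // linear scan; a HashMap would be used in practice
        if (i >= 0) return i;
        list.add(s);
        return list.size() - 1;
    }
}

/** An item stores (name index, value index) pairs instead of the strings themselves. */
class CompactItem {
    final String key;                      // product identifier
    final List<int[]> attributePairs = new ArrayList<>();

    CompactItem(String key) { this.key = key; }

    void add(AttributeDictionary dict, String name, String value) {
        attributePairs.add(new int[] { dict.indexOfName(name), dict.indexOfValue(value) });
    }
}
```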
The Apache Commons Math Machine Learning Library also provides implementations of fuzzy k-Means clustering, DBSCAN and Multi-k-Means++ clustering. Multi-
k-Means++ clustering is basically like k-Means++ clustering but it does the clustering
several times and has an evaluator that decides which of the clusterings was the best and
returns that.
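For orientation, a minimal usage sketch of the library's k-Means++ clusterer follows. It uses the commons-math3 API; the feature vectors, the value of k and the iteration cap are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

class ClusteringSketch {
    static List<CentroidCluster<DoublePoint>> cluster(List<double[]> featureVectors, int k) {
        // Wrap each feature vector in the library's Clusterable wrapper.
        List<DoublePoint> points = new ArrayList<>();
        for (double[] v : featureVectors) points.add(new DoublePoint(v));

        // k clusters, at most 100 iterations (an arbitrary cap chosen for this sketch).
        KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(k, 100);
        return clusterer.cluster(points);
    }
}
```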
The clustering itself is also done in two steps. First, clustering is done on categorical Attribute A. This entails the following:
1. All items that have more than one value of a certain type for this attribute are selected and clustered by Apache's algorithm.
2. The products that weren't part of the clustering but have one value of the right type in common with any of the clustered items are placed in a cluster by classifying them (more on classifying items later).
3. The products that did not have any values of the right type in common with the clustered ones are gone through separately, and clusters are created for them. For each such product, the algorithm checks if a cluster has already been created with the same value of the right type; if so, it puts the product in that cluster. Otherwise, it creates a new cluster, puts the product in it and adds the cluster to the separate group of clusters.
4. The two cluster groups are put together into one group.
In the second step the clusters that have more members than a predetermined threshold,
in this case 50, are selected for a second round of clustering using numerical Attribute
E. Those clusters are removed from the group of clusters and replaced with the clusters
generated during this second clustering on the numerical attribute. We chose to set the
threshold at 50 because it seemed a good size to stop at. With a smaller threshold we would have run the risk of making the clusters too small to recommend items from within them, while setting the threshold too high would have left too large a selection to choose from.
The clustering on the numerical attribute does not use k-Means++ but instead, for each
cluster, creates a definition of buckets (low and high limits) and then places each item in
the cluster into a bucket based on its value for the numerical attribute. The items that
don’t have a value for the attribute end up in a separate cluster with “NO_PRICE” ap-
pended to its id. Attribute E only has one value so no handling of several values is needed.
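As an illustration of the bucketing just described, limits could be derived by sorting the cluster's values and cutting them into groups of roughly the target size. The sketch below is an assumption about one way to do this, not the project's exact code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustration only: derive bucket limits (low, high) for a numerical attribute so that each
 *  bucket holds roughly targetPerBucket items; items lacking a value would instead go to a
 *  separate "NO_PRICE" cluster. */
class BucketDefinitionSketch {

    static List<double[]> bucketLimits(List<Double> values, int targetPerBucket) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<double[]> limits = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += targetPerBucket) {
            int end = Math.min(i + targetPerBucket, sorted.size()) - 1;
            limits.add(new double[] { sorted.get(i), sorted.get(end) });   // low and high limit
        }
        return limits;
    }

    /** Index of the first bucket whose high limit covers the value (last bucket as fallback). */
    static int bucketIndex(List<double[]> limits, double value) {
        for (int i = 0; i < limits.size(); i++) {
            if (value <= limits.get(i)[1]) return i;
        }
        return limits.size() - 1;
    }
}
```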
The reason why only values of a certain type were chosen is that, when we did some clustering with this attribute at the beginning of the project, we found that we got less noise that way: the algorithm found more clusters that made sense instead of many clusters with
Some thought was put into how k should be chosen for the k-Means++ algorithm. In the early stages, our plan was to cluster with several different values of k, calculate the within groups sum of squares and then plot that to determine which k to use, or calculate it from the within groups sum of squares without having to plot it. We implemented the calculation of the within groups sum of squares, but it slowed down the program a lot (it more than doubled the running time), so we decided not to use that. Instead we fell back on a simple rule of thumb: k = √(n/2), where n is the number of items to cluster [Mardia et al., 1979].
Two classifiers were built to be able to classify items - one for just clustering on Attribute
A and one extended to handle clustering on both Attribute A and Attribute E. What they
both do first is use the distance measure of Apache’s algorithm to calculate distances from
the item to each cluster and then sort those distances together with the cluster ids in a list
in increasing order. Then the classifiers check if there is a tie for smallest distance. It is
how the classifiers handle ties that sets them apart.
The one-level classifier (handling only clustering on Attribute A) picks one of the tied
clusters at random and returns the id of that cluster. The other classifier begins by check-
ing if the item has a value for Attribute E. If it doesn't, it checks if there are any tied clusters with "NO_PRICE" appended to their id and in that case chooses that one; other-
wise it chooses one at random.
If the item does have a value for the attribute, the classifier finds its bucket index, then goes through all the tied clusters and checks if they are sub-clustered (done by checking if they are instances of a subclass of the cluster class used otherwise). If they aren't, the process stops and one of the tied clusters is chosen at random. If they are, the bucket index difference between the item and each cluster is calculated and the ids of the clusters with the lowest difference are kept. Once every cluster has been gone through, if there is only one cluster with the lowest difference, its cluster id is returned. Otherwise one of the ids with the lowest difference is chosen at random.
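A sketch of the one-level classification step (nearest centroid with random tie-breaking) is shown below. The cluster representation is hypothetical; only the DistanceMeasure interface is taken from the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.commons.math3.ml.distance.DistanceMeasure;

/** One-level classifier sketch: assign an item to the cluster with the nearest centroid
 *  according to the clusterer's distance measure, breaking ties at random. */
class OneLevelClassifierSketch {
    static class ClusterInfo {
        final String id;
        final double[] centroid;
        ClusterInfo(String id, double[] centroid) { this.id = id; this.centroid = centroid; }
    }

    static String classify(double[] itemVector, List<ClusterInfo> clusters,
                           DistanceMeasure measure, Random rnd) {
        double best = Double.MAX_VALUE;
        List<ClusterInfo> tied = new ArrayList<>();
        for (ClusterInfo c : clusters) {
            double d = measure.compute(itemVector, c.centroid);
            if (d < best) { best = d; tied.clear(); tied.add(c); }
            else if (d == best) { tied.add(c); }
        }
        // Tie for the smallest distance: pick one of the tied clusters at random.
        return tied.get(rnd.nextInt(tied.size())).id;
    }
}
```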
The classifiers were tested by classifying all the items to see how often they gave back the
correct cluster id. The one-level classifier had a misclassification rate of approximately
33%. A closer inspection showed that the items that were misclassified all belonged to the
same cluster and that this cluster was a really big one with a centroid that had many small
frequencies. They were misclassified into clusters that had one or two value frequencies in
their center or no value frequencies. As we thought this might actually be a better classifi-
cation for them than the big cluster we decided to keep it this way. The other classifier had a
misclassification rate of approximately 28%. Considering that it had the same sort of mis-
classification errors as the first classifier we were happy with the level of misclassifications.
Each level of clustering is in most cases done only once, but for some attributes in the
general implementation, where purchase data is used in the clustering, it is done more of-
ten. The recommendation part of the algorithm is called every time a recommendation is
requested based on a viewed item. First the item viewed will be classified into one of the
clusters and then that information will be used in what I have chosen to call a connection
algorithm. We have implemented three different connection algorithms (described below)
that will use purchase data to find connections and from that recommend items.
The reason why we are doing these two parts is that the clustering is content-based and
if we were to only recommend from within the cluster we would get the problem of over-
specialisation and miss highly connected items that are not similar in content. This is why
we have the connection algorithms. The reason why we are doing the clustering is to be
able to give recommendations for items that have not been bought together with anything
or with very few items. This is where a classic collaborative filter would come up short.
Connection Algorithm 1
The first connection algorithm does not really live up to its name as it just takes the clus-
ter an item was classified into and recommends the top-n from the same cluster. It is
a seemingly simple algorithm that does not require any calculations of sales figures be-
tween clusters. However, in the implementations done by Apptus to get top-n of different
kinds (e.g. top seller, revenue-based, etc.), complex factors such as ageing are used to give a
better result than one would get if simply giving each sale the same importance. As men-
tioned, it can use several implementations, among those top-n based on revenue, sales, etc.
For our implementation, it uses the number of sales of each item to determine top-n. This
algorithm might later in the report also be referred to as the within-cluster algorithm.
Connection Algorithm 2
The second connection algorithm is the most complicated. Like the first one, it takes the cluster an item was classified into as its input. It then looks at each item in the cluster, calculates its connections to other items in other clusters and adds those figures together. In the end, this gives us the items in the product catalogue that are overall most connected to the items of the cluster, and these are used as the recommended products. Just as
in algorithm 1 our implementation uses number of sales to decide how strong a connection
is. This algorithm also allows connections between items in the same cluster so the items
in the cluster the original item was classified into are not ruled out as candidates. This
algorithm might later also be referred to as the cluster-to-item algorithm.
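A sketch of this cluster-to-item idea follows, with a hypothetical co-purchase-count lookup standing in for the sales data that would really be obtained through Apptus' system.

```java
import java.util.*;

/** Cluster-to-item connection sketch (Connection Algorithm 2): sum, over all items in the
 *  source cluster, how often each catalogue item was bought together with them, and return
 *  the n items with the highest totals. */
class ClusterToItemSketch {
    interface CoPurchaseCounts {
        int count(String itemA, String itemB);   // times the two items were bought in the same session
    }

    static List<String> recommend(Collection<String> clusterItems, Collection<String> catalogue,
                                  CoPurchaseCounts counts, int n) {
        Map<String, Integer> score = new HashMap<>();
        for (String candidate : catalogue) {
            int total = 0;
            for (String member : clusterItems) {
                if (!candidate.equals(member)) total += counts.count(member, candidate);
            }
            if (total > 0) score.put(candidate, total);
        }
        // Rank candidates by their total connection strength and keep the top n.
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort((a, b) -> Integer.compare(score.get(b), score.get(a)));
        return ranked.subList(0, Math.min(n, ranked.size()));
    }
}
```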
Connection Algorithm 3
The third algorithm is similar to the second one but instead of looking at cluster-to-item
we are looking at cluster-to-cluster. Just as before, all items in the cluster an item was
classified into are gone through, but now we are looking at how strong their connection is to each cluster, i.e., how many products in each cluster they are connected to and how strongly. In our implementation, the strength of a connection is again decided by how many times the items were bought together. The figures from each item are then added up and top-n is computed among the items in the cluster with the strongest overall connection. Top-n
is in our implementation based on number of sales. This algorithm might later be referred
to as the cluster-to-cluster-algorithm.
A variant of this algorithm where several strongly connected clusters were looked at and
the strength of the connections and sales figures decided top-n was also considered but not
implemented in this project because of time constraints.
As mentioned earlier, the overall algorithm with clustering and then finding connections is
the same in the general design. To make the algorithm more general, however, the way we
did the clustering needed to change. To continue using the implementation of kMeans++
we had used so far we had to come up with a better solution for storing an item’s values in
a double vector. Unlike before when we only had two attributes, we were now dealing with
an unknown number of attributes with an unknown number of possible values for them.
This meant we couldn’t just create one long double vector with some variable to keep track
of which part of the vector belonged to which attribute, because that would have quickly
caused the system to run out of memory.
However, we still had to solve the problem of how to store values from several attributes
in one double vector without taking up too much space. The plan was to keep the way that
the item information was stored in the item object (referencing attribute names values by
indices to save space) and simply change how the wrappers were created.
Our idea was to let each vector element represent one attribute and encode the values
in the double value stored there. Apache's algorithm allows custom implementations of the class that calculates the distance, provided that the method takes two double vectors
as input and returns a double value. This meant we could send in two double vectors that
had encoded values, decode them and then use e.g. d = 1 − J, where J is the Jaccard
similarity coefficient, as distance measure and incorporate the weights into the measure.
In mathematical notation, the Jaccard similarity coefficient can be written as J = |A ∩ B| / |A ∪ B|, where A and B are sets of values.
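As an illustration of plugging such a measure into the library, a sketch implementing the DistanceMeasure interface of commons-math3 is shown below. The decode step is a placeholder for the project's own encoding, which is discussed next.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.math3.ml.distance.DistanceMeasure;

/** Distance d = 1 - J between two encoded attribute vectors, where J is the Jaccard
 *  similarity of the decoded value sets. How the doubles encode value indices is left
 *  abstract; decode(...) is a stand-in for the project's own encoding scheme. */
class JaccardDistance implements DistanceMeasure {

    @Override
    public double compute(double[] a, double[] b) {
        Set<Integer> setA = decode(a);
        Set<Integer> setB = decode(b);
        if (setA.isEmpty() && setB.isEmpty()) return 0.0;
        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);
        double jaccard = (double) intersection.size() / union.size();
        return 1.0 - jaccard;
    }

    /** Placeholder: turn one encoded vector back into the set of value indices it represents. */
    private Set<Integer> decode(double[] encoded) {
        Set<Integer> values = new HashSet<>();
        for (double v : encoded) values.add((int) v);   // naive stand-in for the real decoding
        return values;
    }
}
```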
The encoding was implemented using the indices that referenced the actual values and
padding each encoded attribute value with zeros at the beginning so all the values had the
same length and then they were concatenated into a string and transformed back into a
double. The double value also contained information about how many digits were used to
represent one value. We discovered pretty quickly when we ran the code that something
was wrong and found that not only had we forgotten to take into account that the algorithm
would make gibberish out of the centres’ encoding but we could not use the encoding at
all since it required too many digits to work as a double value.
Since we were undecided and our assistant supervisor Edward Blurock, who is also part of
Apptus’ research group, wanted us to go with the second alternative, we ended up choosing
that one. Because the design of the second option was mainly our assistant supervisor’s,
the implementation of this design was done together with him. Although we have added
on quite a few transform classes after the initial system implementation was done, most
of the system implementation was done together with our assistant supervisor and done in
such a way that there aren’t any parts of it we can take sole credit for.
In this section I will describe the design of the system and its advantages and limitations.
The system is quite complex as we decided to add on quite a bit of functionality and gen-
eralise it to a much greater extent than we would have previously expected to do.
Overview of Design
The original design from the clustering on specific attributes was largely kept but became
a much smaller part of the overall solution. In the general solution, we decided to make it
possible to do more levels of clustering, i.e. more than the two used before and also have
the choice of how the clustering should be made and which variables to use.
Each level of clustering is basically its own unit that sends back its results to a class that
keeps track of all the levels of clustering. This class is called ListOfLevelHierarchies. Get-
ting and transforming the item data into a suitable format for the clustering is done at each
level. The reason for this is that different attributes (as well as ways to transform the data)
and cluster techniques might be used at different levels.
ListOfLevelHierarchies keeps track of the levels, initialises them and handles the clus-
tering on several levels by sending on clusters to be sub-clustered to the right levels and
keeping all the cluster results in a tree structure. It is also the class that handles the top
level part of classifying an item into a cluster. The actual classification is done by the
classes contained in the tree structure holding the cluster results and is done in such a way
that it allows for fuzzy classification (used with fuzzy clustering) with an easy extension.
The Transform classes take care of transforming the item data into something a cluster
algorithm can use. Much of the code for transforming values of the same type as Attribute
A into data that could be used by Apache’s KMeans++ algorithm was kept and so was the
code for bucketing items with numerical data such as Attribute E. The code surrounding the k-Means++ algorithm, i.e. only using those wrappers that have several values in the initial clustering etc., was also kept, except in a more generalised form where it looked at what
conditions the user had specified for that level and attribute.
All the levels, attributes, transforms (to use to convert the item data into a clusterable
format) and conditions on the items are specified by the user in something we have chosen
to call a conditions file. At the start of the system, this conditions file will be processed and
the transforms initialised. Information about the different transforms and clustering tech-
niques implemented and how to configure the conditions file can be found in Appendix B.
The choice of k for the k-Means++ algorithm was mostly kept the same as before, i.e. k = √(n/2). However, for the second data set, where some of the categorical attributes
were found to have very few values compared to the number of items, we had to change
the choice of k slightly to avoid getting empty clusters. First of all, after each cluster level
is done, the system will check if any of the resulting clusters are empty ones and remove
those. Secondly, there is a mechanism in place when k is determined that checks the size
of k against the number of possible center choices. To simplify things, it assumes only
one value per attribute can be chosen. If k is bigger than the number of possible center
choices, k is set to that number instead. This also serves to speed up the clustering process
since the algorithm will spend less time computing the initial centres if there are fewer to
compute.
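In sketch form (the method name and the rounding are assumptions), the resulting choice of k is:

```java
/** Rule-of-thumb choice of k, capped by the number of possible distinct centre choices,
 *  mirroring the description above. */
static int chooseK(int itemCount, int possibleCentreChoices) {
    int k = (int) Math.round(Math.sqrt(itemCount / 2.0));
    return Math.max(1, Math.min(k, possibleCentreChoices));
}
```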
Advantages
The advantages of this system are that it is very general and it is possible to work on many
different data sets and choose whatever attributes you want, as long as Apptus' framework
can filter by it. In addition, any number of layers is possible as well as putting restrictions
on content and number of values an item should have for an attribute. It is also easily
extended to include new transforms and clustering techniques.
Limitations
One big limitation of this general system is that one needs to choose attributes and trans-
forms for it manually. Another is that the clustering can take time because of the number of vector elements used. Also, since the way the values are transformed requires a lot of space, restrictions such as only using the n most frequent values have to be enforced, which leads to loss of data.
Chapter 4
Evaluation
In this chapter we will describe how we evaluated our algorithms. The first section will
be about which evaluation metrics we have used and why. Then we will describe the
data sets we have used, how Apptus’ Test Framework works, our choice of baselines, test
methodology and finally, the results.
We found figure 4.1¹ to be very good for understanding precision and recall; see the footnote for the attribution of the figure. Precision can also be described as the fraction of the recommended items that were actually bought, while recall can be described as the fraction of all purchases that were recommended. We chose to use these two metrics because they are used by
Apptus in their test framework and also commonly used when evaluating how good an
algorithm is at recommending items [Leander, 2014].
However, since we are dealing with a ranking problem and not a classification problem,
Apptus has modified their versions of precision and recall. They have redefined recall and
¹ Attribution: By Walber (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons.
precision as follows:

precision = #hits / #displays

recall = #hits / #queries
Recall measures how many of the relevant queries had a hit. Here a hit means that
something that was recommended in response to a query was bought later in the session
and a relevant query is any query in a session where a payment was made. This means
that an unanswered query will impact the recall score negatively.
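For illustration only (made-up numbers): if 1,000 relevant queries each display five recommendations, there are 5,000 displays; if 400 of those queries lead to a recommended item being bought later in the session, there are 400 hits, giving precision = 400/5,000 = 0.08 and recall = 400/1,000 = 0.4.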
4.2 Data sets
There is no noticeable difference in sparseness between the two sets of event data, i.e. they both have the same number of purchases per number of event packets. One event packet is roughly one event, and an event can be a click, a display, a query, an add to cart or a payment. When running the tests, the test framework was configured to only include click and payment events.
Only sessions that led to a payment were included in the final data sets.
is the average of the performance of all these predictions. As [Cremonesi et al., 2008] says,
this can lead to an overfitting of the model. Finally, m-fold cross-validation divides a data
set into m separate folds and then runs the test m times, each time using a different fold as
the test set and the rest combined as the training set.
However, as [Shani and Gunawardana, 2009] emphasises, in order to get an accurate esti-
mation of performance when testing a system offline, one has to mimic the behaviour the
system will face when it goes online, including the data available. For our scenario, it is not realistic to have a training set with behavioural data and a test set that we are not allowed to use even after we have made predictions on it. In the real world of
e-commerce, data becomes available as people do various things on the site. This, among
other things, means that from the beginning, you don’t have any behavioural data. This
simulates the cold start problem. If a method uses only behavioural data, there are a few
options of what to do in a cold start scenario. One would be to not give any recommen-
dations and then start recommending the top sellers once you have some behavioural data
until your algorithm has enough data to function properly. Another option would be to
define a prior probability distribution where all products are equally likely to be recom-
mended and then adjust the probability distribution when more behavioural data becomes
available.
[Shani and Gunawardana, 2009] discuss evaluation of recommender algorithms, but not
any implementations of them. Apptus has implemented one of the evaluation methods
they discuss and describe as ideal, where the time provides the order of the data and all
data is test data from the beginning but becomes available for use in the prediction after it
has been used as test data.
This approach is beneficial for Apptus’ algorithms as it better reflects their ability to adapt
to new information and trends but at the same time actually better mimics reality, where
behavioural data is seldom divided into static training and testing sets. The only thing you
usually have available in the beginning is a set of content data and no behavioural data
and this is how Apptus’ Test Framework works. You can’t ’cheat’ by accessing any future
information but get it when it appears on the timeline, just like in an online setting.
The Apptus Test Framework will query your algorithm for a set of products to recom-
mend each time an item is clicked in the session data. The number of recommendations
for each click can be configured. We decided to generate top-5 recommendations for each
click in our tests. While the event packets are read and recommendations are made,
the recommendations and event ids are saved in a file. Once the program has run its course,
another program will take the generated recommendation files and analyse them against
what actually happened, e.g. which products were bought and which products were clicked
and displayed in the recommendation panel. It also calculates precision and recall, among
other things.
Apptus also have an algorithm, synergy, that combines results from different recommenders
in a smart way (we are not at liberty to explain how it works since it is a company secret).
We combine our algorithm with the best individual algorithm they use today within syn-
ergy to see how well our algorithm works as a backfiller, i.e. how much it improves the
overall performance. A good backfiller algorithm can also be seen as one whose good
recommendations do not overlap much with the other algorithms that are used.
4.4 Baselines
The baselines for this test were made by Björn Brodén at Apptus and are really simple but
at the same time manage to use both content-based data and behavioural data, which makes
the baselines into hybrid recommender systems. The individual baselines work as follows:
Given a click on a product and an attribute, it finds the attribute value/values for that prod-
uct. Then it uses Apptus’ system to find other products with one or several of the values in
common and uses behavioural data to get the top-n best sellers among those and returns
that.
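A sketch of such a baseline follows, with maps standing in (as assumptions) for the lookups that would really go through Apptus' system.

```java
import java.util.*;

/** Baseline sketch: given the clicked product and one attribute, recommend the top-n best
 *  sellers among the products sharing at least one value of that attribute with it. */
class AttributeBaselineSketch {

    static List<String> recommend(String clickedProduct,
                                  Map<String, Set<String>> valuesByProduct,  // product key -> values of the attribute
                                  Map<String, Integer> salesByProduct,       // product key -> number of sales
                                  int n) {
        Set<String> clickedValues = valuesByProduct.getOrDefault(clickedProduct, Collections.emptySet());
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : valuesByProduct.entrySet()) {
            if (entry.getKey().equals(clickedProduct)) continue;
            // Keep products that share at least one attribute value with the clicked product.
            if (!Collections.disjoint(entry.getValue(), clickedValues)) candidates.add(entry.getKey());
        }
        // Rank the candidates by sales and return the top n.
        candidates.sort((a, b) -> Integer.compare(salesByProduct.getOrDefault(b, 0),
                                                  salesByProduct.getOrDefault(a, 0)));
        return candidates.subList(0, Math.min(n, candidates.size()));
    }
}
```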
There is one baseline for every attribute we have used in our clustering, and to get comparisons with the backfiller functionality using Apptus' synergy, we have combined these baselines with Apptus' best individual algorithm within synergy.
4.5 Test Methodology
We were provided with data for Company A for a specific time frame and used all the
packets in there for the long tests. This turned out to be 32.2 million packets. For the short
tests the number of packets was 6.4 times smaller, i.e., 5 million packets. The reason for this factor is that we actually set the packet limit for the short tests first and
were going to have ten times as many packets in the big test, but there weren’t that many
packets in the data set. To be able to compare the results, we used the same number of
packets for Company B’s data set.
The short tests were done first and then nine to ten of the most promising attribute con-
figurations were chosen to do long tests on. We chose to skip some configurations if they
were very similar to ones already chosen and they had the same or worse results than them.
This was done in favour of trying other configurations because it felt more meaningful to
test a diverse set of configurations.
For the short tests, reference tests were run on each individual attribute to get a sense
of how much the individual attribute impacted the result, e.g. if clustering was done on
Attribute A followed by subclustering on Attribute B, we wanted to see if the subcluster-
ing made any difference or if the result was the same as just clustering on Attribute A.
In choosing attribute combinations, we had to limit our selection, as we only had a finite amount of time to run tests. We decided to only use one or two levels of clustering because we believe that the clusters would otherwise often be too small to be of much use.
When it came to choosing which attributes to use and how to combine them we arbitrarily
chose attribute combinations we thought might be good indicators together. Testing different weights for an attribute would have introduced even more combinations to test, and we did not have time for that, so we decided to simply split the weight evenly when
clustering more than one attribute on the same level. We also decided to limit each com-
bination to two attributes to minimise the number of configuration variants. Some limits
had to be set and these seemed like the best ones.
For some attributes, further choices needed to be made. For attributes with numerical
values, one had to choose the number of items wanted in each cluster. For the multi-level
clustering, this value was set to 40, 50 and 60 because that seemed like fair sizes to choose
products from. For same-level clustering, the number of vector elements that this choice
of values created together with some other attributes became too much for the system to
handle (i.e. it became extremely slow), so those values were changed to 4000, 5000 and
6000 for same-level clustering.
Other attributes simply had too many values for the system to handle without running
out of memory, so for those we had to cap the number of values used. The choice we
made here was to both use a figure near the maximum of what the system could handle
and some lower figures to compare with to see if it made any difference. From the first
data set we learned that this led to too many combinations (for one attribute combination
set there were 27 combinations) so for the second data set we decided to just stick with the
near maximum figure.
Another choice that had to be made for each configuration was whether any filters should be used on how many attribute values an item must have for the item to be included in the first stage of the clustering (only used in the KMeans++ clustering). Since this would
complicate things further, we decided not to use this feature, except for Attribute A in the first data set, since it had already been tested in the first tests on the specific attributes and found to give good results.
In the results for the specific attributes, we will see that the performance of Connection Algorithm 2 as an individual algorithm is much better than that of the others, and for this reason only that connection algorithm was used when testing the general algorithm on the first data set. For the second data set, more thorough testing was done: the first three short tests were run on all three connection algorithms and then compared, both with synergy and as individual algorithms, and it was found that Connection Algorithm 2 outperformed the other two, so from there on we only used that algorithm.
4.6 Results
In this section we present the results from our tests. First we will present the results from the clustering on two specific attributes on Company A's data set and talk about our choices of attributes and algorithms for the tests on the general algorithm. Following that, we will present the results first on Company A's set and then on Company B's. Discussion of the results will mainly be done in the next chapter, Discussion, with the exception of any parts needed to explain our choices for the test configurations.
Because of space restrictions in the table all non-essential information has been dropped,
including just using A instead of Attribute A and removing the tail explaining which con-
nection algorithm was used unless several of them were used in the tests.
When an underscore is used in the result name, the clustering was done on two levels and the attribute before the underscore was clustered on before the other one. If there is no underscore, clustering was done on one level with all the attributes present in the result name. After the attribute name, there may be a further specification of the filters that were used, such as FilterOutStopKeepN. This particular filter means that the N most common tokens were used in the clustering, but that the 20 most common tokens, so-called stop words, were first filtered out. If no stop words were filtered out, the filter is denoted FilterOutNoStopKeepN. For Attribute E, NbrInClusterN denotes the number of items per cluster that was specified to generate the bucket definition; see Appendix B, How To Configure the Conditions File. Most of the other filters are pretty self-explanatory.
When looking at the results from the long tests using Apptus’ synergy algorithm we
see that the recall and precision figures are pretty close to each other. For this reason and
because connection algorithm 2 performed so much better individually than the other two,
we chose to disregard the more marked difference in number of purchases when using the
synergy algorithm and decided to do further tests on Company A's data set using the system with connection algorithm 2. The reason we did not use all the connection algorithms in the next step of testing was that it would have taken too much time.
An important note, in comparison with the tests of the general implementation, is that here items were only selected for the initial clustering if they had at least 3 values of a specific type for Attribute A, compared to later tests where this was often at least 2 values instead. How many values were required will be made clear in each result section.
Short Tests
For the smaller number of packets, we ran a total of 72 tests for our own algorithms,
including the reference tests. In order not to clutter up the chapter with too many tables,
the ones with the results from the short tests have been put in Appendix A.
Long Tests
For the longer tests we decided to choose 10 of the most promising combinations from the short tests. We did not simply pick the top ten results, since some of the top combinations are quite similar and we wanted to give the other combinations a chance to get a better score on the longer tests. In Table 4.4, the results for our chosen combinations are shown. The corresponding results for the baselines are shown in Table 4.5.
Just as for the short tests, the results for our own algorithms and for the baselines were separately combined with Apptus’ best individual algorithm using synergy; these results are shown in Tables 4.6 and 4.7 respectively. Apptus’ best individual algorithm had a recall of 0.08, a precision of 0.152 and 2188 purchases for the long test.
Results for Company B
Short Tests
Just as for Company A’s data set, the results from the short tests are presented in Ap-
pendix A. The combinations with connection algorithm 2 outperformed both those with
connection algorithm 1 and connection algorithm 3 when looking at individual results so
we decided to use that one for the long tests.
Long Tests
Just as for Company A’s data set, we decided to choose 10 of the most promising combinations from the short tests. The results from running the longer test on these combinations are shown in Table 4.8. The results of the long tests on the baselines are presented in Table 4.9.
Table 4.8: Results from the long tests run for our own general
algorithm on Company B’s data set.
As for all the other tests, we combined our own results with Apptus’ individual algorithm using synergy; the results are shown in Table 4.10. The corresponding results for the baselines are presented in Table 4.11. Apptus’ best individual algorithm had a recall of 0.211, a precision of 0.082 and 6564 purchases for the long test.
Chapter 5
Discussion
In this chapter we will first discuss the results presented in the last chapter and look at pos-
sible sources of error. We will also briefly describe the positive and negative experiences
we have had while working on this thesis before we take a look at possible future work
related to the results and sources of error.
5.1 Discussion of Results
After the results from Company A’s data set, we were pleasantly surprised to note that on Company B’s data set most of the combinations of our algorithm, and most of the references, got better results than the corresponding baselines when looking at the individual algorithms. Most importantly, as can be seen when comparing Tables A.4 and A.5, the best individual algorithm was one of the combinations of our algorithm. When using our algorithm as a backfiller the results were not as good, which can be seen when comparing Tables A.6 and A.7. Here the best result with synergy is actually achieved by one of the baselines (the one with Attribute A), which indicates that for Company B’s data set our algorithm would work best as an individual algorithm rather than as a backfiller. Comparing the results in Tables 4.8 and 4.9, one can see that the results for the individual algorithms are very similar to those from the short tests. A comparison of Tables 4.10 and 4.11 shows that this holds true for the backfiller functionality as well.
Now, these look like very good results, but we have neglected to take into account the results for Apptus’ own best individual algorithm. Comparing our best combination for the short tests with Apptus’ best individual algorithm shows that ours has more than 100 purchases fewer, which means Apptus’ algorithm has about 12% more purchases than our best one. For the long tests, Apptus’ algorithm has slightly less than 1000 purchases more than our best combination, which translates to about 17% more purchases. Clearly, Apptus’ best individual algorithm improves its results relative to our best algorithm over time.
An important thing to note here is that these tests are only offline tests. That means they only measure how good the system is at recommending the products the customers actually bought, without its recommendations influencing them. In an online setting, using e.g. an A/B test, the recommendations might influence the customers and cause them to buy things they did not buy in the offline test. On the other hand, the customers would also not see the other recommendations they saw in the offline test, and that might cause them to buy fewer products or different products altogether.
Another important thing that we did not think of when we began testing is that Apptus’ best individual algorithm (or at least an earlier version of it) was used online during the time period Company B’s data set was generated from. This means that the algorithm has given recommendations to the customers and might have influenced their purchases. We think this is a very unfair advantage and introduces a bias into the data set that favours Apptus’ algorithm. There is no way to know how much the result for Apptus’ algorithm would have been affected if we had managed to find data where Apptus’ algorithm had not been active online, but this online factor should definitely be taken into account when comparing the results.
As we established earlier in the discussion, our algorithm does not seem cut out to work as a backfiller to Apptus’ best individual algorithm. This is probably because, instead of complementing each other’s recommendations, there is a big overlap in the recommendations they make. A strong algorithm by itself is not always the best choice, since an ensemble of not-so-good recommenders can be preferable if they complement each other and together provide a better result than one individual algorithm does. We think the choice of algorithms to work together, and the combining algorithm (which in our case was synergy), can make a big impact on the performance of the overall system.
So far we have not compared the different combinations, or said any more about the results from the different connection algorithms than what was said in the Results section. Looking at the synergy results for Company A’s data set, we can see that there is movement in the top results between the short and the long tests. What seems consistent between the results is that clusterings that use Attribute A or B on the first level do well. For the baselines, using the exact value of Attribute B (i.e. without tokenising the value) gives the best results for both the shorter and the longer tests on Company A’s data set.
The best results from the individual algorithms for the short tests on Company B’s data set seem to consist of combinations of Attributes A, F and H. Attribute F was also the best attribute when using the baselines on the short tests. For the long tests on Company B’s data, the same attributes seem to give the best results, and for the baselines Attribute F is again the best. The reason why we think Attribute A does well in both data sets is that it has a tree-like structure where there is a lot of information that can be used when clustering the products. We cannot discuss why Attributes F and H are good indicators for Company B’s data set without revealing too much about the attributes, so we will not speculate about them here. Attribute B from Company A’s data set can be split into several values, and we think that is part of the reason why the combinations that clustered on Attribute B on the first level got good results: it allowed us to find connections between items whose values were not identical but similar. We might also have gotten even better results if we had not had to restrict the number of values used due to the design of the general implementation.
During the work on this thesis many critical choices had to be made, and thus there are several sources of error that might have affected the results. These are of two different natures: choices pertaining to the whole system, with one clustering algorithm and one connection algorithm, and choices pertaining to the clustering with kMeans++ itself. For the whole system we think the main sources of error are which clustering algorithm to use, how many attributes should be used on one level of clustering, and how the levels of clustering are determined. One thing that could have been done differently for the levels is to have a more dynamic model where the number of products in a cluster determines whether the products in it should be clustered further.
There are many sources of error connected to the use of the kMeans++ algorithm and to our general implementation of it. The first is the value of k, since different values of k can lead to vastly different groupings of the points, or items in our case. The second source of error worth mentioning is the attribute weights. To limit the number of combinations when we had several attributes on one level, we had to decide on one attribute weight combination and ended up splitting the weight evenly. With other weights we would have gotten different results (whether better or worse is impossible to say), but we felt splitting the weight evenly was the best choice, considering we would have the clustering on two levels and the reference clusterings to compare the results with. The third and last source of error worth mentioning is the choice and implementation of transforms for the attributes, i.e. the way the attribute values are transformed from their actual values into the double array used by Apache’s implementation of the KMeans++ algorithm. For some attributes information is lost because there are too many possible values to take into account, so not all of them are represented in the array.
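To make the last point concrete, the following is a minimal sketch (our own illustration, not the thesis implementation) of turning a categorical attribute into the double array expected by Apache Commons Math’s KMeans++ implementation. Only a fixed vocabulary of frequent values gets a position in the array, so rarer values, such as "western" below, simply disappear; the attribute values, item names and choice of k are invented for the example.
import java.util.*;
import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class AttributeVectorSketch {
    public static void main(String[] args) {
        // Hypothetical vocabulary: the N most frequent values of the attribute.
        List<String> keptValues = Arrays.asList("fantasy", "crime", "romance");

        // Hypothetical items, each with a set of values for the attribute.
        Map<String, Set<String>> items = new LinkedHashMap<>();
        items.put("item1", new HashSet<>(Arrays.asList("fantasy", "crime")));
        items.put("item2", new HashSet<>(Collections.singletonList("romance")));
        items.put("item3", new HashSet<>(Arrays.asList("crime", "western"))); // "western" is lost

        // One array position per kept value: 1.0 if the item has the value, otherwise 0.0.
        List<DoublePoint> points = new ArrayList<>();
        for (Set<String> values : items.values()) {
            double[] vector = new double[keptValues.size()];
            for (int i = 0; i < keptValues.size(); i++) {
                vector[i] = values.contains(keptValues.get(i)) ? 1.0 : 0.0;
            }
            points.add(new DoublePoint(vector));
        }

        // Cluster with k = 2 and at most 100 iterations.
        KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(2, 100);
        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);
        System.out.println("Number of clusters: " + clusters.size());
    }
}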
There have been many new experiences during the work on this thesis. We have had the chance to work at a company that deploys a recommender system to customers and to work in their software environment. We have also worked in a research team, where we both had help from them and had to make compromises to make the work fit with the purposes and workings of the research group.
In hindsight, it is easier to see the mistakes made along the way, and one thing we definitely would have done differently is to spend more time in the beginning researching different clustering methods, instead of concentrating on the ones we already knew of. The fact that we have noticed many of our mistakes while writing this report is to us a sign of how much we have learned. We came into this project with knowledge from a few courses on AI and data mining, but with hardly any knowledge of recommender systems. We feel that we have learned a lot during the work on this thesis, especially about the field of recommender systems but also about data mining.
Another option would be to implement a system that removes some sources of error, such as the attribute weights, through automated testing, since this is otherwise something that would have to be tuned manually for each of Apptus’ clients.
One interesting thing that came from the results was that Attribute A seemed to give good clusterings for both data sets. It might be worth investigating further whether Attribute A generally gives good clusterings and whether the same holds for attributes with a tree-like structure in general.
More testing using different connection algorithms might also be warranted, since that mostly had to be skipped in this thesis due to time constraints. It would also be interesting to see if a system using our algorithm with the baselines as backfillers would improve the system’s performance on Company B’s data set.
Chapter 6
Conclusions
In the earlier testing using specific attributes from Company A’s data set, the system looked promising as a backfiller. However, further testing on Company A’s data set using the general system implementation contradicted this, and did not show any potential for the system to act as a strong individual algorithm either.
For Company B’s data set our algorithm showed great potential as an individual algorithm, but it was outperformed by Apptus’ best individual algorithm. Something to keep in mind, however, is that Apptus’ algorithm had an unfair advantage (see Section 5.1, Discussion of Results). Our algorithm did not show much potential as a backfiller here, since combining Apptus’ best individual algorithm with one of the baselines gave better results for both the short and the long tests.
From these results we draw two main conclusions:
1. The algorithm will not have the same functionality (i.e. backfiller or individual
algorithm) on all data sets
2. A strong individual algorithm does not necessarily give a better overall result when
combining it with another strong algorithm
While it does not matter that much which role our algorithm plays for different data sets, the fact that it only gives good results on one of them does matter, since the aim of this thesis was for it to work on e-commerce sites in general and not just on a specific one.
Appendices
Appendix A
The Rest of the Result Tables
This appendix contains the result tables that were not included in the Results chapter, in order to limit the number of tables there. In choosing which tables to move here we decided on the tables from the short tests, partly because they took up more space than those from the long tests and partly because we consider the results from the long tests more important.
A.1 Results for Company A
Table A.1: Results from the short tests run for our own general
algorithm on Company A’s data set.
A.2 Results for Company B
In this section the results from the short tests on Company B’s data set will be presented. The presentation follows the same pattern as for Company A.
The combinations with connection algorithm 2 outperformed those with connection algorithms 1 and 3 when looking at individual results, so we decided to use that one. The results for the short tests on the baselines are shown in Table A.5.
Appendix B
How To Configure the Conditions File
To run the system, a file with configurations for how the clustering should work is needed. This appendix describes how that file should be formatted and which options are available.
B.1 Formatting
The configuration file should have the following format:
Levels
NameOfClusteringClass
Transformation
AttributeName NameOfTransform Weight Arguments
AttributeName NameOfTransform Weight Arguments
END
Filters
AND
AttributeName NameOfFilter Arguments
END
END
END
END
Several blocks, from NameOfClusteringClass to its closing END tag, can be added to get hierarchical clustering. The Transformation and Filters tags must be defined but they may be empty, i.e. write Filters on one line and END on the next one. The filters can also specify AND and OR conditions using the tags AND and OR respectively. If filters are defined, one should use enclosing AND/OR tags for the program to actually read the filters. For example, this could be used to require that a product both has 2 values of attribute A and 1 value of attribute B. At the end of this appendix there is an example of a detailed configuration file.
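As a small sketch of that condition, assuming that the argument to NValuedFilter is the required number of values (as in the example in B.5), the Filters block could look like this:
Filters
AND
Attribute_A NValuedFilter 2
Attribute_B NValuedFilter 1
END
END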
B.2 Transformation Classes
B.2.1 TransformNValued
The TransformNValued transform is the base class for the categorical value transforms. It is the one to use if you do not have any requirements on which values are included, and it does not take any arguments. It can handle several values for an attribute.
B.2.2 TransformNValuedFilterPrefix
The TransformNValuedFilterPrefix transform inherits TransformNValued and has the same
functionality but places restrictions on the values accepted by defining accepted prefix of
the values. For example, if one has q as the input argument to the transform it will only
keep the values that starts with q. The default argument for this transform if none is sent in
is the empty string. It does not allow for several prefixes as input arguments but the class
could easily be extended or changed to allow for that.
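As a minimal sketch of this behaviour (our own illustration, not the actual class), the prefix restriction could be expressed as follows:
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: keeps the values that start with the given prefix.
// The default prefix is the empty string, which accepts every value.
final class PrefixFilterSketch {
    static List<String> keepWithPrefix(List<String> values, String prefix) {
        String p = (prefix == null) ? "" : prefix;
        return values.stream()
                .filter(v -> v.startsWith(p))
                .collect(Collectors.toList());
    }
}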
B.2.3 TransformNValuedFilterContains
The TransformNValuedFilterContains transform inherits TransformNValuedFilterPrefix
and has the same functionality except it will look for the input argument somewhere in
the value for the attribute instead of just at the beginning. Same restrictions and default
value apply as for the TransformNValuedFilterPrefix class.
B.2.4 TransformNValuedFilterNTopSeller
The TransformNValuedFilterNTopSeller transform inherits TransformNValued. It works
by querying eSales which n values for the attribute have been most common among the
products that have been sold so far. Since this is market specific and also dependent on the
time as new user data becomes available as time goes, this transform means the clustering
needs to be redone every once in a while. There is a boolean variable in the TopLevel class
called marketDependentClustering that regulates whether or not this reclustering should
be done. The transform takes one input argument – the number n. The default value for n
if no argument is given is 5000.
B.2.5 TransformNValuedFilterNFrequent
The TransformNValuedFilterNFrequent transform is similar to TransformNValuedFilterN-
TopSeller but instead of querying eSales for the top sellers it uses the product catalogue
information and keeps the n most frequent values for the attribute from there. It takes
the same input argument as TransformNValuedFilterNTopSeller and has the same default
value.
B.2.6 TransformNValuedFilterOutStopKeepNFrequent
The TransformNValuedFilterOutStopKeepNFrequent transform (yes, I know the names are getting ridiculously long) is similar to TransformNValuedFilterNFrequent but is intended for attributes that contain so called stop words, i.e. words that are common but do not really give much information, like “and”, “but” and so on. This transform has two input arguments: the number of frequent values to keep (n) and the number of most common values to filter out as stop words. They should be given in that order. The default values for the parameters are 5000 for n and 20 for the stop words.
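A minimal sketch of this selection (our own illustration, not the actual transform; it assumes the value frequencies have already been counted elsewhere and are passed in as a map):
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: drops the `stop` most common values ("stop words") and keeps the next n.
final class StopKeepNSketch {
    static List<String> keptValues(Map<String, Integer> valueFrequencies, int n, int stop) {
        return valueFrequencies.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .skip(stop)   // filter out the most common values
                .limit(n)     // keep the n most frequent of the remaining values
                .collect(Collectors.toList());
    }
}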
B.2.7 TransformNValuedFilterNFrequentTokenize
The TransformNValuedFilterNFrequentTokenize transform inherits TransformNValued-
FilterNFrequent and is very similar to it. The only difference is that it tokenises all values
and then sorts the tokens instead of the values, with the sorting based on frequency in the
product catalogue. It takes the same input argument and has the same default value as its
super class.
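A minimal sketch of the tokenising step (our own illustration, assuming simple whitespace tokenisation; the actual tokenisation in the system may differ):
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: splits every value into tokens and keeps the n most frequent tokens.
final class TokenizeSketch {
    static List<String> mostFrequentTokens(List<String> allValues, int n) {
        Map<String, Long> tokenCounts = allValues.stream()
                .flatMap(v -> Arrays.stream(v.toLowerCase().split("\\s+")))
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return tokenCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}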
B.2.8 TransformNumerical
The TransformNumerical transform is intended to be used for attributes with numerical
values. It requires two arguments – the bucket definition class used to create the buckets
the values are placed in and how many items one would like to be in each bucket. If these
arguments are not supplied, an IOException will be thrown.
There are currently two bucket definition classes available, MedianSpacedBuckets and EvenlySpacedBuckets. EvenlySpacedBuckets does pretty much what it sounds like: it calculates the difference between the highest and lowest value and then defines the distance between the bucket boundaries as that global difference divided by the number of buckets. The number of buckets is calculated as the number of items divided by the specified number of items per bucket. MedianSpacedBuckets calculates its bucket boundaries by sorting the values and then placing the specified number of items per bucket in each bucket until it runs out of values. If a value and the one used as a boundary value are the same, it jumps over that value for the next bucket. The TransformNumerical transform will add an extra element at the end of the array that is used as an indicator when an item has no value for this attribute.
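To illustrate, a minimal sketch of how evenly spaced bucket boundaries could be computed (our own illustration, not the actual EvenlySpacedBuckets class):
// Sketch only: divides the value range into buckets of equal width, where the number
// of buckets follows from the desired number of items per bucket.
final class EvenlySpacedBucketsSketch {
    static double[] boundaries(double[] sortedValues, int itemsPerBucket) {
        double min = sortedValues[0];
        double max = sortedValues[sortedValues.length - 1];
        int nbrOfBuckets = Math.max(1, sortedValues.length / itemsPerBucket);
        double step = (max - min) / nbrOfBuckets;
        double[] upperBounds = new double[nbrOfBuckets];
        for (int i = 0; i < nbrOfBuckets; i++) {
            upperBounds[i] = min + step * (i + 1); // upper boundary of bucket i
        }
        return upperBounds;
    }
}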
The DoubleBucketBase class is used when clustering (or bucketing) attributes with nu-
merical values. It assumes that the transform class used is either TransformNumerical or
a future extension of that class.
B.5 A Detailed Configuration File Example
Levels
SimpleClusters
Transformation
Attribute_A TransformNValued 0.5
Attribute_C TransformNValuedFilterOutStopKeepNFrequent 0.5 5000 20
END
Filters
AND
Attribute_A NValuedFilter 1
Attribute_C NValuedFilter 1
END
END
END
DoubleBucketBase
Transformation
Attribute_B TransformNumerical 1 MedianSpacedBuckets 50
END
Filters
END
END
END