Big Data and The Web
Lucca, Italy
Big Data and the Web: Algorithms for
Data Intensive Scalable Computing
PhD Program in Computer Science and Engineering
XXIV Cycle
By
Gianmarco De Francisci Morales
2012
The dissertation of Gianmarco De Francisci Morales is
approved.
Program Coordinator: Prof. Rocco De Nicola, IMT Lucca
Supervisor: Dott. Claudio Lucchese, ISTI-CNR Pisa
Co-Supervisor: Dott. Ranieri Baraglia, ISTI-CNR Pisa
Tutor: Dott. Leonardo Badia, University of Padova
The dissertation of Gianmarco De Francisci Morales has been reviewed
by:
Aristides Gionis, Yahoo! Research Barcelona
Iadh Ounis, University of Glasgow
IMT Institute for Advanced Studies, Lucca
2012
Where is the wisdom we
have lost in knowledge?
Where is the knowledge we
have lost in information?
T. S. Eliot, The Rock
To my mother, for her unconditional love
and support throughout all these years.
Acknowledgements
I owe my deepest and earnest gratitude to my supervisor,
Claudio Lucchese, who shepherded me towards this goal with
great wisdom and everlasting patience.
I am grateful to all my co-authors, without whom this work
would have been impossible. A separate acknowledgement
goes to Aris Gionis for his precious advice, constant presence
and exemplary guidance.
I thank all the friends and colleagues in Lucca with whom
I shared the experience of being a Ph.D. student, my Lab in
Pisa that accompanied me through this journey, and all the
people in Barcelona that helped me feel at home.
Thanks to my family and to everyone who believed in me.
Contents
List of Figures xiii
List of Tables xv
Publications xvi
Abstract xviii
1 Introduction 1
1.1 The Data Deluge . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Mining the Web . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Taxonomy of Web data . . . . . . . . . . . . . . . . . 8
1.3 Management of Data . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Data Intensive Scalable Computing . . . . . . . . . 14
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Related Work 19
2.1 DISC systems . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Computational Models and Extensions . . . . . . . 27
2.3 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 S4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 SSJ 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Problem definition and preliminaries . . . . . . . . . . . 40
3.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 MapReduce Term-Filtering (ELSA) . . . . . . . . . . 42
3.3.2 MapReduce Prefix-Filtering (VERN) . . . . . . . . . 44
3.4 SSJ Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Double-Pass MapReduce Prefix-Filtering (SSJ-2) . . 46
3.4.2 Double-Pass MapReduce Prefix-Filtering with Re-
mainder File (SSJ-2R) . . . . . . . . . . . . . . . . . 49
3.4.3 Partitioning . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Complexity analysis . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Experimental evaluation . . . . . . . . . . . . . . . . . . . . 56
3.6.1 Running time . . . . . . . . . . . . . . . . . . . . . . 57
3.6.2 Map phase . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.3 Shuffle size . . . . . . . . . . . . . . . . . . . . . 62
3.6.4 Reduce phase . . . . . . . . . . . . . . . . . . . . . . 63
3.6.5 Partitioning the remainder file . . . . . . . . . . 65
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 SCM 67
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Problem definition . . . . . . . . . . . . . . . . . . . . . 71
4.4 Application scenarios . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.1 Computing the set of candidate edges . . . . . . . . 74
4.5.2 The STACKMR algorithm . . . . . . . . . . . . . . . 75
4.5.3 Adaptation in MapReduce . . . . . . . . . . . . . . 81
4.5.4 The GREEDYMR algorithm . . . . . . . . . . . . . . 84
4.5.5 Analysis of the GREEDYMR algorithm . . . . . . . . 85
4.6 Experimental evaluation . . . . . . . . . . . . . . . . . . . . 87
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5 T.Rex 98
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Problem definition and model . . . . . . . . . . . . . . . 106
5.3.1 Entity popularity . . . . . . . . . . . . . . . . . . . . 113
5.4 System overview . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 Learning algorithm . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.1 Constraint selection . . . . . . . . . . . . . . . . . . 121
5.5.2 Additional features . . . . . . . . . . . . . . . . . . . 122
5.6 Experimental evaluation . . . . . . . . . . . . . . . . . . . . 123
5.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.6.2 Test set . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.6.3 Evaluation measures . . . . . . . . . . . . . . . . . . 126
5.6.4 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6 Conclusions 131
A List of Acronyms 135
References 137
List of Figures
1.1 The petabyte age. . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Data Information Knowledge Wisdom hierarchy. . . . . . . 4
1.3 Complexity of contributed algorithms. . . . . . . . . . . . . 17
2.1 DISC architecture. . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Data flow in the MapReduce programming paradigm. . . . 26
2.3 Overview of S4. . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Twitter hashtag counting in S4. . . . . . . . . . . . . . . . . 33
3.1 ELSA example. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 VERN example. . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Pruned document pair: the left part (orange/light) has
been pruned, the right part (blue/dark) has been indexed. 47
3.4 SSJ-2 example. . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 SSJ-2R example. . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Running time. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Average mapper completion time. . . . . . . . . . . . . . . 59
3.8 Mapper completion time distribution. . . . . . . . . . . . . 60
3.9 Effect of Prefix-filtering on inverted list length distribution. 62
3.10 Shufe size. . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.11 Average reducer completion time. . . . . . . . . . . . . . . 64
3.12 Remainder file and shuffle size varying K. . . . . . . . . . . 65
4.1 Example of a STACKMR run. . . . . . . . . . . . . . . . . . 80
4.2 Communication pattern for iterative graph algorithms on
MR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Distribution of edge similarities for the datasets. . . . . . . 88
4.4 Distribution of capacities for the three datasets. . . . . . . . 89
4.5 flickr-small dataset: matching value and number of
iterations as a function of the number of edges. . . . . . . . 92
4.6 flickr-large dataset: matching value and number of
iterations as a function of the number of edges. . . . . . . . 93
4.7 yahoo-answers dataset: matching value and number of
iterations as a function of the number of edges. . . . . . . . 94
4.8 Violation of capacities for STACKMR. . . . . . . . . . . . . . 95
4.9 Normalized value of the b-matching achieved by the GREEDY-
MR algorithm as a function of the number of MapReduce
iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1 Osama Bin Laden trends on Twitter and news streams. . . . 101
5.2 Joplin tornado trends on Twitter and news streams. . . . . . 102
5.3 Cumulative Osama Bin Laden trends (news, Twitter and
clicks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 News-click delay distribution. . . . . . . . . . . . . . . . . . 114
5.5 Cumulative news-click delay distribution. . . . . . . . . . . 115
5.6 Overview of the T.REX system. . . . . . . . . . . . . . . . . 117
5.7 T.REX news ranking dataflow. . . . . . . . . . . . . . . . . 119
5.8 Distribution of entities in Twitter. . . . . . . . . . . . . . . . 123
5.9 Distribution of entities in news. . . . . . . . . . . . . . . . . 124
5.10 Average discounted cumulated gain on related entities. . . 129
List of Tables
2.1 Major Data Intensive Scalable Computing (DISC) systems. 21
3.1 Symbols and quantities. . . . . . . . . . . . . . . . . . . . . 54
3.2 Complexity analysis. . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Samples from the TREC WT10G collection. . . . . . . . . . 57
3.4 Statistics for the four algorithms on the three datasets. . . . 61
4.1 Dataset characteristics. |T|: number of items; |C|: number
of users; |E|: total number of item-user pairs with non-zero
similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1 Table of symbols. . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2 MRR, precision and coverage. . . . . . . . . . . . . . . . . . 128
Publications
1. G. De Francisci Morales, A. Gionis, C. Lucchese, From Chatter to Head-
lines: Harnessing the Real-Time Web for Personalized News Recommen-
dations, WSDM'12, 5th ACM International Conference on Web Search
and Data Mining, Seattle, 2012, pp. 153-162.
2. G. De Francisci Morales, A. Gionis, M. Sozio, Social Content Matching in
MapReduce, PVLDB, Proceedings of the VLDB Endowment, 4(7):460-469,
2011.
3. R. Baraglia, G. De Francisci Morales, C. Lucchese, Document Similarity
Self-Join with MapReduce, ICDM'10, 10th IEEE International Conference
on Data Mining, Sydney, 2010, pp. 731-736.
4. G. De Francisci Morales, C. Lucchese, R. Baraglia, Scaling Out All Pairs
Similarity Search with MapReduce, LSDS-IR'10, 8th Workshop on Large-
Scale Distributed Systems for Information Retrieval, @SIGIR'10, Geneva,
2010, pp. 25-30.
5. G. De Francisci Morales, C. Lucchese, R. Baraglia, Large-scale Data Anal-
ysis on the Cloud, XXIV Convegno Annuale del CMG-Italia, Roma, 2010.
Presentations
1. G. De Francisci Morales, Harnessing the Real-Time Web for Personalized
News Recommendation, Yahoo! Labs, Sunnyvale, 16 February 2012.
2. G. De Francisci Morales, Big Data and the Web: Algorithms for Data In-
tensive Scalable Computing, WSDM'12, Seattle, 8 February 2012.
3. G. De Francisci Morales, Social Content Matching in MapReduce, Yahoo!
Research, Barcelona, 10 March 2011.
4. G. De Francisci Morales, Cloud Computing for Large Scale Data Analy-
sis, Yahoo! Research, Barcelona, 2 December 2010.
5. G. De Francisci Morales, Scaling Out All Pairs Similarity Search with
MapReduce, Summer School on Social Networks, Lipari, 6 July 2010.
6. G. De Francisci Morales, How to Survive the Data Deluge: Petabyte Scale
Cloud Computing, ISTI-CNR, Pisa, 18 January 2010.
Abstract
This thesis explores the problem of large scale Web mining
by using Data Intensive Scalable Computing (DISC) systems.
Web mining aims to extract useful information and models
from data on the Web, the largest repository ever created.
DISC systems are an emerging technology for processing huge
datasets in parallel on large computer clusters.
Challenges arise from both themes of research. The Web is
heterogeneous: data lives in various formats that are best
modeled in different ways. Effectively extracting information
requires careful design of algorithms for specific categories of
data. The Web is huge, but DISC systems offer a platform for
building scalable solutions. However, they provide restricted
computing primitives for the sake of performance. Efficiently
harnessing the power of parallelism offered by DISC systems
involves rethinking traditional algorithms.
This thesis tackles three classical problems in Web mining.
First we propose a novel solution to finding similar items in
a bag of Web pages. Second we consider how to effectively
distribute content from Web 2.0 to users via graph matching.
Third we show how to harness the streams from the real-time
Web to suggest news articles. Our main contribution lies in
rethinking these problems in the context of massive scale Web
mining, and in designing efcient MapReduce and streaming
algorithms to solve these problems on DISC systems.
Chapter 1
Introduction
An incredible data deluge is currently drowning the world. Data sources
are everywhere, from Web 2.0 and user-generated content to large scien-
tific experiments, from social networks to wireless sensor networks. This
massive amount of data is a valuable asset in our information society.
Data analysis is the process of inspecting data in order to extract
useful information. Decision makers commonly use this information to
drive their choices. The quality of the information extracted by this pro-
cess greatly benefits from the availability of extensive datasets.
The Web is the biggest and fastest growing data repository in the
world. Its size and diversity make it the ideal resource to mine for useful
information. Data on the Web is very diverse in both content and format.
Consequently, algorithms for Web mining need to take into account the
specific characteristics of the data to be efficient.
As we enter the petabyte age, traditional approaches for data analy-
sis begin to show their limits. Commonly available data analysis tools are
unable to keep up with the increase in size, diversity and rate of change
of the Web. Data Intensive Scalable Computing is an emerging alterna-
tive technology for large scale data analysis. DISC systems combine both
storage and computing in a distributed and virtualized manner. These
systems are built to scale to thousands of computers, and focus on fault
tolerance, cost effectiveness and ease of use.
1.1 The Data Deluge
How would you sort 1 GB of data? Today's computers have enough
memory to hold this quantity of data, so any optimal in-memory al-
gorithm will suffice. What if you had to sort 100 GB of data? Even if
systems with more than 100 GB of memory exist, they are by no means
common or cheap. So the best solution is to use a disk based sorting al-
gorithm. However, what if you had 10 TB of data to sort? At a transfer
rate of about 100 MB/s for a normal disk it would take more than one
day to make a single pass over the dataset. In this case the bandwidth
between memory and disk is the bottleneck. In any case, today's disks
are usually 1 to 2 TB in size, which means that just to hold the data we
need multiple disks. In order to obtain acceptable completion times, we
also need to use multiple computers and a parallel algorithm.
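The arithmetic behind this example can be sketched as follows; this is only a back-of-envelope estimate that assumes the 100 MB/s sequential transfer rate cited above.

```python
# Back-of-envelope estimate of a single sequential pass over a dataset,
# assuming a 100 MB/s disk transfer rate as in the example above.

def scan_time_seconds(dataset_bytes, rate_bytes_per_s=100 * 10**6):
    """Time for one sequential pass at the given transfer rate."""
    return dataset_bytes / rate_bytes_per_s

one_tb = 10**12
hours = scan_time_seconds(10 * one_tb) / 3600
print(hours)  # about 27.8 hours for 10 TB: more than one day
```

At 100 MB/s, a single pass over 10 TB takes 100,000 seconds, which indeed exceeds one day and motivates the parallel approach described in the text.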
This example illustrates a general point: the same problem at differ-
ent scales needs radically different solutions. In many cases we even
need to change the model we use to reason about the problem because
the simplifying assumptions made by the models do not hold at every
scale. Citing Box and Draper (1986): "Essentially, all models are wrong,
but some are useful", and arguably most of them do not scale.
Currently, an incredible data deluge is drowning the world. The
amount of data we need to sift through every day is enormous. For in-
stance, the results of a search engine query are so many that we are not
able to examine all of them, and indeed the competition now focuses on
the top ten results. This is just an example of a more general trend.
The issues raised by large datasets in the context of analytical ap-
plications are becoming ever more important as we enter the so-called
petabyte age. Figure 1.1 shows the sizes of the datasets for problems
we currently face (Anderson, 2008). The datasets are orders of magni-
tude greater than what fits on a single hard drive, and their management
poses a serious challenge. Web companies are currently facing this issue,
and striving to find efficient solutions. The ability to manage and ana-
lyze more data is a distinct competitive advantage for them. This issue
has been labeled in various ways: petabyte scale, Web scale or big data.
Figure 1.1: The petabyte age.
But how do we define big data? The definition is of course relative
and evolves in time as technology progresses. Indeed, thirty years ago
one terabyte would be considered enormous, while today we are com-
monly dealing with such quantities of data.
Gartner (2011) puts the focus not only on size but on three different
dimensions of growth for data, the 3V: Volume, Variety and Velocity. The
data is surely growing in size, but also in complexity as it shows up in
different formats and from different sources that are hard to integrate,
and in dynamicity as it arrives continuously, changes rapidly and needs
to be processed as fast as possible.
Loukides (2010) offers a different point of view by saying that big
data is "when the size of the data itself becomes part of the problem"
and traditional techniques for working with data run out of steam.
Along the same lines, Jacobs (2009) states that big data is "data whose
size forces us to look beyond the tried-and-true methods that are preva-
lent at that time". This means that we can call "big" any amount of data
that forces us to use or create innovative methodologies.
We can think that the intrinsic characteristics of the object to be an-
alyzed demand modifications to traditional data managing procedures.
[Figure omitted: a pyramid with levels Data, Information, Knowledge
and Wisdom, ordered along axes of Connectedness and Understanding;
the transitions between levels are labeled Understand Relations,
Understand Patterns and Understand Principles.]
Figure 1.2: Data Information Knowledge Wisdom hierarchy.
Alternatively, we can take the point of view of the subject who needs to
manage the data. The emphasis is thus on user requirements such as
throughput and latency. In either case, all the previous definitions hint
at the fact that big data is a driver for research.
But why are we interested in data? It is common belief that data with-
out a model is just noise. Models are used to describe salient features in
the data, which can be extracted via data mining. Figure 1.2 depicts the
popular Data Information Knowledge Wisdom (DIKW) hierarchy (Row-
ley, 2007). In this hierarchy data stands at the lowest level and bears
the smallest level of understanding. Data needs to be processed and con-
densed into more connected forms in order to be useful for event com-
prehension and decision making. Information, knowledge and wisdom
are these forms of understanding: relations and patterns that allow us
to gain deeper awareness of the process that generated the data, and
principles that can guide future decisions.
For data mining, the scaling up of datasets is a double edged sword.
On the one hand, it is an opportunity, because "there is no data like more data".
Deeper insights are possible when more data is available (Halevy et al.,
2009). On the other hand, it is a challenge. Current methodologies are
often not suitable to handle huge datasets, so new solutions are needed.
The large availability of data potentially enables more powerful anal-
ysis and unexpected outcomes. For example, Google Flu Trends can
detect regional flu outbreaks up to ten days faster than the Centers for
Disease Control and Prevention by analyzing the volume of flu-related
queries to the Web search engine (Ginsberg et al., 2008). Companies like
IBM and Google are using large scale data to solve extremely challeng-
ing problems like avoiding traffic congestion, designing self-driving cars
or understanding Jeopardy riddles (Loukides, 2011). Chapter 2 presents
more examples of interesting large scale data analysis problems.
Data originates from a wide variety of sources. Radio-Frequency Iden-
tification (RFID) tags and Global Positioning System (GPS) receivers are
already spread all around us. Sensors like these produce petabytes of
data just as a result of their sheer numbers, thus starting the so called
industrial revolution of data (Hellerstein, 2008).
Scientific experiments are also a huge data source. The Large Hadron
Collider at CERN is expected to generate around 50 TB of raw data per
day. The Hubble telescope has captured millions of astronomical images,
each weighing hundreds of megabytes. Computational biology experi-
ments like high-throughput genome sequencing produce large quantities
of data that require extensive post-processing.
The focus of our work is directed to another massive source of data:
the Web. The importance of the Web from the scientic, economical and
political point of view has grown dramatically over the last ten years, so
much that internet access has been declared a human right by the United
Nations (La Rue, 2011). Web users produce vast amounts of text, audio
and video contents in the Web 2.0. Relationships and tags in social net-
works create massive graphs spanning millions of vertexes and billions
of edges. In the next section we highlight some of the opportunities and
challenges found when mining the Web.
1.2 Mining the Web
The Web is easily the single largest publicly accessible data source in the
world (Liu, 2007). The continuous usage of the Web has accelerated its
growth. People and companies keep adding to the enormous mass of
pages already present.
In the last decade the Web has increased its importance to the point of
becoming the center of our digital lives (Hammersley, 2011). People shop
and read news on the Web, governments offer public services through it
and enterprises develop Web marketing strategies. Investments in Web
advertising have surpassed the ones in television and newspaper in most
countries. This is a clear testament to the importance of the Web.
The estimated size of the indexable Web was at least 11.5 billion pages
as of January 2005 (Gulli and Signorini, 2005). Today, the Web size is
estimated between 50 and 100 billion pages and roughly doubling every
eight months (Baeza-Yates and Ribeiro-Neto, 2011), faster than Moore's
law. Furthermore, the Web has become infinite for practical purposes, as
it is possible to generate an infinite number of dynamic pages. As a result,
there is on the Web an abundance of data with growing value.
The value of this data lies in being representative of collective user
behavior. It is our digital footprint. By analyzing a large amount of these
traces it is possible to find common patterns, extract user models, make
better predictions, build smarter products and gain a better understand-
ing of the dynamics of human behavior. Enterprises have started to re-
alize what a gold mine they are sitting on. Companies like Facebook
and Twitter base their business model entirely on collecting user data.
Data on the Web is often produced as a byproduct of online activity
of the users, and is sometimes referred to as data exhaust. This data is
silently collected while the users are pursuing their own goal online, e.g.
query logs from search engines, co-buying and co-visiting statistics from
online shops, click-through rates from news and advertisements, and so on.
This process of collecting data automatically can scale much further
than traditional methods like polls and surveys. For example it is possi-
ble to monitor public interest and public opinion by analyzing collective
click behavior in news portals, references and sentiments in blogs and
micro-blogs or query terms in search engines.
As another example, Yahoo! and Facebook (2011) are currently repli-
cating the famous small world experiment devised by Milgram. They
are leveraging the social network created by Facebook users to test the
"six degrees of separation" hypothesis on a planetary scale. The large
number of users makes it possible to address the critiques of selection and
non-response bias leveled at the original experiment.
Let us now define Web mining more precisely. Web mining is the ap-
plication of data mining techniques to discover patterns from the Web.
According to the target of the analysis at hand, Web mining can be cat-
egoryzed into three different types: Web structure mining, Web content
mining and Web usage mining (Liu, 2007).
Web structure mining mines the hyperlink structure of the Web using
graph theory. For example, links are used by search engines to nd
important Web pages, or in social networks to discover communi-
ties of users who share common interests.
Web content mining analyzes Web page contents. Web content mining
differs from traditional data and text mining mainly because of the
semi-structured and multimedia nature of Web pages. For exam-
ple, it is possible to automatically classify and cluster Web pages
according to their topics but it is also possible to mine customer
product reviews to discover consumer sentiments.
Web usage mining extracts information from user access patterns found
in Web server logs, which record the pages visited by each user, and
from search patterns found in query logs, which record the terms
searched by each user. Web usage mining investigates what users
are interested in on the Web.
Mining the Web is typically deemed highly promising and rewarding.
However, it is by no means an easy task and there is a flip side of the coin:
data found on the Web is extremely noisy.
The noise comes from two main sources. First, Web pages are com-
plex and contain many pieces of information, e.g., the main content of
the page, links, advertisements, images and scripts. For a particular ap-
plication, only part of the information is useful and the rest is consid-
ered noise. Second, the Web is open to anyone and does not enforce any
quality control of information. Consequently a large amount of informa-
tion on the Web is of low quality, erroneous, or even misleading, e.g.,
automatically generated spam, content farms and dangling links. This
applies also to Web server logs and Web search engine logs, where er-
ratic behaviors, automatic crawling, spelling mistakes, spam queries and
attacks introduce a large amount of noise.
The signal, i.e. the useful part of the information, is often buried un-
der a pile of dirty, noisy and unrelated data. It is the duty of a data ana-
lyst to separate the wheat from the chaff by using sophisticated cleaning,
pre-processing and mining techniques.
This challenge is further complicated by the sheer size of the data.
Datasets coming from the Web are too large to handle using traditional
systems. Storing, moving and managing them are complex tasks by
themselves. For this reason a data analyst needs the help of powerful yet
easy to use systems that abstract away the complex machinery needed
to deliver the required performance. The goal of these systems is to re-
duce the time-to-insight by speeding up the design-prototype-test cycle
in order to test a larger number of hypotheses, as detailed in Section 1.3.
1.2.1 Taxonomy of Web data
The Web is a very diverse place. It is an open platform where anybody
can add their own contribution. As a result, information on the Web is
heterogeneous. Almost any kind of information can be found on it, usu-
ally reproduced in a proliferation of different formats. As a consequence,
the categories of data available on the Web are quite varied.
Data of all kinds exist on the Web: semi-structured Web pages, struc-
tured tables, unstructured texts, explicit and implicit links, and multime-
dia files (images, audio, and video), just to name a few. A complete
classification of the categories of data on the Web is out of the scope of
this thesis. However, we present next what we consider to be the most
common and representative categories, the ones on which we focus our
attention. Most of the Web fits one of these three categories:
Bags are unordered collections of items. The Web can be seen as a col-
lection of documents when ignoring hyperlinks. Web sites that
collect one specific kind of item (e.g. Flickr or YouTube) can also
be modeled as bags. The items in the bag are typically represented
as sets, multisets or vectors. Most classical problems like similarity,
clustering and frequent itemset mining are dened over bags.
Graphs are dened by a set of vertexes connected by a set of edges. The
Web link structure and social networks fit in this category. Graphs
are an extremely flexible data model, as almost anything can be seen
as a graph. They can also be generated from predicates on a set of
items (e.g. similarity graph, query flow graph). Graph algorithms
like PageRank, community detection and matching are commonly
employed to solve problems in Web and social network mining.
Streams are unbounded sequences of items ordered by time. Search
queries and click streams are traditional examples, but streams are
generated as well by news portals, micro-blogging services and
real-time Web sites like Twitter, and status updates on social net-
works like Facebook, Google+ and LinkedIn. Unlike time series,
Web streams are textual or multimedia, and carry rich metadata.
Traditional stream mining problems are clustering, classication
and estimation of frequency moments.
Each of these categories has its own characteristics and complexities.
Bags of items include very large collections whose items can be analyzed
independently in parallel. However, this lack of structure can also com-
plicate analysis as in the case of clustering and nearest neighbor search,
where each item can be related to any other item in the bag.
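As an illustration of why bags are both simple and costly to analyze, the following sketch models documents as sets of terms and compares all pairs with the Jaccard similarity. The documents and terms are invented for the example; the quadratic loop over all pairs is the kind of cost that large-scale similarity self-join algorithms aim to avoid.

```python
# Illustrative sketch: documents in a "bag" modeled as sets of terms,
# compared pairwise with the Jaccard similarity. Every pair must be
# examined, which makes all-pairs similarity quadratic in the bag size.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

docs = {
    "d1": {"big", "data", "web"},
    "d2": {"web", "mining", "data"},
    "d3": {"stream", "processing"},
}

# Naive all-pairs comparison over the bag of documents.
pairs = {(x, y): jaccard(docs[x], docs[y])
         for x, y in combinations(sorted(docs), 2)}
```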
In contrast, graphs have a well-defined structure that limits the lo-
cal relationships that need to be taken into account. For this reason lo-
cal properties like degree distribution and clustering coefficient are very
easy to compute. However, global properties such as diameter and girth
generally require more complex iterative algorithms.
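A toy sketch of this difference: the degree distribution falls out of a single scan of the edge list, while even a small-scale diameter computation needs a full traversal from every vertex. The graph here is invented for illustration.

```python
# Local vs. global graph properties: degree distribution from one pass
# over the edges, diameter via BFS from every vertex (toy-scale only).
from collections import Counter, deque

edges = [(1, 2), (2, 3), (3, 4), (1, 3)]

# Local: degree distribution in a single scan of the edge list.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Global: diameter requires traversing the whole graph repeatedly.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def eccentricity(src):
    """Longest shortest-path distance from src, via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return max(dist.values())

diameter = max(eccentricity(v) for v in adj)
```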
Finally, streams are continuously produced, and a large part of their
value lies in their freshness. As such, they cannot be analyzed in batches
and need to be processed as fast as possible, in an online fashion.
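As a minimal example of such online processing, the following sketch uses the classic Misra-Gries summary, which finds frequent items in a single pass over a stream using a bounded number of counters. The stream of queries is invented for the example.

```python
# One-pass stream processing sketch: the Misra-Gries summary keeps at
# most k-1 counters and is guaranteed to retain every item occurring
# more than n/k times in a stream of n items, without storing the stream.

def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Summary full: decrement all counters, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

queries = ["obama"] * 6 + ["tornado"] * 3 + ["web", "data", "mining"]
candidates = misra_gries(queries, k=3)
```

Only candidate heavy hitters survive; a second pass (or approximate counts) is needed to confirm exact frequencies.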
For the reasons just described, the algorithms and the methodologies
needed to analyze each category of data are quite different from each
other. As detailed in Section 1.4, in this work we present three different
algorithms for large scale data analysis, each one explicitly tailored for
one of these categories of data.
1.3 Management of Data
Providing data for analysis is a problem that has been extensively stud-
ied. Many solutions exist but the traditional approach is to employ a
Database Management System (DBMS) to store and manage the data.
Modern DBMSs originate in the '70s, when Codd (1970) introduced
the famous relational model that is still in use today. The model intro-
duces the familiar concepts of tabular data, relation, normalization, pri-
mary key, relational algebra and so on.
The original purpose of DBMSs was to process transactions in busi-
ness oriented processes, also known as Online Transaction Processing
(OLTP). Queries were written in Structured Query Language (SQL) and
run against data modeled in relational style. On the other hand, cur-
rently DBMSs are used in a wide range of different areas: besides OLTP,
we have Online Analysis Processing (OLAP) applications like data ware-
housing and business intelligence, stream processing with continuous
queries, text databases and much more (Stonebraker and Çetintemel,
2005). Furthermore, stored procedures are preferred over plain SQL for
performance reasons. Given the shift and diversification of application
fields, it is not a surprise that most existing DBMSs fail to meet today's
high performance requirements (Stonebraker et al., 2007a,b).
High performance has always been a key issue in database research.
There are usually two approaches to achieve it: vertical and horizon-
tal. The former is the simpler, and consists in adding resources (cores,
memory, disks) to an existing system. If the resulting system is capable of
taking advantage of the new resources it is said to scale up. The inherent
limitation of this approach is that the single most powerful system avail-
able on earth could not sufce. The latter approach is more complex,
and consists in adding new separate systems in parallel. The multiple
systems are treated as a single logical unit. If the system achieves higher
performance it is said to scale out. However, the result is a parallel sys-
tem with all the hard problems of concurrency.
1.3.1 Parallelism
Typical parallel systems are divided into three categories according to their architecture: shared memory, shared disk, or shared nothing. In the first category we find Symmetric Multi-Processors (SMPs) and large parallel machines. In the second one we find rack based solutions like Storage Area Networks (SANs) or Network Attached Storage (NAS). The last category includes large commodity clusters interconnected by a local network, and is deemed to be the most scalable (Stonebraker, 1986).
Parallel Database Management Systems (PDBMSs) (DeWitt and Gray, 1992) are the result of these considerations. They attempt to achieve high performance by leveraging parallelism. Almost all the designs of PDBMSs use the same basic dataflow pattern for query processing, and horizontal partitioning of the tables on a cluster of shared nothing machines for data distribution (DeWitt et al., 1990).
Unfortunately, PDBMSs are very complex systems. They need fine tuning of many knobs and feature simplistic fault tolerance policies. In the end, they do not provide the user with adequate ease of installation and administration (the so called "one button" experience), and flexibility of use, e.g., poor support of User Defined Functions (UDFs).
To date, despite numerous claims about their scalability, PDBMSs have proven to be profitable only up to the tens or hundreds of nodes. It is legitimate to question whether this is the result of a fundamental theoretical problem in the parallel approach.
Parallelism has some well known limitations. Amdahl (1967) argued in favor of a single-processor approach to achieve high performance. Indeed, the famous Amdahl's law states that the parallel speedup of a program is inherently limited by the inverse of its serial fraction, the non-parallelizable part of the program. His law also defines the concept of strong scalability, in which the total problem size is fixed. Equation 1.1 specifies Amdahl's law for N parallel processing units, where r_s and r_p are the serial and parallel fractions of the program (r_s + r_p = 1):

\[ \mathrm{SpeedUp}(N) = \frac{1}{r_s + \frac{r_p}{N}} \qquad (1.1) \]
Nevertheless, parallelism has a theoretical justification. Gustafson (1988) re-evaluated Amdahl's law under a different assumption, i.e., that the problem size increases with the number of computing units. In this case the problem size per unit is fixed. Under this assumption, the achievable speedup is almost linear, as expressed by Equation 1.2. In this case r_s and r_p are the serial and parallel fractions measured on the parallel system instead of the serial one. Equation 1.2 defines the concept of scaled speedup, or weak scalability:

\[ \mathrm{SpeedUp}(N) = r_s + r_p \cdot N = N + (1 - N) \cdot r_s \qquad (1.2) \]
Even though the two equations are mathematically equivalent (Shi, 1996), they make drastically different assumptions. In our case the size of the problem is large and ever growing. Hence it seems appropriate to adopt Gustafson's point of view, which justifies the parallel approach.
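As a quick numerical illustration (not part of the original analysis), the two laws can be evaluated side by side; the 5% serial fraction used here is an arbitrary choice:

```python
def amdahl_speedup(n, serial_frac):
    """Strong scaling (Equation 1.1): fixed total problem size."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

def gustafson_speedup(n, serial_frac):
    """Weak scaling (Equation 1.2): fixed problem size per processing unit."""
    return serial_frac + (1.0 - serial_frac) * n

# With a 5% serial fraction, strong scaling saturates below 1/0.05 = 20x,
# while weak scaling grows almost linearly with the number of units.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(n, 0.05), 2), round(gustafson_speedup(n, 0.05), 2))
```

Under strong scaling the speedup with 1000 units stays below 20x, while under weak scaling it exceeds 950x, which makes the practical difference between the two assumptions evident.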
Parallel computing has a long history. It has traditionally focused on number crunching. Common applications were tightly coupled and CPU-intensive (e.g., large simulations or finite element analysis). Control-parallel programming interfaces like Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) are still the de-facto standard in this area. These systems are notoriously hard to program. Fault tolerance is difficult to achieve and scalability is an art. They require explicit control of parallelism, and have been called "the assembly language of parallel computing".
In stark contrast with this legacy, a new class of parallel systems has
emerged: cloud computing. Cloud systems focus on being scalable, fault
tolerant, cost effective and easy to use.
Lately, cloud computing has received a substantial amount of attention from industry, academia and press. As a result, the term "cloud computing" has become a buzzword, overloaded with meanings. There is lack of consensus on what is and what is not cloud. Even simple client-server applications are sometimes included in the category (Creeger, 2009). The boundaries between similar technologies are fuzzy, so there is no clear distinction among grid, utility, cloud, and other kinds of computing technologies. In spite of the many attempts to describe cloud computing (Mell and Grance, 2009), there is no widely accepted definition.
However, within cloud computing, there is a more cohesive subset of technologies which is geared towards data analysis. We refer to this subset as Data Intensive Scalable Computing (DISC) systems. These systems are aimed mainly at I/O intensive tasks, are optimized for dealing with large amounts of data, and use a data-parallel approach. An interesting feature is that they are dispersed: computing and storage facilities are distributed, abstracted and intermixed. These systems attempt to move computation as close to data as possible, because moving large quantities of data is expensive. Finally, the burden of dealing with the issues caused by parallelism is removed from the programmer. This provides the programmer with a scale-agnostic programming model.
The data-parallel nature of DISC systems abstracts away many of the details of parallelism. This allows the design of clean and elegant algorithms. DISC systems offer a limited interface that allows the system to make strong assumptions about user code. This abstraction is useful for performance optimization, but constrains the class of algorithms that can be run on these systems. In this sense, DISC systems are not general purpose computing systems, but are specialized in solving a specific class of problems.
DISC systems are a natural alternative to PDBMSs when dealing with large scale data. As such, a fierce debate is currently taking place, both in industry and academia, on which is the best tool (Dean and Ghemawat, 2010; DeWitt and Stonebraker, 2008; Stonebraker et al., 2010).
1.3.2 Data Intensive Scalable Computing
Let us highlight some of the requirements for a system used to perform data intensive computing on large datasets. Given the effort to find a novel solution and the fact that data sizes are ever growing, this solution should be applicable for a long period of time. Thus the most important requirement a solution has to satisfy is scalability.
Scalability is defined as "the ability of a system to accept increased input volume without impacting the profits". This means that the gains from the input increment should be proportional to the increment itself. This is a broad definition used also in other fields, like economics. For a system to be fully scalable, the size of its input should not be a design parameter. Forcing the system designer to take into account all possible deployment sizes in order to cope with different input sizes leads to a scalable architecture without fundamental bottlenecks.
However, apart from scalability, there are other requirements for a large scale data intensive computing system. Real world systems cost money to build and operate. Companies attempt to find the most cost effective way of building a large system, because it usually requires a significant monetary investment. Partial upgradability is an important money saving feature, and is more easily attained with a loosely coupled system. Operational costs like system administrators' salaries account for a large share of the budget of IT departments. To be profitable, large scale systems must require as little human intervention as possible. Therefore autonomic systems are preferable: systems that are self-configuring, self-tuning and self-healing. In this respect, fault tolerance is a key property.
Fault tolerance is the property of a system to operate properly in spite of the failure of some of its components. When dealing with a large number of systems, the probability that a disk breaks or a server crashes rises dramatically: it is the norm rather than the exception. A performance degradation is acceptable as long as the system does not halt completely. A denial of service has a negative economic impact, especially for Web-based companies. The goal of fault tolerance techniques is to create a highly available system.
To summarize, a large scale data analysis system should be scalable,
cost effective and fault tolerant.
To make our discussion more concrete, we give some examples of DISC systems. A more detailed overview can be found in Chapter 2; here we just quickly introduce the systems we use in our research. While other similar systems exist, we have chosen these systems because of their availability as open source software and because of their widespread adoption both in academia and in industry. These factors increase the reusability of the results of our research and the chances of having practical impact on real-world problems.
The systems we make use of in this work implement two different paradigms for processing massive datasets: MapReduce (MR) and streaming. MapReduce offers the capability to analyze massive amounts of stored data, while streaming solutions are designed to process a multitude of updates every second. We provide a detailed description of these paradigms in Chapter 2.
Hadoop¹ is a distributed computing framework that implements the MapReduce paradigm (Dean and Ghemawat, 2004), together with a companion distributed file system called Hadoop Distributed File System (HDFS). Hadoop enables the distributed processing of huge datasets across clusters of commodity computers by means of a simple functional programming model.
A mention goes to Pig², a high level framework for data manipulation that runs on top of Hadoop (Olston et al., 2008). Pig is a very useful tool for data exploration, pre-processing and cleaning.
Finally, S4³ is a distributed scalable stream processing engine (Neumeyer et al., 2010). While still a young project, its potential lies in complementing Hadoop for stream processing.

¹ http://hadoop.apache.org
² http://pig.apache.org
³ http://incubator.apache.org/s4
1.4 Contributions
DISC systems are an emerging technology in the data analysis field that can be used to capitalize on massive datasets coming from the Web. "There is no data like more data" is a famous motto that epitomizes the opportunity to extract significant information by exploiting very large volumes of data. Information represents a competitive advantage for actors operating in the information society, an advantage that is all the greater the sooner it is achieved. Therefore, in the limit, online analytics will become an invaluable support for decision making.
To date, DISC systems have been successfully employed for batch processing, while their use for online analytics has not received much attention and is still an open area of research. Many data analysis algorithms spanning different application areas have already been proposed for DISC systems. So far, speedup and scalability results are encouraging. We give an overview of these algorithms in Chapter 2.
However, it is not clear in the research community which problems are a good match for DISC systems. More importantly, the ingredients and recipes for building a successful algorithm are still hidden. Designing efficient algorithms for these systems requires thinking at scale, carefully taking into account the characteristics of input data, trading off communication and computing, and addressing skew and load balancing problems. Meeting these requirements on a system with restricted primitives is a challenging task and an open area of research.
This thesis explores the landscape of algorithms for Web mining on
DISC systems and provides theoretical and practical insights on algo-
rithm design and performance optimization.
Our work builds on previous research in Data Mining, Information Retrieval and Machine Learning. The methodologies developed in these fields are essential to make sense of data on the Web. We also leverage Distributed Systems and Database research. The systems and techniques studied in these fields are the key to achieving acceptable performance on Web-scale datasets.
Figure 1.3: Complexity of contributed algorithms. [The figure plots data complexity (bags; graphs; streams & graphs) against algorithm structure (MR-iterative; MR-optimized; S4-streaming & MR), locating Similarity Self-Join, Social Content Matching, and Personalized Online News Recommendation.]
Concretely, our contributions can be mapped as shown in Figure 1.3.
We tackle three different problems that involve Web mining tasks on dif-
ferent categories of data. For each problem, we provide algorithms for
Data Intensive Scalable Computing systems.
First, we tackle the problem of similarity on bags of Web documents in Chapter 3. We present SSJ-2 and SSJ-2R, two algorithms specifically designed for the MapReduce programming paradigm. These algorithms are batch oriented and operate in a fixed number of steps.
Second, we explore graph matching in Chapter 4. We propose an application of matching to the distribution of content from social media and Web 2.0. We describe STACKMR and GREEDYMR, two iterative MapReduce algorithms with different performance and quality properties. Both algorithms provide approximation guarantees and scale to huge datasets.
Third, we investigate news recommendation for social network users
in Chapter 5. We propose a solution that takes advantage of the real-
time Web to provide personalized and timely suggestions. We present
T.REX, a methodology that combines stream and graph processing and
is amenable to parallelization on stream processing engines like S4.
To summarize, the main contribution of this thesis lies in addressing classical problems like similarity, matching and recommendation in the context of Web mining, and in providing efficient and scalable solutions that harness the power of DISC systems. While pursuing this general objective, we use some more specific goals as concrete stepping stones.
In Chapter 3 we show that carefully designing algorithms specifically for MapReduce gives substantial performance advantages over trivial parallelization. By leveraging efficient communication patterns, SSJ-2R outperforms state-of-the-art algorithms for similarity join in MapReduce by almost five times. Designing efficient MapReduce algorithms requires rethinking classical algorithms rather than using them as black boxes. By applying this principle we provide scalable algorithms for exact similarity computation without any need to trade off precision for performance.
In Chapter 4 we propose the first solution to the graph matching problem in MapReduce. STACKMR and GREEDYMR are two algorithms for graph matching with provable approximation guarantees and high practical value for large real-world systems. We further propose a general scalable computational pattern for iterative graph mining in MR. This pattern can support a variety of algorithms, and we show how to apply it to the two aforementioned algorithms for graph matching.
Finally, in Chapter 5 we describe a novel methodology that combines several signals from the real-time Web to predict user interest. T.REX is able to harness information extracted from user-generated content, social circles and topic popularity to provide personalized and timely news suggestions. The proposed system combines offline iterative computation on MapReduce and online processing of incoming data in a streaming fashion. This feature allows it both to provide always fresh recommendations and to cope with the large amount of input data.
Chapter 2
Related Work
In this chapter we give an overview of related work in terms of systems,
paradigms and algorithms for large scale Web mining.
We start by describing a general framework for DISC systems. Large scale data challenges have spurred the design of a multitude of new DISC systems. Here we review the most important ones and classify them according to our framework in a layered architecture. We further distinguish between batch and online systems to underline their different targets in the data and application spectrum. These tools compose the big data software stack used to tackle data intensive computing problems.
Then we offer a more detailed overview of the two most important
paradigms for large scale Web mining: MapReduce and streaming. These
two paradigms are able to cope with huge or even unbounded datasets.
While MapReduce offers the capability to analyze massive amounts of
stored data, streaming solutions offer the ability to process a multitude
of updates per second with low latency.
Finally, we review some of the most influential algorithms for large scale Web mining on DISC systems. We focus mainly on MapReduce algorithms, which have received the largest share of attention. We show different kinds of algorithms that have been proposed in the literature to process different types of data on the Web.
2.1 DISC systems
Even though existing DISC systems are very diverse, they share many common traits. For this reason we propose a general architecture of DISC systems that captures their commonalities in the form of a multi-layered stack, as depicted in Figure 2.1. A complete DISC solution is usually devised by assembling multiple components. We classify the various components into three layers and two sublayers.
Figure 2.1: DISC architecture. [From bottom to top: a coordination layer, a distributed data layer with a data abstraction sublayer, a computation layer, and a high level languages layer.]
At the lowest level we find a coordination layer that serves as a basic building block for the distributed services higher in the stack. This layer deals with basic concurrency issues.
The distributed data layer builds on top of the coordination one. This layer deals with distributed data access, but unlike a traditional distributed file system it does not offer standard POSIX semantics, for the sake of performance. The data abstraction layer is still part of the data layer and offers different, more sophisticated interfaces to data.
The computation layer is responsible for managing distributed processing. As with the data layer, generality is sacrificed for performance. Only embarrassingly data parallel problems are commonly solvable in this framework. The high level languages layer encompasses a number of languages, interfaces and systems that have been developed to simplify and enrich access to the computation layer.
Table 2.1: Major DISC systems.

                          Batch                            Online
  High Level Languages    Sawzall, SCOPE, Pig Latin,
                          Hive, DryadLINQ, FlumeJava,
                          Cascading, Crunch
  Computation             MapReduce, Hadoop, Dryad,        S4, Storm, Akka
                          Pregel, Giraph, Hama
  Data Abstraction                                         BigTable, HBase, PNUTS,
                                                           Cassandra, Voldemort
  Distributed Data        GFS, HDFS, Cosmos                Dynamo
  Coordination            Chubby, Zookeeper

Table 2.1 classifies some of the most popular DISC systems.
In the coordination layer we find two implementations of a consensus algorithm. Chubby (Burrows, 2006) is an implementation of Paxos (Lamport, 1998), while Zookeeper (Hunt et al., 2010) implements ZAB (Reed and Junqueira, 2008). They are distributed services for maintaining configuration information, naming, and providing distributed synchronization and group services. The main characteristics of these services are very high availability and reliability, at the expense of high performance.
On the next level, the distributed data layer presents different kinds of data storage. A common feature in this layer is to avoid full POSIX semantics in favor of simpler ones. Furthermore, consistency is somewhat relaxed for the sake of performance.
HDFS¹, Google File System (GFS) (Ghemawat et al., 2003) and Cosmos (Chaiken et al., 2008) are distributed file systems geared towards large batch processing. They are not general purpose file systems. For example, in HDFS files can only be appended but not modified, and in GFS a record might get appended more than once (at-least-once semantics). They use large blocks of 64 MB or more, which are replicated for fault tolerance. Dynamo (DeCandia et al., 2007) is a low latency key-value store used at Amazon. It has a Peer-to-Peer (P2P) architecture that uses consistent hashing for load balancing and a gossiping protocol to guarantee eventual consistency (Vogels, 2008).
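To make the load balancing idea concrete, here is a toy consistent-hash ring in the style used by Dynamo-like stores; the node names, the choice of hash function, and the absence of virtual nodes are simplifications for illustration, not details of Dynamo itself:

```python
import hashlib
from bisect import bisect_right

def _h(key: str) -> int:
    """Map a string to a point on the ring via a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring preceding its hash position.
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        # First node clockwise from the key's position (wrapping around).
        hashes = [h for h, _ in self.ring]
        i = bisect_right(hashes, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
assert owner in {"node-a", "node-b", "node-c"}
```

The property that matters for load balancing is that adding or removing a node only remaps the keys in its arc, instead of rehashing the whole keyspace.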
The systems described above are either mainly append-only, batch oriented file systems or simple key-value stores. However, it is sometimes convenient to access data in a different way, e.g., by using richer data models or by employing read/write operations. Data abstractions built on top of the aforementioned systems serve these purposes.
BigTable (Chang et al., 2006) and HBase² are non-relational data stores. They are actually multidimensional, sparse, sorted maps designed for semi-structured or unstructured data. They provide random, realtime read/write access to large amounts of data. Access to data is provided via primary key only, but each key can have more than one column. PNUTS (Cooper et al., 2008) is a similar storage service developed by Yahoo! that leverages geographic distribution and caching, but offers limited consistency guarantees. Cassandra (Lakshman and Malik, 2010) is an open source Apache project initially developed by Facebook. It features a BigTable-like interface on a Dynamo-style infrastructure. Voldemort³ is an open source non-relational database built by LinkedIn, basically a large persistent Distributed Hash Table (DHT).

¹ http://hadoop.apache.org/hdfs
² http://hbase.apache.org
³ http://project-voldemort.com
In the computation layer we find paradigms for large scale data intensive computing. They are mainly dataflow paradigms with support for automated parallelization. We can recognize the same pattern found in previous layers also here: trade off generality for performance.
MapReduce (Dean and Ghemawat, 2004) is a distributed computing engine developed by Google, while Hadoop⁴ is an open source clone. A more detailed description of this framework is presented in Section 2.2. Dryad (Isard et al., 2007) is Microsoft's alternative to MapReduce. Dryad is a distributed execution engine inspired by macro-dataflow techniques. Programs are specified by a Directed Acyclic Graph (DAG) whose vertices are operations and whose edges are data channels. The system takes care of scheduling, distribution, communication and execution on a cluster. Pregel (Malewicz et al., 2010), Giraph⁵ and Hama (Seo et al., 2010) are systems that implement the Bulk Synchronous Parallel (BSP) model (Valiant, 1990). Pregel is a large scale graph processing system developed by Google. Giraph implements Pregel's interface as a graph processing library that runs on top of Hadoop. Hama is a generic BSP framework for matrix processing.
S4 (Neumeyer et al., 2010) by Yahoo!, Storm⁶ by Twitter and Akka⁷ are distributed stream processing engines that implement the Actor model (Agha, 1986). They target a different part of the spectrum of big data, namely online processing of high-speed and high-volume event streams. Inspired by MapReduce, they provide a way to scale out stream processing on a cluster by using simple functional components.
At the last level we find high level interfaces to these computing systems. These interfaces are meant to simplify writing programs for DISC systems. Even though this task is easier than writing custom MPI code, DISC systems still offer fairly low level programming interfaces, which require the knowledge of a full programming language. The interfaces at this level allow even non-programmers to perform large scale processing.

⁴ http://hadoop.apache.org
⁵ http://incubator.apache.org/giraph
⁶ https://github.com/nathanmarz/storm
⁷ http://akka.io

Sawzall (Pike et al., 2005), SCOPE (Chaiken et al., 2008) and Pig Latin
(Olston et al., 2008) are special purpose scripting languages for MapReduce, Dryad and Hadoop. They are able to perform filtering, aggregation, transformation and joining. They share many features with SQL but are easy to extend with UDFs. These tools are invaluable for data exploration and pre-processing. Hive (Thusoo et al., 2009) is a data warehousing system that runs on top of Hadoop and HDFS. It answers queries expressed in a SQL-like language called HiveQL on data organized in tabular format. FlumeJava (Chambers et al., 2010), DryadLINQ (Yu et al., 2008), Cascading⁸ and Crunch⁹ are native language integration libraries for MapReduce, Dryad and Hadoop. They provide an interface to build pipelines of operators from traditional programming languages, run them on a DISC system and access results programmatically.

⁸ http://www.cascading.org
⁹ https://github.com/cloudera/crunch
2.2 MapReduce
When dealing with large datasets like the ones coming from the Web, the costs of serial solutions are not acceptable. Furthermore, the size of the dataset and supporting structures (indexes, partial results, etc.) can easily outgrow the storage capabilities of a single node. The MapReduce
paradigm (Dean and Ghemawat, 2004, 2008) is designed to deal with the
huge amount of data that is readily available nowadays. MapReduce
has gained increasing attention due to its adaptability to large clusters
of computers and to the ease of developing highly parallel and fault-
tolerant solutions. MR is expected to become the normal way to deal
with massive datasets in the future (Rajaraman and Ullman, 2010).
MapReduce is a distributed computing paradigm inspired by concepts from functional languages. More specifically, it is based on two higher order functions: Map and Reduce. The Map function reads the input as a list of key-value pairs and applies a UDF to each pair. The result is a second list of intermediate key-value pairs. This list is sorted and grouped by key in the shuffle phase, and used as input to the Reduce function. The Reduce function applies a second UDF to each intermediate key with all its associated values to produce the final result. The two phases are strictly non-overlapping. The general signatures of the two phases of a MapReduce computation are as follows:
Map: ⟨k₁, v₁⟩ → [⟨k₂, v₂⟩]
Reduce: ⟨k₂, [v₂]⟩ → [⟨k₃, v₃⟩]
The Map and Reduce functions are purely functional, and thus without side effects. This property makes them easily parallelizable, because each input key-value pair is independent of the others. Fault tolerance is also easily achieved by just re-executing the failed function instance.
MapReduce assumes a distributed file system from which the Map instances retrieve their input data. The framework takes care of moving, grouping and sorting the intermediate data produced by the various mappers (tasks that execute the Map function) to the corresponding reducers (tasks that execute the Reduce function).
The programming interface is easy to use and does not require any explicit control of parallelism. A MapReduce program is completely defined by the two UDFs run by mappers and reducers. Even though the
paradigm is not general purpose, many interesting algorithms can be
implemented on it. The most paradigmatic application is building an inverted index for a Web search engine. Simplistically, the algorithm reads the crawled and filtered Web documents from the file system, and for every word it emits the pair ⟨word, doc_id⟩ in the Map phase. The Reduce phase simply groups all the document identifiers associated with the same word, ⟨word, [doc_id₁, doc_id₂, ...]⟩, to create an inverted list.
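The inverted index example can be simulated in a few lines of single-process Python; the shuffle is mimicked by sorting and grouping the intermediate pairs (the toy documents and tokenization are, of course, illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Emit one (word, doc_id) pair per word occurrence.
    for word in text.split():
        yield (word, doc_id)

def reduce_phase(word, doc_ids):
    # Group all document identifiers for the same word into an inverted list.
    return (word, sorted(set(doc_ids)))

docs = {1: "big data on the web", 2: "mining the web"}

# "Shuffle": sort intermediate pairs by key, then group them for the reducers.
intermediate = sorted(p for d, t in docs.items() for p in map_phase(d, t))
index = dict(reduce_phase(w, [d for _, d in pairs])
             for w, pairs in groupby(intermediate, key=itemgetter(0)))

print(index["web"])  # [1, 2] — both documents contain "web"
```

In a real deployment the mappers and reducers run on different machines and the framework performs the sort and grouping, but the dataflow is exactly this.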
The MapReduce data flow is illustrated in Figure 2.2. The mappers read their data from the distributed file system. The file system is normally co-located with the computing system, so that most reads are local. Each mapper reads a split of the input, applies the Map function to each key-value pair, and potentially produces one or more output pairs. Mappers sort and write intermediate values on the local disk.
Each reducer in turn pulls the data from various remote locations. Intermediate key-value pairs are already partitioned and sorted by key by the mappers, so the reducer just merge-sorts the different partitions to
Figure 2.2: Data flow in the MapReduce programming paradigm. [The diagram shows mappers reading input splits from the DFS, a partition & sort step, the shuffle with merge & group, and reducers writing their output back to the DFS.]
group the same keys together. This phase is called shuffle, and is the most expensive in terms of I/O operations. The shuffle phase can partially overlap with the Map phase. Indeed, intermediate results from mappers can start being transferred as soon as they are written to disk. In the last phase each reducer applies the Reduce function to the intermediate key-value pairs and writes the final output to the file system.
MapReduce has become the de-facto standard for the development of large scale applications running on thousands of inexpensive machines, especially since the release of its open source implementation Hadoop.
Hadoop is an open source MapReduce implementation written in Java. Hadoop also provides a distributed file system called HDFS, used as a source and sink for MapReduce jobs. Data is split into chunks, distributed and replicated among the nodes, and stored on local disks. MR and HDFS daemons run on the same nodes, so the framework knows which node contains the data. Great emphasis is placed on data locality. The scheduler tries to run mappers on the same nodes that hold the input data, in order to reduce network traffic during the Map phase.
2.2.1 Computational Models and Extensions
A few computational models for MapReduce have been proposed. Afrati and Ullman (2009) propose an I/O cost model that captures the essential features of many DISC systems. The key assumptions of the model are:

- Files are replicated sets of records stored on a distributed file system with a very large block size b, and can be read and written in parallel by processes;

- Processes are the conventional unit of computation, but have limits on I/O: a lower limit of b (the block size) and an upper limit of s, a quantity that can represent the available main memory;

- Processors are the computing nodes, with a CPU, main memory and secondary storage, and are available in infinite supply.
The authors present various algorithms for multiway join and sorting, and analyze the communication and processing costs for these examples. Differently from standard MR, an algorithm in this model is a DAG of processes, in a way similar to Dryad. Additionally, the model assumes that keys are not delivered in sorted order to the Reduce. Because of these departures from the traditional MR paradigm, the model is not appropriate to compare real-world algorithms developed for Hadoop.
Karloff et al. (2010) propose a novel theoretical model of computation for MapReduce. The authors formally define the Map and Reduce functions and the steps of a MR algorithm. Then they proceed to define a new algorithmic class: MRC^i. An algorithm in this class is composed of a finite sequence of Map and Reduce rounds with some limitations. Given an input of size n:

- each Map or Reduce is implemented by a random access machine that uses sub-linear space and polynomial time in n;

- the total size of the output of each Map is less than quadratic in n;

- the number of rounds is O(log^i n).
The model makes a number of assumptions on the underlying infrastructure to derive the definition. The number of available processors is assumed to be sub-linear. This restriction guarantees that algorithms in MRC are practical. Each processor has a sub-linear amount of memory. Given that the Reduce phase cannot begin until all the Maps are done, the intermediate results must be stored temporarily in memory. This explains the space limit on the Map output, which is given by the total memory available across all the machines. The authors give examples of algorithms for graph and string problems. The result of their analysis is an algorithmic design technique for MRC.
A number of extensions to the base MR system have been developed.
Many of these works focus on extending MR towards the database area.
Yang et al. (2007) propose an extension to MR in order to simplify the
implementation of relational operators. More specifically, they target the
implementation of join, complex multi-table select, and set operations.
The normal MR workflow is extended with a third, final Merge phase.
This function takes as input two different key-value pair lists and outputs
a third key-value pair list. The model assumes that the output of the
Reduce function is fed to the Merge. The signatures are as follows.
Map:    ⟨k_1, v_1⟩ → [⟨k_2, v_2⟩]
Reduce: ⟨k_2, [v_2]⟩ → ⟨k_2, [v_3]⟩
Merge:  ⟨⟨k_2, [v_3]⟩, ⟨k_3, [v_4]⟩⟩ → [⟨k_4, v_5⟩]
σ(d_i, d_j) = Σ_{0 ≤ t < |L|} d_i[t] · d_j[t]
Note that the results discussed in this chapter can be easily applied to
other similarity measures, such as Jaccard, Dice or overlap.
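As a concrete illustration, the dot-product contribution used throughout this chapter and a Jaccard variant can be sketched as follows (a toy sketch, not the thesis implementation; documents are assumed to be sparse dicts mapping terms to weights, and the vectors are not normalized, as in the examples of this chapter):

```python
def cosine(di, dj):
    """Dot product of two sparse term-weight vectors (dicts term -> weight).
    With unit-normalized vectors this equals the cosine similarity sigma."""
    # Iterate over the smaller vector for efficiency.
    if len(di) > len(dj):
        di, dj = dj, di
    return sum(w * dj[t] for t, w in di.items() if t in dj)

def jaccard(di, dj):
    """Jaccard similarity of the two term sets (binary variant)."""
    a, b = set(di), set(dj)
    return len(a & b) / len(a | b)
```

Documents that share no term score zero under both measures, which is what the pruning strategies below exploit.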
Differently from normal join problems, self-join involves joining a
large table with itself. Thus it is not possible to apply traditional algo-
rithms for data warehousing, which join a large fact table with smaller di-
mension tables. Recent algorithms for log processing in MR also assume
that one of the tables is much smaller than the other (Blanas et al., 2010).
Following Arasu et al. (2006); Xiao et al. (2009), we classify algorithms
for similarity self-join by their solution to the following sub-problems:
1. Signature scheme: a compact representation of each document;
2. Candidate generation: identifying potentially similar document
pairs given their signature;
3. Verification: computing document similarity;
4. Indexing: data structures for speeding up candidate generation
and verification.
The naïve approach produces the document itself as the signature.
It generates all the O(N^2) possible document-signature pairs, and com-
putes their actual similarity without any supporting indexing structure.
This approach is overly expensive in the presence of sparse data such
as bags of words. It is very common to have documents that do not
share any term, leading to a similarity of zero. In this case, smart pruning
strategies can be exploited to discard such document pairs early.
Broder et al. (1997) adopt a simple signature scheme, which we refer
to as Term-Filtering. The signature of a document is given by the terms
occurring in that document. These signatures are stored in an inverted
index, which is then used to compare only pairs of documents sharing at
least one item in their signature, i.e., one term.
Prefix-Filtering (Chaudhuri et al., 2006) is an extension of the Term-
Filtering idea. Let d̂ be an artificial document such that d̂[t] = max_{d∈D} d[t].
Given a document d, let its boundary b(d) be the largest integer such that
Σ_{0 ≤ t < b(d)} d[t] · d̂[t] < σ̄. The signature of the document d is the set of
terms occurring in the document vector beginning from position b(d),
S(d) = {b(d) ≤ t < |L| | d[t] ≠ 0}. It is easy to show that if the signatures
of two documents d_i, d_j have empty intersection, then the two docu-
ments have similarity below the threshold. Without loss of generality, let
us assume that b(d_i) ≤ b(d_j).
(S(d_i) ∩ S(d_j) = ∅)  ⟹  Σ_{b(d_i) ≤ t < |L|} d_i[t] · d_j[t] = 0  ⟹

σ(d_i, d_j) = Σ_{0 ≤ t < b(d_i)} d_i[t] · d_j[t] ≤ Σ_{0 ≤ t < b(d_i)} d_i[t] · d̂[t] < σ̄
Therefore, only document signatures are stored in the inverted index
and later used to generate candidate document pairs. Eventually, the full
documents need to be retrieved in order to compute the actual similarity.
This technique was initially proposed for set-similarity joins and later
extended to documents with real-valued vectors.
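The boundary and signature computation can be sketched as follows (an illustrative sketch, not the original implementation; documents are assumed to be dense weight lists indexed by lexicon position, d_hat is the artificial maximum document, and threshold plays the role of σ̄):

```python
def boundary(d, d_hat, threshold):
    """b(d): the largest b such that sum_{0 <= t < b} d[t] * d_hat[t]
    stays strictly below the similarity threshold."""
    total, b = 0.0, 0
    while b < len(d) and total + d[b] * d_hat[b] < threshold:
        total += d[b] * d_hat[b]
        b += 1
    return b

def signature(d, d_hat, threshold):
    """S(d): the nonzero terms at positions b(d) <= t < |L|."""
    b = boundary(d, d_hat, threshold)
    return {t for t in range(b, len(d)) if d[t] != 0}
```

Only the terms in S(d) are indexed; a higher threshold prunes a longer prefix of each document.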
In particular, Bayardo et al. (2007) adopt an online indexing and match-
ing approach. Index generation, candidate pair generation and similarity
score computation are all performed simultaneously and incrementally.
Each document is matched against the current index to find potential
candidates. The similarity scores are computed by retrieving the candi-
date documents directly from the collection. Finally, a signature is ex-
tracted from the current document and added to the index.
This approach results in a single full scan of the input data, a small
number of random accesses to single documents, and a smaller (on av-
erage) index to be queried. Prefix-Filtering outperforms alternative tech-
niques such as LSH (Gionis et al., 1999), PartEnum (Arasu et al., 2006)
and ProbeCount-Sort (Sarawagi and Kirpal, 2004), which we therefore
do not consider in our work. Unfortunately, an incremental approach
that leverages a global shared index is not practical on a large parallel
system because of contention on the shared resource.
Finally, Xiao et al. (2008) present some additional pruning techniques
based on positional and suffix information. However, these techniques
are specifically tailored for set-similarity joins and cannot be directly ap-
plied to document similarity.
3.3 Related work
In this section we describe two MR algorithms for the similarity self-join
problem. Each exploits one of the pruning techniques discussed above.
3.3.1 MapReduce Term-Filtering (ELSA)
Elsayed et al. (2008) present a MapReduce implementation of the Term-
Filtering method. Hereafter we refer to it as ELSA for convenience. The
authors propose an algorithm for computing the similarity of every doc-
ument pair, which can be easily adapted to our setting by adding a sim-
ple post-filtering. The algorithm runs two consecutive MR jobs: the first
builds an inverted index and the second computes the similarities.
Indexing: Given a document d_i, for each term t, the mapper emits the term
as the key, and a tuple ⟨i, d_i[t]⟩ consisting of document ID and weight as
the value. The shuffle phase of MR groups these tuples by term and
delivers these inverted lists to the reducers, which write them to disk.
Map:    ⟨i, d_i⟩ → [⟨t, ⟨i, d_i[t]⟩⟩ | d_i[t] > 0]
Reduce: ⟨t, [⟨i, d_i[t]⟩, ⟨j, d_j[t]⟩, . . .]⟩ → [⟨t, [⟨i, d_i[t]⟩, ⟨j, d_j[t]⟩, . . .]⟩]
Similarity: Given the inverted list of term t, the mapper produces the
contribution w_ij[t] = d_i[t] · d_j[t] for every pair of documents where the
term t co-occurs. This value is associated with a key consisting of the pair
of document IDs ⟨i, j⟩. For any document pair, the shuffle phase will pass
to the reducer the contribution list W_ij = {w_ij[t] | w_ij[t] > 0, t ∈ L}
from the various terms, which simply need to be summed up.
Map:    ⟨t, [⟨i, d_i[t]⟩, ⟨j, d_j[t]⟩, . . .]⟩ → [⟨⟨i, j⟩, w_ij[t]⟩]
Reduce: ⟨⟨i, j⟩, W_ij⟩ → ⟨⟨i, j⟩, σ(d_i, d_j) = Σ_{w ∈ W_ij} w⟩
The Term-Filtering is exploited implicitly: the similarity of two doc-
uments that do not share any term is never evaluated, because none of
their terms ever occurs in the same inverted list.
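For concreteness, the two ELSA jobs can be mimicked in a few lines of sequential Python (a toy in-memory sketch of the data flow, not the authors' Hadoop code; docs maps document IDs to sparse term-weight dicts):

```python
from collections import defaultdict

def run_elsa(docs):
    """Toy simulation of ELSA's two MapReduce jobs.
    docs: dict mapping doc id -> dict of term -> weight."""
    # Job 1 (indexing): map emits <term, (id, weight)>; the shuffle
    # groups by term, yielding one inverted list per term.
    index = defaultdict(list)
    for i, d in docs.items():
        for t, w in d.items():
            if w > 0:
                index[t].append((i, w))
    # Job 2 (similarity): for each inverted list, the map emits the
    # contribution w_ij[t] = d_i[t] * d_j[t] keyed by the pair (i, j);
    # the reduce sums the contributions of each pair.
    scores = defaultdict(float)
    for t, postings in index.items():
        for x in range(len(postings)):
            for y in range(x + 1, len(postings)):
                (i, wi), (j, wj) = postings[x], postings[y]
                scores[tuple(sorted((i, j)))] += wi * wj
    return dict(scores)
```

On the three example documents of Figure 3.1 (d_1 = "A A B C", d_2 = "B D D", d_3 = "A B B C", unnormalized) this yields the scores 5, 1 and 2 reported in the figure.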
The main drawback of this approach is the large number of candidate
document pairs that it generates. In fact, there are many
document pairs that share only a few terms but have a simi-
larity largely below the threshold. The number of candidates directly
affects completion time. More importantly, it determines the shuffle size,
the volume of intermediate data to be moved from mappers to reducers.
Minimizing shuffle size is critical because network bandwidth is the only
resource that cannot be easily added to a cluster.
Finally, recall that reducers may start only after every mapper has
completed. Therefore, the presence of a few long inverted lists for com-
mon terms induces a significant load imbalance. Most of the resources
will remain idle until the processing of the longest inverted list is com-
pleted. To alleviate this issue, Elsayed et al. remove the top 1% most fre-
quent terms, thus trading precision for efficiency. Our goal is to provide
an efficient algorithm that computes the exact similarity.
Figure 3.1 illustrates an example of how ELSA works. For ease of
explanation, the document vectors are not normalized. Term-Filtering
avoids computing similarity scores of documents that do not share any
term. For the special case in which an inverted list contains only one doc-
ument, the similarity Map function does not produce any output. The
two main problems of ELSA are evident from this example.
First, long inverted lists may produce a load imbalance and slow down
the algorithm considerably, as for term "B" in the figure. Second, the
algorithm computes low similarity scores which are not useful for the
typical applications, as for the document pair ⟨d_1, d_2⟩ in the figure.
Figure 3.1: ELSA example (documents d_1 = "A A B C", d_2 = "B D D",
d_3 = "A B B C"; the indexing job builds inverted lists for terms A-D, and
the similarity job outputs σ(d_1, d_3) = 5, σ(d_1, d_2) = 1, σ(d_2, d_3) = 2).
3.3.2 MapReduce Prefix-Filtering (VERN)
Vernica et al. (2010) present a MapReduce algorithm based on Prefix-
Filtering that uses only one MR step. Indeed, the authors discuss several
algorithms for set-similarity join operations. Here we describe the best
performing variant for set-similarity self-join. Hereafter we refer to it as
VERN for convenience. For each term in the signature of a document,
t ∈ S(d_i) as defined by Prefix-Filtering, the Map function outputs a tuple
with key the term t itself and value the whole document d_i. The shuffle
phase delivers to each reducer a small sub-collection of documents that
share at least one term in their signatures. This process can be thought of
as the creation of an inverted index of the signatures, where each post-
ing is the document itself rather than a simple ID. Finally, each reducer
finds similar pairs among candidates by using state-of-the-art serial algo-
rithms (Xiao et al., 2008). The Map and Reduce functions are as follows.
Map:    ⟨i, d_i⟩ → [⟨t, d_i⟩ | t ∈ S(d_i)]
Reduce: ⟨t, T_v = [d_i, d_j, . . .]⟩ → [⟨⟨i, j⟩, σ(d_i, d_j)⟩ | d_i, d_j ∈ T_v]
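A sketch of this single job (illustrative only; sig and sim stand for the Prefix-Filtering signature and the chosen similarity function, both assumptions of this sketch):

```python
from collections import defaultdict
from itertools import combinations

def run_vern(docs, sig, sim):
    """Toy simulation of VERN's single MapReduce job.
    docs: id -> document; sig(d): signature terms; sim(a, b): similarity."""
    # Map: replicate each document once per signature term.
    groups = defaultdict(list)
    for i, d in docs.items():
        for t in sig(d):
            groups[t].append((i, d))
    # Reduce: each reducer scores every candidate pair it receives;
    # pairs sharing several signature terms are scored more than once.
    results = []
    for t, docs_t in groups.items():
        for (i, di), (j, dj) in combinations(docs_t, 2):
            results.append(((i, j), sim(di, dj)))
    return results
```

Note how a pair sharing m signature terms is scored m times, which is exactly the duplication problem discussed next.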
Nonetheless, this approach does not bring a significant improvement
over the previous strategy. First, a document with a signature of length
n is replicated n times, even if there are no other documents with an
overlapping signature. Second, pairs of similar documents that have m
common terms in their signatures are produced m times at different re-
ducers. Computing these duplicates is an overhead of parallelization, and
their presence requires a second scan of the output to get rid of them.
Figure 3.2 illustrates an example run of VERN. Light gray terms are
pruned from the document by using Prefix-Filtering. Each document is
replicated once for each non-pruned term it contains. Finally, the reducer
computes the similarity of the bag of documents it receives by employ-
ing a serial SSJ algorithm. As shown in Figure 3.2, VERN computes the
similarity of the pair ⟨d_1, d_3⟩ multiple times at different reducers.
We propose here a simple strategy that solves the issue of duplicate
similarity computation. We use this modified version of the original
VERN algorithm in our evaluation. This modified version runs slightly
but consistently faster than the original one, and it does not require post-
processing. The duplicate similarities come from pairs of documents that
share more than one term in their signatures. Let S_ij = S(d_i) ∩ S(d_j)
be the set of such shared terms for a document pair ⟨d_i, d_j⟩. Let t̂ be the
last of these terms in the order imposed on the lexicon by Prefix-Filtering,
t̂ = t_x | t_x ⪰ t_y ∀ t_x, t_y ∈ S_ij. The reducer that receives each pair can
simply check whether its key corresponds to t̂, and compute the similarity
only in this case. The term t̂ for each pair of documents can be efficiently
computed by a simple backwards scan of the two arrays that represent the
documents. This strategy avoids computing duplicate similarities and
does not impose any additional overhead on the original algorithm.
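The check can be sketched as follows (hypothetical helper names; order maps each term to its rank in the Prefix-Filtering lexicon ordering):

```python
def last_shared_term(sig_i, sig_j, order):
    """t-hat: the shared signature term that comes last in the lexicon
    order imposed by Prefix-Filtering."""
    shared = sig_i & sig_j
    return max(shared, key=order.__getitem__) if shared else None

def should_compute(term, sig_i, sig_j, order):
    """A reducer keyed by `term` scores the pair only if term == t-hat,
    so each pair is computed exactly once across all reducers."""
    return term == last_shared_term(sig_i, sig_j, order)
```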
Figure 3.2: VERN example (each document is replicated once per non-
pruned signature term; the reducers for both terms B and C receive the
pair ⟨d_1, d_3⟩ and compute σ(d_1, d_3) = 5 twice).
3.4 SSJ Algorithms
In this section we describe two algorithms based on Prefix-Filtering that
overcome the weak points of the algorithms presented above. The first
algorithm, named SSJ-2, performs Prefix-Filtering in two MR steps.
Note that in the first step we use a naïve indexing algorithm, but if needed
more advanced techniques can be used (McCreadie et al., 2009b, 2011).
The second algorithm, named SSJ-2R, additionally uses a remainder file
to broadcast data that is likely to be used by every reducer. It also effec-
tively partitions the search space to reduce the memory footprint. Code
for both algorithms is open source and available online.²
3.4.1 Double-Pass MapReduce Prefix-Filtering (SSJ-2)
This algorithm is an extension of ELSA, and also consists of an indexing
phase followed by a similarity computation phase.
SSJ-2 shortens the inverted lists by employing Prefix-Filtering. As
shown in Figure 3.3, the effect of Prefix-Filtering is to reduce the portion
of each document that is indexed. The terms occurring in d_i up to position
b(d_i), or b_i for short, need not be indexed. By sorting terms in decreasing
order of frequency, the most frequent terms are discarded. This pruning
shortens the longest inverted lists and brings a significant performance gain.
² https://github.com/azaroth/Similarity-Self-Join
To improve load balancing, we employ a simple bucketing technique.
During the indexing phase we randomly hash the inverted lists to differ-
ent buckets. This spreads the longest lists uniformly among the buckets.
Each bucket is then consumed by a different mapper in the next phase.
Figure 3.3: Pruned document pair: the left part (orange/light) has been
pruned, the right part (blue/dark) has been indexed.
Unfortunately, with this scheme the reducers do not have enough in-
formation to compute the similarity between d_i and d_j. They receive only
the contributions from terms t ⪰ b_j, assuming b_i ≤ b_j. For this reason, the
reducers need two additional remote I/O operations to retrieve the two
documents d_i and d_j from the underlying distributed file system.
SSJ-2 improves over both ELSA and VERN in terms of the amount
of intermediate data produced. Thanks to Prefix-Filtering, the inverted
lists are shorter than in ELSA, and therefore the number of candidate
pairs generated decreases quadratically. In VERN each document is al-
ways sent to a number of reducers equal to the size of its signature, even
if no other document signature shares any term with it. In SSJ-2 a pair
of documents is sent to the reducer only if they share at least one term
in their signature. Furthermore, while SSJ-2 computes the similarity be-
tween any two given documents just once, VERN computes the similarity
of two documents in as many reducers as the size of the intersection of
their signatures. Therefore, SSJ-2 has no computational overhead due to
parallelization, and significantly reduces communication costs.
Figure 3.4 presents an example for SSJ-2. As before, light gray terms
have been pruned and documents are not normalized to ease the expla-
nation. Pairs of documents that are below the similarity threshold are
discarded early. However, the reducer of the similarity phase needs to
access the remote file system to retrieve the documents in order to com-
pute the correct final similarity score.
Figure 3.4: SSJ-2 example (the indexing phase prunes term A; the simi-
larity reducer receives the partial contributions [2, 1] for the pair ⟨d_1, d_3⟩
and retrieves d_1 and d_3 from HDFS to compute the final score 5).
3.4.2 Double-Pass MapReduce Prefix-Filtering with Re-
mainder File (SSJ-2R)
For any given pair of documents ⟨d_i, d_j⟩, the reducers of SSJ-2 receive
only partial information. Therefore, they need to retrieve the full docu-
ments in order to correctly compute their similarity. Since a node runs
multiple reducers over time, each of them remotely accessing two doc-
uments, in the worst case this is equivalent to broadcasting the full col-
lection to every node, with obvious limitations to the scalability of the
algorithm. We thus propose an improved algorithm named SSJ-2R that
does not perform any remote random access, and that leverages the non-
indexed portion of the documents.
Let us make a few interesting observations. First, some of the terms
are very common, and therefore used to compute the similarity of most
document pairs. For those terms, it would be more efficient to broad-
cast their contributions rather than pushing them through the MR frame-
work. Indeed, this piece of information is exactly what is pruned via Prefix-
Filtering during the indexing phase. We thus propose to store the pruned
portion of each document in a remainder file D_R, which can be later re-
trieved by each reducer from the underlying distributed file system.
Second, the remainder file D_R does not contain all the information
needed to compute the final similarity. Consider Figure 3.3: each reducer
receives the contributions {w_ij[t] = d_i[t] · d_j[t] | t ⪰ b_j}. D_R contains in-
formation about the terms {d_i[t] | t ≺ b_i} and {d_j[t] | t ≺ b_j}. But, subtly,
no information is available for those terms {d_i[t] | b_i ⪯ t ≺ b_j}. On the
one hand, those term weights are not in the remainder file, because they
have been indexed. On the other hand, the corresponding inverted lists
contain the weights of the terms in d_i but not of those occurring in d_j,
and therefore the weights w_ij[t] = d_i[t] · d_j[t] cannot be produced.
We thus propose to let one of the two documents be delivered to
the reducer through the MR framework, by shuffling it together with the
weights of the document pairs. Given two documents d_i and d_j as shown
in Figure 3.3, if b_i ≤ b_j we call d_i the Least Pruned Document and d_j the
Most Pruned Document. Our goal is to group at the reducer each docu-
ment d with the contributions w of every document pair ⟨d_i, d_j⟩ for which
d is the least pruned document between d_i and d_j. This document con-
tains the pieces of information that we are missing in order to compute
the final similarity score. This can be achieved by properly defining the
keys produced by the mappers and their sorting and grouping operators.
More formally, we define two functions LPD(d_i, d_j) and MPD(d_i, d_j).
LPD(d_i, d_j) = i, if b_i ≤ b_j; j, otherwise.
MPD(d_i, d_j) = j, if b_i ≤ b_j; i, otherwise.
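In code, assuming a mapping b from document ID to boundary (a trivial sketch):

```python
def lpd(i, j, b):
    """Least Pruned Document: the member of the pair with the smaller
    boundary, i.e. the one with the larger indexed part."""
    return i if b[i] <= b[j] else j

def mpd(i, j, b):
    """Most Pruned Document: the other member of the pair."""
    return j if b[i] <= b[j] else i
```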
First, we slightly modify the similarity Map function such that, for ev-
ery document pair of the given inverted list, it produces as key a couple
of document IDs where the first is always the LPD, and as value the
MPD and the usual weight w.

Map: ⟨t, [⟨i, d_i[t]⟩, ⟨j, d_j[t]⟩, . . .]⟩ →
     [⟨⟨LPD(d_i, d_j), MPD(d_i, d_j)⟩, ⟨MPD(d_i, d_j), w_ij[t]⟩⟩]
Second, we take advantage of the possibility of running independent
Map functions whose outputs are then shuffled together. We define a
Map function that takes the input collection and outputs its documents.

Map: ⟨i, d_i⟩ → ⟨⟨i, ∅⟩, d_i⟩

where ∅ is a special value such that ∅ < i ∀ d_i ∈ D.
Third, we make use of the possibility offered by Hadoop to rede-
fine parts of its communication pattern. The process we describe here
is commonly known as secondary sort in MR parlance. We redefine the
key partitioning function, which selects which reducer will process the
key. Our function takes into account only the first document ID in the
key. This partitioning scheme guarantees that all the key-value pairs that
share the same LPD will end up in the same partition, together with one
copy of the LPD document itself coming from the second Map function.
We instruct the shuffle phase of MR to sort the keys in each partition
in ascending order. This order takes into account both IDs, the first as
primary key and the second as secondary key. Consequently, for each
document d_i, MR builds a list of key-value pairs such that the first key is
⟨i, ∅⟩, followed by every pair of document IDs for which d_i is the LPD.
Finally, we override the grouping function, which determines whether
two keys belong to the same equivalence class. If two keys are equiva-
lent, their associated values will be processed by the same single call to
the Reduce function. Concretely, the grouping function builds the argu-
ment list for a call to the Reduce function by selecting values from the
partition such that the corresponding keys are equivalent. Therefore, our
grouping function must be consistent with our partitioning scheme: two
keys ⟨i, j⟩ and ⟨i′, j′⟩ are equivalent if and only if i = i′.
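The three overrides can be sketched as follows (a sketch of the logic only, not Hadoop's Java API; keys are (LPD, MPD) tuples, and SPECIAL stands for the ∅ sentinel):

```python
SPECIAL = ''  # sorts before any real document ID (the ∅ sentinel)

def partition(key, num_reducers):
    """Partitioner: hash only the LPD, so all pairs with the same LPD
    (and the LPD's own ⟨i, ∅⟩ record) land in the same partition."""
    lpd_id, _ = key
    return hash(lpd_id) % num_reducers

def sort_key(key):
    """Sort comparator: primary on LPD, secondary on MPD; the ⟨i, ∅⟩
    record therefore comes first in its group."""
    return key

def group_equal(key_a, key_b):
    """Grouping comparator: keys are equivalent iff they share the LPD."""
    return key_a[0] == key_b[0]
```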
The input of the Reduce function is thus as follows.

⟨⟨i, ∅⟩, [d_i, ⟨j, w_ij[t′]⟩, ⟨j, w_ij[t″]⟩, . . . , ⟨k, w_ik[t′]⟩, ⟨k, w_ik[t″]⟩, . . .]⟩
The key is just the first key in the group, as they are all equivalent to
MR. The first value is the LPD document d_i. Thanks to the sorting of the
keys, it is followed by a set of contiguous stripes of weights. Each stripe
is associated with the same document d_j, where j = MPD(d_i, d_j).
SSJ-2R can now compute the similarity σ(d_i, d_j) in one pass. Before
the algorithm starts, each reducer loads the remainder file D_R in mem-
ory. For each Reduce call, SSJ-2R starts by caching d_i, which is at the
beginning of the list of values. Then, for each d_j in the list, it sums up all
the corresponding weights w_ij to compute a partial similarity. Finally, it
retrieves the pruned portion of d_j from D_R, computes the similarity of
d_i to the pruned portion of d_j, and adds this to the partial similarity com-
puted before. This two-step process correctly computes the final similar-
ity σ(d_i, d_j), because all the terms b_i ⪯ t ≺ b_j are available both in d_i and
in the pruned portion of d_j. This process is repeated for each stripe of
weights belonging to the same document pair in the list of values.
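One Reduce call can thus be sketched as follows (illustrative; stripes collects the weight groups per MPD, remainder is the in-memory remainder file, and dot is a sparse dot product, all assumptions of this sketch):

```python
def ssj2r_reduce(lpd_doc, stripes, remainder, dot):
    """One SSJ-2R Reduce call: cache the LPD document, then for each MPD
    j sum the shuffled partial weights and add the contribution of the
    pruned part of d_j taken from the remainder file."""
    scores = {}
    for j, weights in stripes.items():
        partial = sum(weights)                 # indexed-part contributions
        partial += dot(lpd_doc, remainder[j])  # pruned-part contributions
        scores[j] = partial
    return scores
```

With the example of Figure 3.5 (d_1 fully delivered, partial weights [2, 1] for ⟨d_1, d_3⟩, and the pruned part of d_3 taken from the remainder file), this yields the final score 5.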
Differently from SSJ-2, SSJ-2R performs no remote access to the doc-
ument collection. Given a document pair, one document is delivered fully
through the MapReduce framework, and the other is partially retrieved
from the remainder file D_R. The advantage in terms of communication
costs is given by the remainder file. Indeed, D_R is much smaller than the
input collection; we show experimentally that its size is one order of
magnitude smaller than the input. Therefore, D_R can be efficiently broadcast
to every node through the distributed file system. Each reducer can load it
in main memory, so that the similarity computation can be fully accom-
plished without any additional disk access.
Figure 3.5 illustrates the full data flow of SSJ-2R with an example. As
before, light gray terms have been pruned and documents are not nor-
malized. SSJ-2R stores the pruned part of each document in the remain-
der file, which is put in the distributed cache in order to be broadcast.
The reducer of the similarity phase reads the remainder file into memory
before starting to process its value list. In the example, d_1 is delivered
through the MR framework, while the pruned part of d_3 is recovered
from the remainder file. This procedure allows us to correctly compute the
contribution of the term A, which was pruned from the documents.
The contributions from terms B and C are computed by the mapper
of the similarity phase and delivered normally through MR.
3.4.3 Partitioning
Even though the remainder file D_R is much smaller than the original col-
lection, its size grows with the collection size, and therefore it may hinder
the scalability of the algorithm.
To overcome this issue, we introduce a partitioning scheme for the
remainder file. Given a user-defined parameter K, we split the range of
document identifiers into K equally sized chunks, so that the unindexed
portion of document d_i falls into the ⌊i/K⌋-th chunk. Consequently, we
modify the partitioning function which maps a key emitted by the map-
per to a reducer. We map each key ⟨i, j⟩ to the ⌊j/K⌋-th reducer instance,
i.e., the mapping is done on the basis of the MPD. This means that each
reducer will receive only weights associated with a portion of the document
space. Therefore, the reducer needs to retrieve and load in memory only
1/K-th of the remainder file D_R.
This new partitioning scheme spreads the weights associated with the
same LPD document over K different reducers. Therefore, to correctly
compute the final similarity, SSJ-2R needs a copy of the LPD document
Figure 3.5: SSJ-2R example (the remainder file, holding the pruned parts
d_1: "A A", d_3: "A", d_2: "B", is broadcast via the distributed cache; the
full documents are shuffled as ⟨⟨d_i, ∅⟩, d_i⟩ pairs, and the reducer out-
puts ⟨(d_1, d_3), 5⟩).
d_i at each of these reducers. For this reason, we replicate K times the
special key ⟨i, ∅⟩ and its associated value d_i, once for each partition.
The parameter K allows tuning the memory usage of SSJ-2R, and
controls the tradeoff with the communication cost due to replication.
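The modified partitioner can be sketched as follows (a sketch; document IDs are assumed to be integers, and the helper names are hypothetical):

```python
def reducer_for(key, k):
    """Route key ⟨i, j⟩ by the MPD's chunk ⌊j/K⌋, so each reducer needs
    only its 1/K-th slice of the remainder file in memory."""
    _, j = key
    return j // k

def replicate_special(i, doc, k):
    """Replicate the ⟨i, ∅⟩ record of the LPD once per partition, since
    its weights are now spread over K reducers."""
    return [((i, None), doc, chunk) for chunk in range(k)]
```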
3.5 Complexity analysis
In Section 2.2.1 we presented a few proposals for modeling MapReduce
algorithms (Afrati and Ullman, 2009; Karloff et al., 2010). Indeed, esti-
mating the cost of a MapReduce algorithm is quite difficult, because of
Table 3.1: Symbols and quantities.

Symbol   Description
d̄, d̂     Avg. and Max. document length
s̄, ŝ     Avg. and Max. signature length
p̄, p̂     Avg. and Max. inverted list length with Prefix-Filtering
l̄        Avg. inverted list length without pruning
r        Cost of a remote access to a document
R        Cost of retrieving the remainder file D_R
Table 3.2: Complexity analysis.

ALGORITHM   Map       Shuffle                  Reduce
ELSA        l̂²        |L| · l̄²                 d̄
VERN        ŝ · d̂     |D| · s̄ · d̄              p̂² · d̂
SSJ-2       p̂²        |L| · p̄²                 d̄ + r
SSJ-2R      p̂²        |L| · p̄² + |D| · d̄ + R   d̄
the inherent parallelism, the hidden cost of the shuffling phase, the over-
lap between computation and communication managed implicitly by the
framework, the non-determinism introduced by combiners, and so on.
We prefer to model separately the three main steps of a MapReduce
job: the Map and Reduce functions, and the volume of the data to be
shuffled. In particular, we take into consideration the cost associated
with the function instance with the largest input. This way, we roughly
estimate the maximum possible degree of parallelism and we are able to
compare the different algorithms in deeper detail.
We refer to Table 3.1 for the symbols used in this section. The quantity
l̄ can also be seen as the average frequency of the terms in the lexicon L.
Similarly, p̄ can also be seen as the number of signatures containing any
given term. Clearly, it holds that p̄ ≤ l̄.
The running time of ELSA is dominated by the similarity phase. Its
Map function generates every possible document pair for each inverted
list, with a cost of O(l̂²). Most of this large number of candidate pairs
has a similarity below the threshold. Furthermore, the presence of long
inverted lists introduces stragglers. Since reducers may start only after
every mapper has completed, mappers processing the longest lists keep
resources waiting idly. Since the shuffling step handles a single datum
for every generated pair, we can estimate its cost by summing over the
various inverted lists as O(|L| · l̄²). The Reduce function just sums up the
contributions of the various terms, with a cost of O(d̄).
VERN runs only one MR job. The Map replicates each document d
|S(d)| times, with a cost of O(ŝ · d̂). Similarly, we can estimate the shuffle
size as O(|D| · s̄ · d̄). The Reduce evaluates the similarity of every pair
of documents in its input, with cost O(p̂² · d̂). Due to Prefix-Filtering, the
maximum reducer input size is p̂, which is smaller than l̂.
By comparing the two algorithms, we notice the superiority of VERN
in the first two phases of the computation. Its Map function generates a
number of replicas of each document, rather than producing a quadratic
number of partial contributions. Considering that |L| · l̄ is equal to |D| · d̄,
and that l̄ is likely to be much larger than s̄, it is clear that ELSA has the
largest shuffle size. Conversely, the Reduce cost for VERN is quadratic
in the size of the pruned lists p̂.
Our proposed algorithms SSJ-2 and SSJ-2R have a structure simi-
lar to ELSA. Their Map cost O(p̂²) depends on the pruned inverted lists.
For SSJ-2R, we disregard the cost of creating replicas of the input
in the similarity Map function, since it is much cheaper than processing
the longest inverted list. They have similar shuffle sizes, respectively
O(|L| · p̄²) and O(|L| · p̄² + |D| · d̄ + R). In this case we need to con-
sider the cost of shuffling the input documents and the remainder file
generated by SSJ-2R. Note that we consider the remainder file as
a shuffle cost even though it is not handled by MapReduce. Finally, the
Reduce cost is different. In addition to the contributions coming from
the pruned inverted index, SSJ-2 needs to access documents remotely,
with a total cost of O(d̄ + r). For SSJ-2R the Reduce cost is O(d̄), since
the unindexed portion of the document is already loaded in memory at
the local node, delivered via the distributed file system.
Thanks to Prefix-Filtering, the Map cost of our proposed algorithms is
smaller than ELSA's, yet larger than VERN's. For the same reason, the shuf-
fle size of SSJ-2 is smaller than ELSA's but still larger than VERN's, if we
assume that p̄ is not significantly smaller than l̄. For SSJ-2R, the shuffle
size is probably its weakest point: shuffling the whole collection within
the MapReduce data flow increases the volume of intermediate data. The
Reduce function of SSJ-2 needs to remotely recover the documents be-
ing processed to retrieve their unindexed portions. This dramatically
increases its cost beyond ELSA's, but it can hardly be compared with VERN's.
SSJ-2R has about the same cost as ELSA, assuming that the remain-
der file is already loaded in memory.
It is difficult to combine the costs of the various phases of the four
algorithms, and therefore it is not possible to determine analytically which
is the best. We can conclude that the algorithm of Elsayed et al. is the worst
of the four, due to its non-exploitation of Prefix-Filtering, and that the
impact of the shuffling phase will determine the goodness of our proposed
algorithms. In the experimental section, we evaluate empirically the
efficiency of the four algorithms in the three steps of the MapReduce
framework.
3.6 Experimental evaluation
In this section we describe the performance evaluation of the algorithms.
We used several subsets of the TREC WT10G Web corpus. The original
dataset has 1,692,096 English-language documents, and the size of the
entire uncompressed collection is around 10GiB. In Table 3.3 we describe
the samples of the collection we used, ranging from 17k to 63k docu-
ments. We preprocessed the data to prepare it for analysis: we parsed
the dataset stripping HTML, removed stop-words, performed stemming
and vectorization of the input, and extracted the lexicon. We also sorted
the features inside each document in decreasing order of term frequency,
in order to effectively utilize Prefix-Filtering.
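The final vectorization step can be sketched as follows (a simplified sketch: stemming is omitted, and collection_freq, a map from term to its frequency in the collection, is an assumed input):

```python
def vectorize(tokens, stopwords, collection_freq):
    """Build a term-weight vector from a token list, dropping stop-words
    and ordering features by decreasing collection frequency, so that the
    most frequent terms occupy the prunable prefix for Prefix-Filtering."""
    counts = {}
    for t in tokens:
        if t not in stopwords:
            counts[t] = counts.get(t, 0) + 1
    return sorted(counts.items(), key=lambda kv: -collection_freq[kv[0]])
```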
We ran the experiments on a 5-node cluster. Each node was equipped
with two Intel Xeon E5520 CPUs @2.27GHz, with 8 virtual cores (for a
Table 3.3: Samples from the TREC WT10G collection.

                      D17K          D30K           D63K
# documents         17,024        30,683         63,126
# terms            183,467       297,227        580,915
# all pairs    289,816,576   941,446,489  3,984,891,876
# similar pairs     94,220       138,816        189,969
total of 80 virtual cores), a 2TiB disk, 8GiB of RAM, and Gigabit Ethernet.
We used one of the nodes to run Hadoop's master daemons (NameNode
and JobTracker); the rest were configured as slaves running DataNode
and TaskTracker daemons. Two of the virtual cores on each slave
machine were reserved to run the daemons, the rest were equally split
among map and reduce slots (7 each), for a total of 28 slots for each phase.
We tuned Hadoop's configuration as follows: we allocated 1 GiB of
memory to each daemon and 400 MiB to each task, changed the block
size of HDFS to 256 MiB and the file buffer size to 128 KiB, disabled
speculative execution, and enabled JVM reuse and map output compression.
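As an illustration, the tuning above corresponds roughly to the following property settings in the configuration files of Hadoop of that era (0.20.x-style property names; treat this as a sketch, not the exact configuration used in the experiments — the per-daemon heap is set via HADOOP_HEAPSIZE in hadoop-env.sh):

```xml
<!-- hdfs-site.xml: 256 MiB HDFS blocks -->
<property><name>dfs.block.size</name><value>268435456</value></property>

<!-- core-site.xml: 128 KiB file buffers -->
<property><name>io.file.buffer.size</name><value>131072</value></property>

<!-- mapred-site.xml: 400 MiB per task, no speculation, JVM reuse,
     compressed map output -->
<property><name>mapred.child.java.opts</name><value>-Xmx400m</value></property>
<property><name>mapred.map.tasks.speculative.execution</name><value>false</value></property>
<property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>-1</value></property>
<property><name>mapred.compress.map.output</name><value>true</value></property>
```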
For each algorithm, we wrote an appropriate combiner to reduce the
shuffle size (a combiner is a reduce-like function that runs inside the
mapper to aggregate partial results). For SSJ-2 and SSJ-2R the combiner
performs the sums of partial scores in the values, according to the same
logic used in the reducer. We also implemented raw comparators for every
key type used in order to get better performance (raw comparators
compare keys during sorting without deserializing them into objects).
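Illustratively, the combiner logic for SSJ-2 and SSJ-2R amounts to summing partial score contributions per document pair. The sketch below assumes a simplified key/value layout (the actual implementation is in Java on Hadoop; names here are hypothetical):

```python
# Sketch of the sum combiner for SSJ-2/SSJ-2R. The key identifies a
# document pair; the values are partial similarity contributions
# emitted by the mapper for individual shared terms.
def combiner(key, values):
    """Aggregate partial scores locally, before the shuffle."""
    yield key, sum(values)

# The reducer applies the same summation, then keeps only pairs
# whose final score reaches the similarity threshold.
def reducer(key, values, threshold=0.9):
    score = sum(values)
    if score >= threshold:
        yield key, score
```

Because the combiner and the reducer share the same associative logic, the combiner can be applied any number of times without changing the final result, which is exactly what MapReduce requires of a combiner.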
3.6.1 Running time
We compared the four algorithms described in the previous sections:
the two baselines, ELSA and VERN, and the two algorithms we propose,
SSJ-2 and SSJ-2R. VERN would require an additional step to remove
duplicate similar pairs. We did not need to implement this step thanks
to the strategy described in Section 3.3.2. Therefore, our analysis does not
take into account the overhead induced by such duplicate removal.
For SSJ-2R we used a partitioning factor of the remainder file K = 4.
Figure 3.6: Running time (seconds) as a function of the number of
vectors, for ELSA, VERN, SSJ-2 and SSJ-2R.
We enabled partitioning just for the sake of completeness, given that the
mappers did not come even close to filling the available memory.
We used 56 mappers and 28 reducers, so that the mappers finish in
two waves and all the reducers can start copying and sorting the partial
results while the second wave of mappers is running. For all the
experiments, we set the similarity threshold σ = 0.9.
Figure 3.6 shows the running times of the four different algorithms.
All of them have a quadratic step, so doubling the size of the input
roughly multiplies the running time by 4. Unexpectedly, VERN does not
improve significantly over ELSA, which means that Prefix-Filtering is not
fully exploited. Recall that ELSA uses simple Term-Filtering and compares
every document pair sharing at least one term. Both our proposed
algorithms outperform the two baselines. In particular, SSJ-2 is more than
twice as fast as VERN, and SSJ-2R is about 4.5 times faster. Notice that
SSJ-2R is twice as fast as SSJ-2, which means that the improvement
given by the use of the remainder file is significant.
We tried to fit a simple power-law model f(x) = a·x^b to the running
times of the algorithms. For all algorithms but VERN the exponent is
b ≈ 2.5, with a ≈ 10^-8 and ELSA having the largest constant. VERN has
a smaller constant a ≈ 10^-10 but a larger exponent b ≈ 2.9. However,
given the small number of data points available, the analysis is not
conclusive.
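Such a fit can be reproduced by ordinary least squares in log-log space, since log f(x) = log a + b log x. A minimal sketch, with illustrative (not the thesis') data points:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of f(x) = a * x**b, linearized in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    # Intercept gives log a.
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: exact power-law data is recovered exactly.
xs = [1e4, 2e4, 4e4]
a, b = fit_power_law(xs, [1e-8 * x ** 2.5 for x in xs])
```

With only three or four input sizes, as in the experiments, the confidence intervals on a and b are wide, which is why the analysis above is labeled inconclusive.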
3.6.2 Map phase
Figure 3.7: Average mapper completion time (seconds) as a function of
the number of vectors, for ELSA, VERN, SSJ-2 and SSJ-2R.
In Figure 3.7 we report the average map times for each algorithm.
Clearly VERN is the fastest: its mapper only replicates input documents.
The time required by the other algorithms grows quadratically, as
expected. The two algorithms we propose are two times faster than
ELSA for two main reasons.
Figure 3.8: Mapper completion time distribution (time in seconds per
mapper ID) for ELSA, VERN, and SSJ-2R with and without bucketing.
First, Prefix-Filtering reduces the maximum inverted list length. Even
a small improvement here brings a quadratic gain. Since Prefix-Filtering
removes the most frequent terms, the longest inverted lists are shortened
or even removed entirely. For D17K, the length of the longest inverted
list drops from 6600 to 1729 thanks to Prefix-Filtering, as shown in
Figure 3.9. The time to process such an inverted list decreases
dramatically, and the amount of intermediate data generated is also
reduced. Interestingly, Prefix-Filtering does not significantly affect the
index size: only 10% of the index is pruned, as reported in Table 3.4,
meaning that its effects are limited to shortening the longest inverted lists.
Second, we employ a bucketing technique to improve load balancing.
Since mappers process the inverted lists in chunks, the longest inverted
lists might end up in the same chunk and be processed by a single
mapper. SSJ-2 and SSJ-2R randomly hash the inverted lists into different
Table 3.4: Statistics for the four algorithms on the three datasets.
Dataset  Algorithm  # evaluated  Index size  Remainder  Shuffle    Running   Avg. map  Std. map  Avg. reduce  Std. reduce
                    pairs (M)    (MB)        size (MB)  size (GB)  time (s)  time (s)  time (%)  time (s)     time (%)

D17K     ELSA            109         46          -         3.3       2,187      276      127.50         49        16.68
         VERN            401          -          -         3.7       1,543       42       68.10        892        22.20
         SSJ-2            65         41          -         2.1         971      148       37.92        575         3.58
         SSJ-2R           65         41         4.7        2.6         364      122       26.50         49        15.50

D30K     ELSA            346         92          -        11.3       8,430    1,230      132.08         82        15.91
         VERN          1,586          -          -         8.3       5,313      112       68.10      3,847        13.94
         SSJ-2           224         82          -         8.1       3,862      635       32.06      2,183         5.81
         SSJ-2R          224         82         8.2       10.5       1,571      560       23.41        155        15.50

D63K     ELSA          1,519        189          -        49.2      51,540    7,013      136.39      1,136        11.84
         VERN          6,871          -          -        20.7      28,534      338       59.61     16,849         9.36
         SSJ-2         1,035        170          -        35.8      20,944    3,908       24.32     11,328         2.76
         SSJ-2R        1,035        170        15.6       49.7       9,745    3,704       20.26        846        12.11
buckets, so that the longest lists are likely spread among all the mappers.
Figure 3.8 shows the completion time of the mappers for the D17K
dataset. VERN has the lowest average mapper completion time, as all
the work is done in the reducer. However, due to skew in the input, the
slowest mapper in VERN is slower than in SSJ-2R with bucketing. This
can happen if there are a few very dense documents in the input, so that
VERN needs to create a large number of copies of each of these
documents. ELSA is in any case slower than SSJ-2R without bucketing
simply because of the maximum inverted list length. Moreover,
completion times are unevenly distributed in ELSA. This skew induces a
strong load imbalance, forcing all the reducers to wait for the slowest
mapper before starting. Bucketing solves this issue by evenly spreading
the load among mappers. As a result, the running time of the slowest
mapper is almost halved. The standard deviation of map completion
times for all algorithms is reported in Table 3.4.
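The bucketing idea can be sketched as follows: instead of processing inverted lists in contiguous input chunks, each list is assigned to a bucket by hashing its term, so long lists spread roughly uniformly over mappers. The function names and the choice of hash are illustrative assumptions:

```python
import hashlib

def bucket_of(term, num_buckets):
    """Assign an inverted list to a bucket by hashing its term.

    A stable hash makes the assignment deterministic across mappers;
    long lists end up scattered over buckets instead of clustering in
    one contiguous input chunk.
    """
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def partition_index(inverted_index, num_buckets):
    """Split a {term: postings} index into hash-balanced buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for term, postings in inverted_index.items():
        buckets[bucket_of(term, num_buckets)].append((term, postings))
    return buckets
```

Since the few very long lists are unlikely to hash into the same bucket, the per-mapper work becomes far more uniform, which is exactly the effect visible in Figure 3.8.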
Figure 3.9: Effect of Prefix-Filtering on the inverted list length
distribution (number of lists vs. inverted list length): the longest list
shrinks from 6600 (ELSA) to 1729 (SSJ-2R).
3.6.3 Shuffle size
Figure 3.10 shows the shuffle sizes of the various algorithms when
varying the input size. The shuffle size is affected by the combiners,
which compact the key-value pairs emitted by the mappers into partial
results by applying a reduce-like function. VERN has the smallest shuffle
size. We intuitively expected that the replication performed during the
map phase would generate a large amount of intermediate data.
Conversely, the algorithms based on the inverted index generate the
largest volume of data. This is consistent with the analysis in the
previous section, as VERN is the only algorithm without a quadratic
term in the shuffle size.
More specifically, SSJ-2 produces less data thanks to Prefix-Filtering.
However, SSJ-2R shuffles about the same data as ELSA, due to the
additional information needed and the replication introduced by the
partitioning of the remainder file. The shuffle size does not include the remainder
Figure 3.10: Shuffle size (GB) as a function of the number of vectors,
for ELSA, VERN, SSJ-2 and SSJ-2R.
file, which is anyway small enough to be negligible, as discussed below.
Although SSJ-2R and ELSA have the same shuffle size, our algorithm is
almost five times faster overall.
3.6.4 Reduce phase
Figure 3.11 shows the average running time of the reduce function.
VERN is the most expensive, as expected. While the other algorithms
take advantage of the MR infrastructure to perform partial similarity
computations, VERN just delivers to the reducer a collection of
potentially similar documents. The similarity computation is entirely
done at the reducer, with a cost quadratic in the number of documents
received. In addition, the same document pair is evaluated several times
at different reducers, if only to check whether the reducer is responsible
for computing the similarity, thus increasing the total cost of the reduce
phase.
Figure 3.11: Average reducer completion time (seconds) as a function of
the number of vectors, for ELSA, VERN, SSJ-2 and SSJ-2R.
SSJ-2 is the second most expensive, due to the remote retrieval of
documents from the distributed file system.
Both SSJ-2R and ELSA are much faster than the other two algorithms:
the reducer performs very little computation, as the contributions from
the various terms are simply summed up. The use of the remainder file
significantly speeds up SSJ-2R compared to SSJ-2: the remainder file is
quickly retrieved from the distributed file system, and no random
remote access is performed.
The performance of the reducer depends heavily on the number of
candidate pairs evaluated. As shown in Table 3.4, both SSJ-2 and SSJ-2R
evaluate about 33% fewer candidates than ELSA. VERN evaluates a
number of pairs more than 5 times larger than SSJ-2 and SSJ-2R because
of the replication of the input sent to the reducers. Even though all but
one of these pairs are discarded, they still need to be checked.
As reported in Table 3.4, all reduce times have small standard deviation,
which means that the load is well balanced across the reducers.
3.6.5 Partitioning the remainder file
Figure 3.12: Remainder file and shuffle size (kB) when varying the
number of partitions K, with the ideal remainder file size (proportional
to 1/K) for reference.
We have seen that using the remainder file in SSJ-2R almost halves
the running time of SSJ-2. This result leverages the fact that the portion
of information needed by every node can be easily broadcast via the
distributed file system rather than inside the MR data flow. In Table 3.4,
we report the size of the remainder file, which is always about 10% of
the inverted index when using a similarity threshold σ = 0.9. Such a
limited size was not an issue in our experimental setting. However, it
might become an issue with larger collections, or when many reducers
run on the same node and share the same memory.
Figure 3.12 illustrates a small experiment on a single node with the
D17K dataset, where we varied the number K of chunks of the remainder
file. Since partitioning is done in the document identifier space, the
chunks do not all have exactly the same size, which depends on the
distribution of document sizes. However, in our experiments the
variance is quite low. This proves experimentally that a reducer needs to
load in memory only a small chunk, thus allowing SSJ-2R to scale to
very large collections. Figure 3.12 also shows the trade-off between
shuffle size and remainder file size when increasing K: for each
additional chunk we need to shuffle one extra copy of the input, so the
shuffle size increases linearly with K.
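Partitioning in the document-identifier space can be sketched as a simple modular rule: chunk k holds the remainder-file entries of documents whose identifier is congruent to k modulo K, and a reducer only loads the chunk for its residue. This modular variant is one plausible realization (range partitioning of the identifier space would work as well); names are illustrative:

```python
def chunk_id(doc_id, K):
    """Chunk of the remainder file that holds this document's tail."""
    return doc_id % K

def split_remainder(remainder, K):
    """Split a {doc_id: unindexed_tail} remainder file into K chunks.

    Chunk sizes are only approximately equal: the number of entries per
    chunk is balanced, but the byte size depends on the distribution of
    document lengths across identifiers.
    """
    chunks = [dict() for _ in range(K)]
    for doc_id, tail in remainder.items():
        chunks[chunk_id(doc_id, K)][doc_id] = tail
    return chunks
```

Each reducer then holds roughly a 1/K fraction of the remainder file in memory, at the cost of shuffling one extra copy of the input per additional chunk, matching the linear growth in Figure 3.12.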
3.7 Conclusions
Finding similar items in a bag of documents is a challenging problem
that arises in many applications in the areas of Web mining and
information retrieval. The size of Web-related problems mandates the
use of parallel approaches in order to achieve reasonable computing
times.
In this chapter we presented two novel algorithms for the MapReduce
framework. We showed that designing efficient algorithms for MR is not
trivial and requires carefully blending several factors to effectively
capitalize on the available parallelism. We analyzed the complexity of
our algorithms, and compared them with the state of the art by
examining the map, shuffle and reduce phases. We validated this
theoretical analysis with experimental evidence collected on a sample of
the Web.
SSJ-2R borrows from previous approaches based on an inverted index,
and embeds pruning strategies that had so far been used only in
sequential algorithms. We exploited the underlying distributed filesystem
to support communication patterns that do not naturally fit the MR
framework. We also described a partitioning strategy that overcomes
memory limitations at the cost of an increased volume of intermediate
data. SSJ-2R achieves scalability without sacrificing precision by
exploiting each MR phase to perform useful work. Thanks to a careful
design that takes into account the specific properties of MR, SSJ-2R
outperforms the state of the art by a factor of about 4.5.
Chapter 4
Social Content Matching in
MapReduce
Matching is a classical problem in graph theory. It entails finding a
subset of edges that satisfies a bound on the number of shared vertices.
This kind of problem falls in the second category of the taxonomy
defined in Chapter 1 and requires iterative solutions. Matching problems
are ubiquitous: they occur in economic markets and Web advertising, to
name a few examples. In this chapter we focus on an application of
matching to social media. Our goal is to distribute content from
information suppliers to information consumers. We seek to maximize
the overall relevance of the matched content from suppliers to
consumers while regulating the overall activity, e.g., ensuring that no
consumer is overwhelmed with data and that all suppliers have a chance
to deliver their content.
We propose two MapReduce matching algorithms: GREEDYMR and
STACKMR. Both algorithms have provable approximation guarantees,
and in practice they produce high-quality solutions. While both
algorithms scale extremely well, we can show that STACKMR requires
only a poly-logarithmic number of MapReduce steps, making it an
attractive option for applications with very large datasets. We
experimentally show the trade-offs between quality and efficiency of our
solutions on two large datasets coming from real-world social media
Web sites.
4.1 Introduction
The last decade has witnessed a radical paradigm shift on how informa-
tion content is distributed among people. Traditionally, most of the in-
formation content has been produced by few specialized agents and con-
sumed by the big masses. Nowadays, an increasing number of platforms
allow everyone to participate both in information production and in in-
formation consumption. The phenomenon has been coined as democrati-
zation of content. The Internet, and its younger children, user-generated
content and social media, have had a major role in this paradigm shift.
Blogs, micro-blogs, social-bookmarking sites, photo-sharing systems,
and question-answering portals, are some of the social media that people
participate in as both information suppliers and information consumers.
In such social systems, not only consumers have many opportunities to
nd relevant content, but also suppliers have opportunities to nd the
right audience for their content and receive appropriate feedback. How-
ever, as the opportunities to nd relevant information and relevant au-
dience increase, so does the complexity of a system that would allow
suppliers and consumers to meet in the most efcient way.
Our motivation is building a featured-item component for social-media
applications. Such a component would provide recommendations to
consumers each time they log in to the system. For example, Flickr¹
displays photos to users when they enter their personal pages, while
Yahoo! Answers² displays questions that are still open for answering.
For consumers it is desirable that the recommendations are of high
quality and relevant to their interests. For suppliers it is desirable that
their content is delivered to consumers who are interested in it and may
provide useful feedback. In this way, both consumers and suppliers are
more satisfied by using the system and get the best out of it.
Naturally, we model this problem as a matching problem. We associate
a relevance score to each potential match of an item t to a user u. This
score can be seen as the weight of the edge (t, u) of the bipartite

¹ http://flickr.com
² http://answers.yahoo.com
graph between items and users. For each item t and each user u we also
consider constraints on the maximum number of matching edges that t
and u can participate in. These capacity constraints can be estimated
from the activity of each user and the relative frequency with which
items need to be delivered. The goal is to find a matching that satisfies
all capacity constraints and maximizes the total weight of the edges in
the matching. This problem is known as b-matching.
The b-matching problem can be solved exactly in polynomial time by
max-flow techniques. However, the fastest exact algorithms known
today have complexity O(nm) (Gabow, 1983; Goldberg and Rao, 1998)
for graphs with n nodes and m edges, and thus do not scale to large
datasets. Instead, in this work we focus on approximation algorithms
that are scalable to very large datasets. We propose two algorithms for
b-matching, STACKMR and GREEDYMR, which can be implemented
efficiently in the MapReduce paradigm. While both our algorithms have
provable approximation guarantees, they have different properties.
We design the STACKMR algorithm by drawing inspiration from
existing distributed algorithms for matching problems (Garrido et al.,
1996; Panconesi and Sozio, 2010). STACKMR is allowed to violate
capacity constraints by a factor of (1 + ε), and yields an approximation
guarantee of 1/(6 + ε), for any ε > 0. We show that STACKMR requires
a poly-logarithmic number of MapReduce steps. This makes STACKMR
appropriate for realistic scenarios with large datasets. We also study a
variant of STACKMR, called STACKGREEDYMR, in which we
incorporate a greedy heuristic in order to obtain higher-quality results.
On the other hand, GREEDYMR is simpler to implement and has the
desirable property that it can be stopped at any time to provide the
current best solution. GREEDYMR is a 1/2-approximation algorithm, so
it has a better quality guarantee than STACKMR. GREEDYMR also
yields better solutions in practice. However, it cannot guarantee a
poly-logarithmic number of steps. A simple example shows that
GREEDYMR may require a linear number of steps. Although
GREEDYMR is theoretically less attractive than STACKMR, in practice
it is a very efficient algorithm, and its performance is far from the worst
case.
Finally, we note that the b-matching algorithm takes as input the set of
candidate edges weighted by their relevance scores. In some cases this
set of candidate edges is small, for instance when items are
recommended only among friends in a social network. In other
applications any item can be delivered to any user, e.g., a user on Flickr
may view a photo of any other user. In the latter case the graph is not
explicit, and we need to operate on a bag of items and consumers.
Materializing all item-user edges is then an unfeasible task. Thus, we
equip our framework with a scheme that finds all edges with score
greater than some threshold σ, and we restrict the matching to those
edges. We use the SSJ-2R algorithm presented in Chapter 3 to solve the
problem of finding all similar item-user pairs efficiently in MapReduce.
Our main contributions are the following.
- We investigate the problem of b-matching in the context of social
content distribution, and devise a MapReduce framework to address it.
- We develop STACKMR, an efficient variant of the algorithm presented
by Panconesi and Sozio (2010). We demonstrate how to adapt such an
algorithm to MapReduce, while requiring only a poly-logarithmic
number of steps. Our experiments show that STACKMR scales
excellently to very large datasets.
- We introduce GREEDYMR, a MapReduce adaptation of a classical
greedy algorithm. It has a 1/2-approximation guarantee, and is very
efficient in practice.
- We employ SSJ-2R to build the input graph for the b-matching.
- We perform a thorough experimental evaluation using large datasets
extracted from real-world scenarios.
The rest of the chapter is organized as follows. In Section 4.3 we
formally define the graph matching problem. In Section 4.4 we present
the social content distribution scenario that we consider. In Section 4.5
we discuss the algorithms and their MR implementation. Finally, we
present our experimental evaluation in Section 4.6.
4.2 Related work
The general problem of assigning entities to users so as to satisfy some
constraints on the overall assignment arises in many different research
areas of computer science. Entities could be advertisements (Charles
et al., 2010), items in an auction (Penn and Tennenholtz, 2000), scientific
papers (Garg et al., 2010) or media content, as in our case. The
b-matching problem also finds applications in machine learning (Jebara
et al., 2009), and in particular in spectral clustering (Jebara and
Shchogolev, 2006).
The weighted b-matching problem can be solved in polynomial time by
employing maximum-flow techniques (Gabow, 1983; Goldberg and Rao,
1998). In any case, the time complexity is still superlinear in the worst
case. Christiano et al. (2010) have recently developed a faster
approximation algorithm based on electrical flows.
In a distributed environment, there are some results for the unweighted
version of the (simple) matching problem (Fischer et al., 1993; Garrido
et al., 1996), while for the weighted case the approximation guarantee
has progressively improved from 1/5 (Wattenhofer and Wattenhofer,
2004) to 1/2 (Lotker et al., 2008). For distributed weighted b-matching,
a 1/2-approximation algorithm was developed by Koufogiannakis and
Young (2009). However, a MapReduce implementation is non-obvious.
Lattanzi et al. (2011) recently proposed a MapReduce algorithm for
b-matching.
4.3 Problem definition
In this section we introduce our notation and provide our problem
formulation. We are given a set of content items T = {t_1, ..., t_n},
which are to be delivered to a set of consumers C = {c_1, ..., c_m}. For
each t_i and c_j, we assume we are able to measure the interest of
consumer c_j in item t_i with a positive weight w(t_i, c_j). The
distribution of the items T to the consumers C can be clearly seen as a
matching problem on the bipartite graph with nodes T and C, and edge
weights w(t_i, c_j).
In order to avoid each consumer c_j receiving too many items, we
enforce a capacity constraint b(c_j) on the number of items that are
matched to c_j. Similarly, we would like to avoid the scenario in which
only a few items (e.g., the most popular ones) participate in the
matching. To this end, we introduce a capacity constraint b(t_i) on the
number of consumers that each item t_i is matched to.
This variant of the matching problem is well known in the theoretical
computer science community as the b-matching problem, which is
defined as follows. We are given an undirected graph G = (V, E), a
function b : V → N expressing node capacities (or budgets) and another
function w : E → R+ expressing edge weights. A b-matching in G is a
subset of E such that for each node v ∈ V at most b(v) edges incident to
v are in the matching. We wish to find a b-matching of maximum
weight.
Although our algorithms work with any undirected graph, we focus on
bipartite graphs, which are relevant to our application scenarios. The
problem we consider in the rest of the chapter is defined as follows.
Problem 2 (Social Content Matching Problem). We are given an
undirected bipartite graph G = (T, C, E), where T represents a set of
items and C represents a set of consumers, a weight function
w : E → R+, as well as a capacity function b : T ∪ C → N. A
b-matching in G is a subset of E such that for each node v ∈ T ∪ C at
most b(v) edges incident to v are in the matching. We wish to find a
b-matching of maximum weight.
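To make the definition concrete, the following sketch checks whether a given edge set is a feasible b-matching and computes its weight. It is a brute-force verifier for illustration, not a solver:

```python
from collections import Counter

def is_b_matching(edges, b):
    """Feasibility check: each node v occurs in at most b[v] edges.

    edges -- iterable of (u, v) pairs
    b     -- dict mapping each node to its capacity b(v)
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return all(degree[v] <= b[v] for v in degree)

def matching_weight(edges, w):
    """Total weight of the matching under edge weights w."""
    return sum(w[e] for e in edges)
```

The optimization problem then asks for the feasible edge subset maximizing `matching_weight`; the algorithms in Section 4.5 approximate this maximum.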
4.4 Application scenarios
To instantiate the problem we just defined, we need to (i) define the
weights w(t_i, c_j) between items t_i and consumers c_j, (ii) decide the
set of potential edges that participate in the matching, and (iii) define
the capacity constraints b(t_i) and b(c_j). In our work we focus only on
the matching algorithm, and we assume that addressing the details of
the above questions depends on the application. However, for
completeness we discuss our thoughts on these issues.
Scenario. We envision a scenario in which an application operates in
consecutive phases. Depending on the dynamics of the application, the
duration of each phase may range from hours to days. Before the
beginning of the i-th phase, the application makes a tentative allocation
of which items will be delivered to which consumers during the i-th
phase. The items that participate in this allocation, i.e., the set T of
Problem 2, are those that have been produced during the (i-1)-th phase,
and perhaps other items that have not been distributed in previous
phases.
Edge weights. A simple approach is to represent items and consumers
in a vector space, i.e., items t_i and consumers c_j are represented by
feature vectors v(t_i) and v(c_j). Then we can define the edge weight
w(t_i, c_j) using the cosine similarity
w(t_i, c_j) = v(t_i) · v(c_j) / (|v(t_i)| |v(c_j)|).
Potentially, more complex similarity functions can be used. Borrowing
ideas from information retrieval, the features in the vector representation
can be weighted by tf·idf scores. Alternatively, the weights w(t_i, c_j)
could be the output of a recommendation system that takes into account
user preferences and user activities.
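In this vector-space instantiation, the edge weight is plain cosine similarity between the two feature vectors. A sketch over sparse vectors represented as dicts:

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors (dicts: feature -> weight)."""
    dot = sum(x * v2.get(f, 0.0) for f, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)
```

With tf·idf-weighted features, this is exactly the similarity that SSJ-2R computes in Chapter 3, which is why the same algorithm can be reused to build the candidate edges.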
Candidate edges. With respect to deciding which edges to consider for
matching, the simplest approach is to consider all possible pairs
(t_i, c_j). This is particularly attractive, since we leave the decision of
selecting edges entirely to the matching algorithm. However, considering
O(|T||C|) edges makes the system highly inefficient. Thus, we opt for
methods that prune the number of candidate edges. Our approach is to
consider as candidates only edges whose weight w(t_i, c_j) is above a
threshold σ. The rationale is that since the matching algorithm will seek
to maximize the total edge weight, we preferably discard low-weight
edges.
We note that, depending on the application, there may be other ways to
define the set of candidate edges. For example, on social-networking
sites it is common for consumers to subscribe to suppliers they are
interested in. In such an application, we consider only candidate edges
(t_i, c_j) for which c_j has subscribed to the supplier of t_i.
Capacity constraints. The consumer capacity constraints express the
number of items that need to be displayed to each consumer. For
example, if we display one different item to a consumer each time they
access the application, b(c_j) can be set to an estimate of the number of
times that consumer c_j will access the application during the i-th phase.
Such an estimate can be obtained from log data.
For the item capacity constraints, we observe that B = Σ_{c∈C} b(c) is
an upper bound on the total number of distributed items, so we require
B = Σ_{t∈T} b(t) as well. Now we distinguish two cases, depending on
whether there is a quality assessment on the items T or not. If there is
no quality assessment, all items are considered equivalent, and the total
distribution bandwidth B can be divided equally among all items, so
b(t) = max{1, B/|T|}, for all t in T.
If, on the other hand, there is a quality assessment on the items T, we
assume a quality estimate q(t) for each item t. Such an estimate can be
computed using a machine-learning approach, such as the one proposed
by Agichtein et al. (2008), which involves various features like content,
links, and reputation. Without loss of generality we assume normalized
scores, i.e., Σ_{t∈T} q(t) = 1. We can then divide the total distribution
bandwidth B among all items in proportion to their quality score, so
b(t) = max{1, q(t) · B}. In a real-application scenario, the designers of
the application may want to control the function q(t) so that it satisfies
certain properties, for instance, that it follows a power-law distribution.
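The two capacity rules above can be sketched directly; here q is a normalized quality score, B the total distribution bandwidth, and the integer rounding is an implementation assumption:

```python
def uniform_capacities(items, B):
    """No quality assessment: b(t) = max(1, B / |T|) for every item."""
    return {t: max(1, B // len(items)) for t in items}

def quality_capacities(q, B):
    """Quality-weighted: b(t) = max(1, q(t) * B), with q summing to 1."""
    return {t: max(1, int(q[t] * B)) for t in q}
```

The max with 1 guarantees that every item gets at least one delivery slot, so no supplier is completely shut out of the matching.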
4.5 Algorithms
4.5.1 Computing the set of candidate edges
The first step of our algorithm is to compute the set of candidate edges,
which in Section 4.4 were defined to be the edges with weight
w(t_i, c_j) above a threshold σ. This step is crucial in order to avoid
considering O(|T||C|) edges, which would make the algorithm
impractical.
As we have already seen in the previous chapter, the problem of finding
all the pairs of t_i ∈ T and c_j ∈ C such that w(t_i, c_j) ≥ σ is known
as the similarity join problem. Since we aim at developing the complete
system in the MapReduce framework, we naturally take advantage of
SSJ-2R.
In particular, we adapt SSJ-2R to compute the similarity between
item-consumer pairs. First, we build a document bag by interpreting the
items t_i and the consumers c_j as documents via their vector
representation. Second, SSJ-2R can be trivially modified to join the two
sets T and C without considering pairs between two items or two
consumers.
4.5.2 The STACKMR algorithm
Our first matching algorithm, STACKMR, is a variant of the algorithm
developed by Panconesi and Sozio (2010). Panconesi and Sozio propose
a complex mechanism to ensure that capacity constraints are satisfied,
which unfortunately does not seem to have an efficient implementation
in MapReduce. Here we devise a more practical variant that allows node
capacities to be violated by a factor of at most (1 + ε), for any ε > 0.
This approximation is a small price to pay in our application scenarios,
where small capacity violations can be tolerated.
For the sake of presentation, we describe our algorithm first in a
centralized environment and then in a parallel environment. Pseudocode
for STACKMR and for the algorithm by Panconesi and Sozio is shown
in Algorithms 1 and 2, respectively. The latter has been slightly changed
to take implementation issues into account. However, we do not include
an evaluation of the latter algorithm, as it does not seem to be efficient.
In the next section we describe in detail how to implement the former
algorithm in MapReduce.
Our algorithm is based on the primal-dual schema, a successful and
established technique to develop approximation algorithms. The
primal-dual schema has proved to play an important role in the design
of sequential and distributed approximation algorithms. We hold the
belief that primal-dual algorithms bear the potential of playing an
important role in the design of MapReduce algorithms as well.
The first step of any primal-dual algorithm is to formulate the problem
at hand as an integer linear program (IP). Each element that might be
included in a solution is associated with a 0-1 variable. The
combinatorial structure of the problem is captured by an objective
function and a set of constraints, both being linear combinations of the
binary variables. Then, integrality constraints are relaxed so that
variables can take any value in the range [0, 1]. This linear program is
called the primal. From the primal program we derive the so-called
dual. There is a direct correspondence between the variables of the
primal and the constraints of the dual, as well as between the variables
of the dual and the constraints of the primal.
Algorithm 1 STACKMR: violating capacity constraints by a factor of at
most (1 + ε)
1: /* Pushing Stage */
2: while E is non-empty do
3:   Compute a maximal ⌈εb⌉-matching M (each vertex v has capacity
     ⌈εb(v)⌉), using the procedure in Garrido et al. (1996);
4:   Push all edges of M on the distributed stack (M becomes a layer
     of the stack);
5:   for all e ∈ M in parallel do
6:     Let δ(e) = (w(e) − y_u/b(u) − y_v/b(v)) / 2;
7:     Increase y_u and y_v by δ(e);
8:   end for
9:   Update E by eliminating all edges that have become weakly covered;
10: end while
11: /* Popping Stage */
12: while the distributed stack is non-empty do
13:   Pop a layer M out of the distributed stack.
14:   In parallel, include all edges of M in the solution.
15:   For each vertex v: update its residual capacity, and remove from the
      stack the edges of vertices whose capacity is saturated or violated.
16: end while

The b-matching problem is formulated as the following integer program:

maximize    Σ_{e∈E} w(e) x_e    (IP)
such that
    Σ_{e∈E : v∈e} x_e ≤ b(v)    ∀v ∈ V,    (4.1)

where x_e ∈ {0, 1} is associated to edge e, and a value of 1 means that e
belongs to the solution. The dual program is as follows.

minimize    Σ_{v∈V} y_v    (DP)
such that
    y_u/b(u) + y_v/b(v) ≥ w(e)    ∀e = (u, v) ∈ E,    (4.2)
    y_v ≥ 0    ∀v ∈ V.    (4.3)
Dual constraints (4.2) are associated with edges. An edge is said to
be covered if its corresponding constraint is satisfied with equality. The
variables occurring in such a constraint are referred to as e's dual variables
and play an important role in the execution of the algorithm.
The centralized algorithm consists of two phases: a push phase, where
edges are pushed on a stack in arbitrary order, and a pop phase, where
edges are popped from the stack and a feasible solution is computed.
When an edge e = (u, v) is pushed on the stack, each of its dual variables
is increased by the same amount δ(e), so as to satisfy Equation (4.2) with
equality. The amount δ(e) is derived from Equation (4.2) as

    δ(e) = (w(e) − y_u/b(u) − y_v/b(v)) / 2.    (4.4)

Whenever edges become covered they are deleted from the input graph.
The push phase terminates when no edge is left. In the pop phase, edges
are successively popped out of the stack and included in the solution if
feasibility is maintained.
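In a centralized setting, the two phases can be sketched in a few lines (an illustrative Python sketch, not the MapReduce implementation; function and variable names are our own, and the weakly-covered test anticipates Definition 1 below):

```python
from collections import defaultdict

def stack_bmatching(edges, w, b, eps):
    """Centralized push/pop sketch of the stack-based b-matching.

    edges: list of (u, v) pairs; w: dict edge -> weight;
    b: dict node -> capacity; eps: slackness parameter."""
    y = defaultdict(float)  # dual variables, one per node
    stack = []
    E = set(edges)
    # Push phase: raise the duals of one edge at a time until
    # every remaining edge becomes weakly covered.
    while E:
        e = next(iter(E))
        u, v = e
        delta = (w[e] - y[u] / b[u] - y[v] / b[v]) / 2
        y[u] += delta
        y[v] += delta
        stack.append(e)
        E = {f for f in E
             if y[f[0]] / b[f[0]] + y[f[1]] / b[f[1]]
             < w[f] / (3 + 2 * eps)}
    # Pop phase: unwind the stack, keeping the solution feasible.
    cap = dict(b)
    solution = []
    for u, v in reversed(stack):
        if cap[u] > 0 and cap[v] > 0:
            solution.append((u, v))
            cap[u] -= 1
            cap[v] -= 1
    return solution
```

On a tiny path graph a-b-c with w(a, b) > w(b, c) and unit capacities, the heavier edge ends up in the solution regardless of push order.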
In a parallel environment, we wish to parallelize as many operations
as possible so as to ensure poly-logarithmic running time. Thus, we need
a mechanism to bound the number of push and pop steps, which in the
centralized algorithm may be linear in the number of edges. This is done
by computing at each step a maximal ⌈εb⌉-matching. Note the difference
between maximum and maximal: a b-matching is maximal if and only
if it is not properly contained in any other b-matching. All edges in a
maximal set, called a layer of the stack, are pushed on the stack in parallel.
In the popping phase, all edges within the same layer are popped out of
the stack and included in the solution in parallel. Edges of nodes whose
capacity constraints are satisfied or violated are deleted from the stack
and ignored from further consideration. A maximal b-matching can be
computed efficiently in MapReduce, as we will discuss in Section 4.5.3.
Unfortunately, the total number of layers may still be linear in the
maximum degree of a node. To circumvent this problem, we introduce
the definition of weakly covered edges. Roughly speaking, a weakly cov-
ered edge is an edge whose constraint is only partially satisfied, and
thus gets covered after a small number of iterations.

Definition 1. (Weakly covered edges) Given ε > 0, at any time during the
execution of our algorithm we say that an edge e ∈ E is weakly covered if
constraint (4.2) for e = (u, v) is such that

    ȳ_u/b(u) + ȳ_v/b(v) ≥ w(e) / (3 + 2ε),    (4.5)

where ȳ denotes the current value of y.

Observe that Equation (4.5) is derived from Equation (4.2).
To summarize, our parallel algorithm proceeds as follows. At each
step of the push phase, we compute a maximal ⌈εb⌉-matching using the
procedure by Garrido et al. All the edges in the maximal matching are
then pushed on the stack in parallel, forming a layer of the stack. For each
of these edges we increase each of its dual variables by δ(e) in parallel.
Some edges might then become weakly covered and are deleted from
the input graph. The push phase is executed until no edge is left.
At the end of the push phase, layers are iteratively popped out of
the stack and edges within the same layer are included in the solution in
parallel. This can violate node capacities by a factor of at most (1 + ε), as
every layer contains at most ⌈εb(v)⌉ edges incident to any node v.

Figure 4.1: Example of a STACKMR run.

Edges
of nodes whose capacity constraints are satisfied or violated are deleted
from the stack and ignored from further consideration. This phase is
iterated until the stack becomes empty.
Figure 4.1 shows an example of a run of the STACKMR algorithm.
On the top left we have the input: an undirected weighted graph. In
the middle column the push phase builds the stack layer by layer, going
upwards. Each layer is a maximal ⌈εb⌉-matching. In the right column
the pop phase consumes the stack in reverse order and merges each layer
to create the final solution. Edges in lower layers might get discarded if
capacities are saturated. The output of the algorithm is shown on the
bottom left, where dashed edges have been excluded by STACKMR.
We can show that the approximation guarantee of our algorithm is
1/(6 + ε), for every ε > 0. Moreover, we can show that the push phase is
iterated at most O(log(w_max/w_min)) times, where w_max and w_min are
the maximum and minimum weight of any edge in input, respectively.
This fact, together with the fact that the procedure by Garrido et al. (1996)
requires O(log³ n) rounds, implies the following theorem.

Theorem 1. Algorithm 1 has an approximation guarantee of 1/(6 + ε) and
violates capacity constraints by a factor of at most 1 + ε. It requires
O((log³ n / ε²) · log(w_max/w_min)) communication rounds, with high
probability.

The non-determinism follows from the algorithm that computes max-
imal b-matchings. The proof of Theorem 1 is similar to the one given
by Panconesi and Sozio. STACKMR is a factor 1/ε faster than the original.
4.5.3 Adaptation in MapReduce
The distributed algorithm described in the previous section works in an
iterative fashion. In each iteration we first compute a maximal matching,
then we push it on a stack and we update the edges; finally, we pop all levels
from the stack. Below we describe how to implement these steps in MapReduce.
Maximal matching. To find maximal b-matchings we employ the al-
gorithm of Garrido et al., which is an iterative probabilistic algorithm.
Each iteration consists of four stages: (i) marking, (ii) selection, (iii)
matching, and (iv) cleanup.
In the marking stage, each node v marks randomly ⌈½b(v)⌉ of its
incident edges. In the selection stage, each node v selects randomly
max{⌈½b(v)⌉, 1} edges from those marked by its neighbors. Call F this
set of selected edges. In the matching stage, if some node v has ca-
pacity b(v) = 1 and two incident edges in F, it randomly deletes one of
them. At this point the set F is a valid b-matching. The set F is added
to the solution and removed from the original graph. In the cleanup
stage, each node updates its capacity in order to take into consideration
the edges in F, and saturated nodes are removed from the graph. These
stages are iterated until there are no more edges left in the original graph.
The process requires, in expectation, O(log³ n) iterations to terminate.
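The four stages can be sketched as follows (an illustrative Python sketch; stage (iii) is generalized to trim any capacity violation rather than only the b(v) = 1 case, so this is not the exact procedure of Garrido et al.):

```python
import math
import random

def maximal_bmatching(adj, b, seed=0):
    """Compute a maximal b-matching by iterating the four stages.

    adj: dict node -> set of neighbors; b: dict node -> capacity."""
    rnd = random.Random(seed)
    adj = {v: set(ns) for v, ns in adj.items()}
    cap = dict(b)
    matching = set()
    while any(adj.values()):
        # (i) Marking: each node marks ceil(cap(v)/2) incident edges at random.
        marked = {}
        for v in sorted(adj):
            k = min(len(adj[v]), math.ceil(cap[v] / 2))
            marked[v] = set(rnd.sample(sorted(adj[v]), k))
        # (ii) Selection: each node selects up to max(ceil(cap(v)/2), 1)
        # edges among those marked by its neighbors; F collects them.
        F = set()
        for v in sorted(adj):
            proposals = sorted(u for u in adj[v] if v in marked.get(u, ()))
            k = min(len(proposals), max(math.ceil(cap[v] / 2), 1))
            F.update(frozenset((u, v)) for u in rnd.sample(proposals, k))
        # (iii) Matching: trim F so no node exceeds its residual capacity.
        for v in sorted(adj):
            mine = sorted((e for e in F if v in e), key=sorted)
            while len(mine) > cap[v]:
                F.discard(mine.pop())
        # (iv) Cleanup: move F into the matching, update capacities,
        # and drop saturated nodes from the graph.
        for e in F:
            u, v = sorted(e)
            matching.add((u, v))
            adj[u].discard(v)
            adj[v].discard(u)
            cap[u] -= 1
            cap[v] -= 1
        for v in [x for x in adj if cap[x] == 0]:
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
            del cap[v]
    return matching
```

The result respects all capacities and is maximal: every edge left out has at least one saturated endpoint.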
To adapt this algorithm in MapReduce, we need one job for each stage
of the algorithm. The input and output of each MapReduce job is al-
ways in the same format: a consistent view of the graph represented as
adjacency lists. We maintain a node-based representation of the graph
because we need to make decisions based on the local neighborhood of
each node. Assuming the set of nodes adjacent to v_i is v_j, ..., v_k, the
input and output of each job is a list of pairs ⟨v_i, [(v_j, T_ij), ..., (v_k, T_ik)]⟩,
where v_i is the key and [(v_j, T_ij), ..., (v_k, T_ik)] the associated value. The
variables T represent the state of each edge. We consider five possible
states of an edge: E, in the main graph; K, marked; F, selected; D, deleted;
and M, in the matching. The four stages of the algorithm are repeated until
all the edges are in the matching (M) or are deleted (D).
The general signature of each job is as follows.

Map:    ⟨v_i, [(v_j, T_ij), ..., (v_k, T_ik)]⟩ →
        [⟨v_i, T^(i)_ij⟩, ⟨v_j, T^(i)_ij⟩, ..., ⟨v_i, T^(i)_ik⟩, ⟨v_k, T^(i)_ik⟩]

Reduce: [⟨v_i, T^(i)_ij⟩, ⟨v_i, T^(j)_ij⟩, ..., ⟨v_i, T^(i)_ik⟩, ⟨v_i, T^(k)_ik⟩] →
        ⟨v_i, [(v_j, T_ij), ..., (v_k, T_ik)]⟩

where T^(i)_ij is the state of edge (v_i, v_j) as locally viewed by v_i, and T_ij is
the final state after unification of the views from v_i and v_j.
Each Map function alters the state of the graph locally at each node.
Each Reduce function unifies the diverging views of the graph at each
node. For each edge (v_i, v_j), each Map function will emit both v_i and v_j
as keys, together with the current state of the edge as value. The Reduce
function will receive the views of the state of the edge from both end-
points, and will unify them, yielding a consistent graph representation.
Unification is performed by a simple precedence rule: if a ≺ b, the algo-
rithm unifies two diverging states (a, b) to state b. Only a subset of all
the possible combinations of states may actually happen, so we need to
specify only a few rules, as follows:

    E ≺ K ≺ F;    F ≺ M;    E ≺ D.
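The unification step itself is tiny (a sketch; the `PRECEDES` table encodes exactly the rules above, with E ≺ F implied by transitivity):

```python
# Precedence rules: E ≺ K ≺ F; F ≺ M; E ≺ D.
# unify(a, b) returns the state that wins when two views diverge.
PRECEDES = {('E', 'K'), ('E', 'F'), ('K', 'F'), ('F', 'M'), ('E', 'D')}

def unify(a, b):
    if a == b:
        return a
    if (a, b) in PRECEDES:
        return b
    if (b, a) in PRECEDES:
        return a
    raise ValueError(f"no rule for diverging states {a!r}, {b!r}")
```

The reducer simply folds `unify` over the two views it receives for each edge.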
Each MapReduce job uses the same communication pattern and state
unification rules; the jobs differ only in the way they update the state of
the edges. The communication cost of each job is thus O(|E|), while the
achievable degree of parallelism is O(|V|). Figure 4.2 illustrates the com-
munication pattern, which is the same for each vertex. Consider the
central node: for each incident edge, the mapper emits the edge state as
viewed by the node twice, once for each end of the edge. In the picture
the state view is represented by the arrows, which are color-coded to
represent their origin and point to their destination. For each incident
edge, the reducer receives two different views of the state of the edge,
one from each endpoint. The reducer reconciles the two views by
applying the unification rules to obtain a consistent view of the graph.

Figure 4.2: Communication pattern for iterative graph algorithms on MR.
Push, update, and pop. The basic communication scheme for the push,
update, and pop phases is the same as the one for computing the maximal
matching. We maintain the same invariant of representing the graph
from Map to Reduce functions and in between consecutive iterations.
For these phases of the algorithm, we maintain a separate state for each
edge. The possible states of an edge are: E, edge in the graph; S, edge
stacked; R, edge removed from the graph; and I, edge included in the
solution. For each edge we also maintain an integer variable that
represents the stack level in which the edge has been put.
During the push phase, for each edge included in the maximal match-
ing, we set its state to S and record the corresponding stack level. The
update phase is needed to propagate the δ(e) contributions. Each edge
sent to a node v carries the value y_u/b(u) of its sending node u. Thus,
each node can compute the new δ(e) and update its local y_v. This phase
also removes weakly covered edges by setting their state to R, and
updates the capacities of the nodes for the next maximal-matching phase.
Removed edges are not considered for the next maximal-matching phase.
The pop phase starts when all the edges in the graph are either stacked
(S) or removed (R). During the pop phase, each stacked (S) edge in
the current level (starting from the topmost) is included in the solution
by setting its state to I. The capacities are locally updated, and nodes
(and all incident edges) are removed when their capacity becomes non-
positive. Overall, there is one MapReduce step for each stack level.
The decision to change the state of the edge during these phases de-
pends only on the current state of the edge. The only local decision is
edge removal when the vertex capacity gets saturated. Therefore, we
only need to deal with diverging views related to the removed (R) state:
    E ≺ R;    S ≺ R;    I ≺ R.
4.5.4 The GREEDYMR algorithm
In this section we present a second matching algorithm based on a greedy
strategy: GREEDYMR. As previously, we analyze the centralized version
and then we show how to adapt it to MapReduce.
The centralized greedy algorithm processes each edge sequentially, in
order of decreasing weight. It includes an edge e = (u, v) in the solution if
b(u) > 0 and b(v) > 0. In this case, it subtracts 1 from both b(u) and b(v).
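The centralized procedure amounts to one pass over the edges sorted by decreasing weight (a minimal Python sketch; edges are assumed given as (weight, u, v) tuples):

```python
def greedy_bmatching(edges, b):
    """Greedy weighted b-matching.

    edges: list of (weight, u, v); b: dict node -> capacity."""
    cap = dict(b)
    solution = []
    for w, u, v in sorted(edges, reverse=True):  # decreasing weight
        if cap[u] > 0 and cap[v] > 0:
            solution.append((u, v))
            cap[u] -= 1
            cap[v] -= 1
    return solution
```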
It is immediate that the greedy algorithm produces a feasible solution.
In addition, it has a factor-½ approximation guarantee. We believe that
this is a well-known result; however, we were not able to find a reference.
Thus, for completeness, we include a proof in Section 4.5.5.
GREEDYMR is a MapReduce adaptation of this centralized algorithm.
We note that the adaptation is not straightforward due to the access to the
globally-shared variables b(v) that hold node capacities.
GREEDYMR works as follows. In the map phase each node v proposes
its b(v) edges with maximum weight to its neighbors. In the reduce
phase, each node computes the intersection between its own proposals
and the proposals of its neighbors. The set of edges in the intersection is
included in the solution. Then, each node updates its capacity. If it
becomes 0, the node is removed from the graph. Pseudocode for
GREEDYMR is shown in Algorithm 3.
In contrast with STACKMR, GREEDYMR is not guaranteed to ter-
minate in a poly-logarithmic number of iterations. As a simple worst-
case input instance, consider a path graph u_1u_2, u_2u_3, ..., u_{k−1}u_k such
that w(u_i, u_{i+1}) ≥ w(u_{i+1}, u_{i+2}). GREEDYMR will face a chain of cascading
updates that will cause a linear number of MapReduce iterations. How-
ever, as shown in Section 4.6, in practice GREEDYMR yields quite com-
petitive results compared to STACKMR.
Finally, GREEDYMR maintains a feasible solution at each step. The
advantage is that it can be terminated at any time and return the cur-
rent solution. This property makes GREEDYMR especially attractive in
our application scenarios, where content can be delivered to the users
immediately while the algorithm continues running in the background.
Algorithm 3 GREEDYMR
1: while E is non-empty do
2:   for all v ∈ V in parallel do
3:     Let L_v be the set of b(v) edges incident to v with maximum weight;
4:     Let F be L_v ∩ L_U, where U = {u ∈ V : e(v, u) ∈ E} is the set of
       vertexes sharing an edge with v;
5:     Update M ← M ∪ F;
6:     Update E ← E \ F;
7:     Update b(v) ← b(v) − b_F(v);
8:     If b(v) = 0, remove v from V and remove all edges incident to v from E;
9:   end for
10: end while
11: return M;
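A single propose/intersect/update round of this scheme might look as follows (an illustrative Python sketch, not tied to any MapReduce framework; names are our own):

```python
def greedymr_round(adj, w, cap):
    """One GREEDYMR round, mutating adj and cap in place.

    adj: node -> set of neighbors; w: frozenset edge -> weight;
    cap: node -> residual capacity. Returns edges entering the solution."""
    # Map: each node proposes its cap(v) heaviest incident edges.
    proposals = {
        v: set(sorted(adj[v], key=lambda u: w[frozenset((u, v))],
                      reverse=True)[:cap[v]])
        for v in adj
    }
    # Reduce: an edge is included iff both endpoints proposed it.
    included = set()
    for v in adj:
        for u in proposals[v]:
            if v in proposals[u]:
                included.add(frozenset((u, v)))
    # Update: decrement capacities, drop matched edges and saturated nodes.
    for e in included:
        u, v = tuple(e)
        adj[u].discard(v)
        adj[v].discard(u)
        cap[u] -= 1
        cap[v] -= 1
    for v in [x for x in adj if cap[x] == 0]:
        for u in adj.pop(v):
            if u in adj:
                adj[u].discard(v)
    return included
```

On a path a-b-c with w(a, b) > w(b, c) and unit capacities, the first round matches only (a, b): both endpoints propose it, while b does not propose (b, c).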
4.5.5 Analysis of the GREEDYMR algorithm
The following theorem proves the approximation guarantee of GREEDYMR.
We believe this result to be well-known; however, we could not find
a reference. Thus, we give a proof for completeness and self-containment.
The greedy algorithm can be equivalently described as the following
process: at each step, each node v marks an edge with largest weight.
If an edge is marked by both its end nodes, then it enters the solution and
the residual node capacities are updated. As soon as the residual capacity of
a node becomes zero, all its edges are deleted and ignored from further
consideration. These steps are iterated until the set of edges of the input
graph becomes empty.
Theorem 2. The greedy algorithm produces a solution with approximation
guarantee ½ for the weighted b-matching problem.
Proof. Let O be an optimum solution for a given problem instance and let
A be the solution yielded by the greedy algorithm. For every node v, let
O_v and A_v denote the sets of edges O ∩ Γ_G(v) and A ∩ Γ_G(v), respectively,
where Γ_G(v) is the set of edges in G incident to v. The total weight of
a set of edges T is denoted by w(T). We say that a node v is saturated if
exactly b(v) edges of v are in the greedy solution A, and we let S denote
the set of saturated nodes.

For every node v, we consider the sets Ô_v ⊆ O_v \ A, defined as fol-
lows: each edge e = (u, v) ∈ O \ A is assigned to a set Ô_v for which v is
a saturated node and the weight of any edge in A_v is larger than w(e).
Ties are broken arbitrarily. There must be such a node v, for otherwise e
would be included in A. The idea of the proof is to relate the weight of
edge e to the weights of the edges of A_v, which prevent e from entering
the solution. From the definition of the Ô_v's it follows that

    w(O \ A) = Σ_{v∈S} w(Ô_v).    (4.6)

For every saturated node v we have that |O_v| ≤ b(v) = |A_v|. From this
and from the definition of the Ô_v's we have that

    Σ_{v∈S} w(A_v \ O) ≥ Σ_{v∈S} w(Ô_v).    (4.7)

From Equations (4.6) and (4.7) we obtain

    2w(A) ≥ w(A ∩ O) + Σ_{v∈S} w(A_v \ O)
          ≥ w(A ∩ O) + Σ_{v∈S} w(Ô_v) ≥ w(O),

which concludes the proof.
The analysis is tight, as proved by the following example. Consider
a cycle consisting of three nodes u, v, z and three edges uv, vz, zu. Let
b(u) = b(z) = 1 and let b(v) = 2. Moreover, let w(uv) = w(vz) = 1 while
w(zu) = 1 + ε, for ε > 0. The greedy algorithm would select the edge
whose weight is 1 + ε, while the weight of the optimum solution is 2.
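The instance above can be checked exhaustively (a quick Python verification with ε = 0.1):

```python
from itertools import combinations

# Triangle: b(u) = b(z) = 1, b(v) = 2, w(uv) = w(vz) = 1, w(zu) = 1 + eps.
eps = 0.1
edges = {('u', 'v'): 1.0, ('v', 'z'): 1.0, ('z', 'u'): 1.0 + eps}
b = {'u': 1, 'v': 2, 'z': 1}

def feasible(subset):
    deg = {n: 0 for n in b}
    for u, v in subset:
        deg[u] += 1
        deg[v] += 1
    return all(deg[n] <= b[n] for n in b)

# Optimum by brute force over all edge subsets.
best = max(sum(edges[e] for e in sub)
           for r in range(len(edges) + 1)
           for sub in combinations(edges, r) if feasible(sub))
print(best)  # 2.0 (edges uv and vz)

# Greedy takes the heaviest edge zu first, saturating both u and z.
cap = dict(b)
greedy_val = 0.0
for (u, v), wgt in sorted(edges.items(), key=lambda kv: -kv[1]):
    if cap[u] > 0 and cap[v] > 0:
        greedy_val += wgt
        cap[u] -= 1
        cap[v] -= 1
print(greedy_val)  # 1.1
```

As ε → 0 the ratio greedy/optimum approaches ½, matching the bound of Theorem 2.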
4.6 Experimental evaluation
Flickr. We extract two datasets from Flickr, a photo-sharing Web site.
Table 4.1 shows statistics for the flickr-small and flickr-large
datasets. In these datasets items represent photos and consumers repre-
sent users. In each dataset, each user has posted at least 10 photos, and
each photo has been marked as a favorite at least 5 times.
Recall from our discussion in Section 4.4 that the capacity b(u) of each
user u should be set in proportion to the login activity of the user in the
system. Unfortunately, the login activity is not available in the datasets,
so we decide to use the number of photos n(u) that the user u has posted
as a proxy for their activity. We then use a parameter α > 0 to set the ca-
pacity of each user u as b(u) = α · n(u). Higher values of the parameter α
simulate higher levels of activity in the system.
Next we need to specify the capacity of photos. Since our primary
goal is to study the matching algorithms, specifying the actual capacities
is beyond the scope of this work. Thus we use, as a proxy, the number
of favorites f(p) that each photo p has received. The intuition is that we
want to favor good photos in order to increase user satisfaction. Follow-
ing Section 4.4, we set the capacity of each photo to

    b(p) = f(p) · (α Σ_u n(u)) / (Σ_q f(q)).
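As a concrete illustration, with made-up counts the capacities work out as follows (a sketch; the counts n and f are hypothetical, while alpha plays the role of α above; note that the photo capacities sum to the total user capacity):

```python
# Hypothetical per-user photo counts n(u) and per-photo favorite counts f(p).
n = {'u1': 10, 'u2': 30}
f = {'p1': 5, 'p2': 15}
alpha = 0.2

user_cap = {u: alpha * n[u] for u in n}          # b(u) = alpha * n(u)
total = alpha * sum(n.values())                  # total user capacity
photo_cap = {p: f[p] * total / sum(f.values()) for p in f}

print(user_cap)   # {'u1': 2.0, 'u2': 6.0}
print(photo_cap)  # {'p1': 2.0, 'p2': 6.0}
```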
In order to estimate edge similarities, we represent each photo by its
tags, and each user by the set of all tags he or she has used. Then we
compute the similarity between a photo and a user as the cosine simi-
larity of their tag vectors. We compute all edges whose similarity is
larger than a threshold σ by employing SSJ-2R.
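The edge weights are then plain cosine similarities between sparse tag vectors (a sketch using simple tag-count vectors rather than the actual Flickr data):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

photo = Counter(['sunset', 'beach', 'sea'])
user = Counter(['beach', 'sea', 'mountain', 'beach'])
print(round(cosine(photo, user), 3))  # 0.707
```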
Figure 4.3: Distribution of edge similarities for the datasets.
Figure 4.4: Distribution of capacities for the three datasets.
Yahoo! Answers. We extract one dataset from Yahoo! Answers, a Web
question-answering portal. In yahoo-answers, consumers represent
users, while items represent questions. The motivating application is to
propose unanswered questions to users. Matched questions should fit
the interests of the user. To identify user interests, we represent users by
the weighted set of words in their answers. We preprocess the answers
to remove punctuation and stop-words, stem words, and apply tf·idf
weighting. We treat questions in the same manner.
As before, we extract a bipartite graph with edge weights represent-
ing the similarity between questions and users. We again employ a thresh-
old σ to sparsify the graph, and present results for different density lev-
els. In this case, we set user capacities b(u) by employing the number of
answers n(u) provided by each user u as a proxy for the activity of the
user. We use the same parameter α as for the flickr datasets to set
b(u) = α · n(u). However, for this dataset we use a constant capacity for
all questions, in order to test our algorithms under different settings. For
each question q we set

    b(q) = (α Σ_u n(u)) / |Q|.
The distributions of node capacities are shown in Figure 4.4 while the
distributions of edge similarities are shown in Figure 4.3.
Variants. We also experiment with a number of variants of the STACKMR
algorithm. In particular, we vary the edge-selection strategy employed
in the first phase of the maximal b-matching algorithm (marking). The
STACKMR algorithm proposes to its neighbors edges chosen uniformly
at random. In order to favor heavier edges in the matching, we mod-
ify the selection strategy to propose the ⌈½b(v)⌉ edges with the largest
weight. We call this variant STACKGREEDYMR. We also experiment with
a third variant, in which we choose edges randomly but with probability
proportional to their weights. We choose not to show the results for this
third variant because it always performs worse than STACKGREEDYMR.
Measures. We evaluate the proposed algorithms in terms of both quality
and efficiency. Quality is measured in terms of the b-matching value achieved,
and efficiency in terms of the number of MapReduce iterations required.
Table 4.1: Dataset characteristics. |T|: number of items; |C|: number of
users; |E|: total number of item-user pairs with non-zero similarity.

Dataset          |T|         |C|         |E|
flickr-small     2 817       526         550 667
flickr-large     373 373     32 707      1 995 123 827
yahoo-answers    4 852 689   1 149 714   18 847 281 236
We evaluate our algorithms by varying the following parameters: the
similarity threshold σ, which controls the number of edges that partic-
ipate in the matching; the factor α used in determining capacities; and the
slackness parameter ε used by STACKMR.
Results. Sample results on the quality and efficiency of our matching
algorithms for the three datasets, flickr-small, flickr-large, and
yahoo-answers, are shown in Figures 4.5, 4.6, and 4.7, respectively. For
each plot in these figures, we fix the parameters α and ε and we vary the
similarity threshold σ. Our observations are summarized as follows.
Quality. GREEDYMR consistently produces matchings with higher
value than the two stack-based algorithms. Since GREEDYMR has a bet-
ter approximation guarantee, this result is in accordance with the theory.
In fact, GREEDYMR achieves better results even though the stack algo-
rithms have the advantage of being allowed to exceed node capacities.
However, as we will see next, the violations incurred by the stack-based
algorithms are very small, ranging from practically 0 to at most 6%. In the
flickr-large dataset, GREEDYMR produces solutions that have on
average 31% higher value than the solutions produced by STACKMR. In
flickr-small and yahoo-answers, the improvement of GREEDYMR
is 11% and 14%, respectively. When comparing the two stack algorithms,
we see that STACKGREEDYMR is slightly better than STACKMR. Again,
the difference is more pronounced on the flickr-large dataset.
We also observe that, in general, the b-matching value increases with
the number of edges. This behavior is expected: as the number of edges
increases, the algorithms have more flexibility. Since we add edges by lower-
ing the edge-weight threshold, the gain in the b-matching value tends to
saturate. The only exception to this rule is for STACKGREEDYMR on the
Figure 4.5: flickr-small dataset: matching value and number of itera-
tions as a function of the number of edges (panels for α ∈ {0.2, 0.8, 3.2},
ε = 1.0).
Figure 4.6: flickr-large dataset: matching value and number of itera-
tions as a function of the number of edges (panels for α ∈ {0.4, 1.6, 6.4},
ε = 1.0).
Figure 4.7: yahoo-answers dataset: matching value and number of itera-
tions as a function of the number of edges (panels for α ∈ {0.4, 1.6, 6.4},
ε = 1.0).
Figure 4.8: Violation of capacities for STACKMR.
flickr-large dataset. We believe this is due to the uneven capacity
distribution of the flickr-large dataset (see Figure 4.4). Our belief is
supported by the fact that the decrease is less visible for higher values of α.
Efficiency. Our findings validate the theory also in terms of efficiency.
In most settings the stack algorithms perform better than GREEDYMR.
The only exception is the flickr-small dataset. This dataset is very
small, so the stack algorithms incur excessive overhead in computing
maximal matchings. However, the power of the stack algorithms is best
proven on the larger datasets. Not only do they need fewer MapReduce
steps than GREEDYMR, but they also scale extremely well. The performance
of STACKMR is almost unaffected by increasing the number of edges.
Capacity violations. As explained in Section 4.5.2, the two stack-based
algorithms can exceed the capacity of the nodes by a factor of (1 + ε).
However, in our experiments the algorithms exhibit much lower viola-
tions than the worst case. We compute the average violation as

    Δ = (1/|V|) Σ_{v∈V} max{|M(v)| − b(v), 0} / b(v),

where |M(v)| is the degree of node v in the matching M, and b(v) is
the capacity of each node v in V. Figure 4.8 shows capacity violations
for STACKMR. The violations for STACKGREEDYMR are similar.

Figure 4.9: Normalized value of the b-matching achieved by the GREEDY-
MR algorithm as a function of the number of MapReduce iterations.

When ε = 1, for the flickr-large dataset the violation is as low as 6% in
the worst case. As expected, more violations occur when more edges are
allowed to participate in the matching, either by increasing the number
of edges (by lowering σ) or the capacities of the nodes (by increasing α).
On the other hand, for the yahoo-answers dataset, using the same
ε = 1, the violations are practically zero for any combination of the other
parameters. One reason for the difference between the violations in these
two datasets may be the capacity distributions, as shown in Figure 4.4.
For all practical purposes in our scenarios, these violations are negligible.
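The average violation defined above is straightforward to compute from a matching (a sketch; the matching is given as a list of edges):

```python
from collections import Counter

def avg_violation(matching, b):
    """Average relative capacity violation over all nodes."""
    deg = Counter()
    for u, v in matching:
        deg[u] += 1
        deg[v] += 1
    return sum(max(deg[v] - b[v], 0) / b[v] for v in b) / len(b)

# Node 'a' exceeds its capacity by one; the other nodes do not.
print(avg_violation([('a', 'b'), ('a', 'c')], {'a': 1, 'b': 2, 'c': 2}))
```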
Any-time stopping. An interesting property of GREEDYMR is that it
produces a feasible, albeit suboptimal, solution at each iteration. This
allows us to stop the algorithm at any time, or to query the current solution
and let the algorithm continue in the background. Furthermore, GREEDYMR
converges very fast to a good global solution. Figure 4.9 shows
the value of the solution found by GREEDYMR as a function of the it-
eration. For the three datasets, flickr-small, flickr-large, and
yahoo-answers, the GREEDYMR algorithm reaches 95% of its final b-
matching value within 28.91%, 44.18%, and 29.35% of the total number
of iterations required, respectively. The latter three numbers are averages
over all the parameter settings we tried in each dataset.
4.7 Conclusions
Graph problems arise continuously in the context of Web mining. In this
chapter we investigate the graph b-matching problem and its application
to content distribution. Our goal is to help users effectively experience
Web 2.0 sites with large collections of user-generated content.
We described two iterative MapReduce algorithms, STACKMR and
GREEDYMR. Both algorithms rely on the same basic communication pat-
tern. This pattern is able to support a variety of computations and pro-
vides a scalable building block for graph mining in MR.
To the best of our knowledge, STACKMR and GREEDYMR are the
first solutions to graph b-matching in MR. Both algorithms have prov-
able approximation guarantees and scale to realistic-sized datasets. In
addition, STACKMR provably requires only a poly-logarithmic number
of MapReduce iterations. On the other hand, GREEDYMR has a better
approximation guarantee and allows querying the solution at any time.
We evaluated our algorithms on two large real-world datasets from
the Web and highlighted the tradeoffs between quality and performance
of the two solutions. GREEDYMR is a good solution for practi-
tioners: in our experiments it consistently found the best b-matching
value, and it is easy to implement and reason about. Neverthe-
less, STACKMR has high theoretical and practical interest because of its
better running time. It scales gracefully to massive datasets and offers
high-quality results.
Chapter 5
Harnessing the Real-Time
Web for Personalized News
Recommendation
The real-time Web is an umbrella term that encompasses several social micro-
blogging services. It can be modeled as a graph of nodes exchanging
streams of updates via publish-subscribe channels. As such, it crosses the
boundaries between graphs and streams as defined in the taxonomy of
Chapter 1. Famous examples include Facebook's news feed and Twitter.
In this chapter we tackle the problem of mining the real-time Web
to suggest articles from the news stream. We propose T.REX, a new
methodology for generating personalized recommendations of news ar-
ticles by leveraging the information in users' Twitter personas. We use a
mix of signals to model the relevance of news articles for users: the content
of the tweet stream of the users, the profile of their social circles, and
recent topic popularity in news and Twitter streams.
We validate our approach on a real-world dataset from Yahoo! news
and one month of tweets that we use to build user profiles. We model the
task as click prediction and learn a personalized ranking function from
click-through data. Our results show that a mix of various signals from
the real-time Web is an effective indicator of user interest in news.
5.1 Introduction
Information overload is a term referring to the difculty in taking decisions
or understanding an issue when there is more information than one can
handle. Information overload is not a new problem. The term itself pre-
cedes the Internet by several years Tofer (1984). However, digital and
automated information processing has aggravated the problem to un-
precedented levels, transforming it into one of the crucial issues in the
modern information society.
In this work we focus on one of the most common daily activities:
reading news articles online. Every day the press produces thousands
of news articles covering a wide spectrum of topics. For a user, finding
relevant and interesting information in this ocean of news is a daunting
task and a perfect example of information overload.
News portals like Yahoo! news and Google news often resort to rec-
ommender systems to help the user find relevant pieces of information.
In recent years the most successful recommendation paradigm has been
collaborative filtering (Adomavicius and Tuzhilin, 2005). Collaborative fil-
tering requires the users to rate the items they like. The rating can be
explicit (e.g., like, +1, number of stars) or implicit (e.g., click on a link,
download). In both cases, collaborative filtering leverages a closed feed-
back loop: user preferences are inferred by looking only at the user inter-
action with the system itself. This approach suffers from a data-sparsity
problem: it is hard to make recommendations for users for whom there
is little available information, or for brand-new items, since little or no
feedback is available for such cold items.
At the same time, at an increasing rate, web users access news ar-
ticles via micro-blogging services and the so-called real-time web. By
subscribing to feeds of other users, such as friends, colleagues, domain
experts and enthusiasts, as well as to organizations of interest, they ob-
tain timely access to relevant and personalized information. Yet, infor-
mation obtained via micro-blogging services is not complete, as highly
relevant events can easily be missed by users if none of their contacts
posts about them.
We propose to recommend news articles to users by combining two
sources of information, news streams and micro-blogs, and leveraging
the best features of each. News streams have high coverage as they ag-
gregate a very large volume of news articles obtained from many differ-
ent news agencies. On the other hand, information obtained by micro-
blogging services can be exploited to address the problems of information
ltering and personalization, as users can be placed in the context of their
social circles and personal interests.
Our approach has the following advantages. First, we are able to
leverage the information from social circles in order to offer personal-
ized recommendations and overcome the data-sparsity problem: if we
do not have enough information for a user, there should be significantly
more information from the social circle of the user, which is presumably
relevant. Second, it has repeatedly been reported that news break out
earlier on the real-time web than in traditional media, and we would
like to harness this property to provide timely recommendations.
With more than 200 million users, Twitter is currently the most pop-
ular real-time web service. Twitter is an emerging agora where users
publish short text messages, also known as tweets, and organize them-
selves into social networks. Interestingly, in many cases news has been
published and commented on Twitter before any news agency, as in the
case of Osama Bin Laden's death in 2011
(http://www.bbc.co.uk/news/technology-13257940) or Tiger Woods'
car crash in 2009
(http://techcrunch.com/2009/11/27/internet-twitter-tiger-woods).
Due to its popularity, traditional news providers, such as magazines and
news agencies, have become Twitter users: they exploit Twitter and its
social network to disseminate their contents.
Example. In Figure 5.1 we show the normalized number of tweets and
news articles regarding Osama Bin Laden during the time period be-
tween May 1st and 4th, 2011. For the news, we also report the num-
ber of clicks in the same period. We notice that the number of relevant
tweets ramps up earlier than the news, meaning that the information
spread through Twitter even before any press release. Later on, while
the number of tweets decreases quickly and stabilizes around a relatively small
number of tweets, the number of published news articles continues to be
quite large. The number of clicks on news articles somewhat follows the
number of published news, even though there is a significant delay until
the time the users actually click and read an article on the topic. Figure 5.2
presents similar information for the Joplin tornado. In this case, even
though there is great activity both on Twitter and on news streams,
users start reading the news only after one day. About 60% of the clicks
on a news article occur starting from 10 hours after its publication.
[Time-series plot omitted: normalized volume of news, tweets, and clicks between May 1 and May 3, 2011.]
Figure 5.1: Osama Bin Laden trends on Twitter and news streams.
The goal of this work is to reduce the delay between the publication
of a news article and its access by a user. We aim at helping users find
relevant news as soon as it is published. Our envisioned application
scenario is a feature operating on a news aggregator like Yahoo! news or
Google news. Our objective is to develop a recommendation system that
provides the users with fresh, relevant news by leveraging their tweet
[Time-series plot omitted: normalized volume of news, tweets, and clicks between May 22 and May 26, 2011.]
Figure 5.2: Joplin tornado trends on Twitter and news streams.
stream. Ideally, the users log into the news portal and link their Twitter
account to their portal account in order to provide access to their tweet
stream. The portal analyzes the tweet stream of the users and provides
them with personalized recommendations.
Solving this problem poses a number of research challenges. First,
the volume of tweets and news is significantly large. It is necessary to
design a scalable recommender system able to handle millions of users,
tweets, and news articles. Second, both tweets and news are unbounded
streams of items arriving in real time. The set of news from which to
choose recommendations is highly dynamic: news articles are published
continuously and, more importantly, they are related to events that cannot
be predicted in advance. Also, by nature, news has a short life cycle, since
articles are replaced by updated ones on the same topic, or become
obsolete. Therefore, a recommender system should be able to find
relevant news early, before it loses its value over time. Third, the
nature of tweets, short and jargon-like, complicates the task of modeling
the interests of users.
Finally, personalization should leverage user profiles to drive the user
to less popular news. But it should not prevent the system from suggesting
news of general interest, even when unrelated to the user profile, e.g., the
Fukushima accident.
In summary, our contributions are the following.
We propose an adaptive, online recommendation system. In con-
trast, typical recommendation systems operate in offline mode.
We present a new application of tweets. Twitter has mainly been
used to identify trending topics and to analyze information spread; its
application to recommendation has received little interest in the com-
munity. Our recommendation system is called T.REX, for Twitter-based
news recommendation system.
Our system provides personalized recommendations by leverag-
ing information from the tweet stream of users, as well as from the
streams of their social circles. By incorporating social information we
address the issue of data sparsity: when we do not have enough infor-
mation for a user, for example for a new user or for a user who rarely
tweets, the available information in the social circle of the user provides
a proxy to his interests.
The rest of this chapter is organized as follows. In Section 5.3 we for-
malize the news recommendation problem and introduce our proposed
personalized news ranking function. In Section 5.5 we describe the learn-
ing process of the ranking function based on click data. In Section 5.6 we
present the experimental setting used in the evaluation of our algorithms,
together with our results. Finally, in Section 5.2 we discuss other related work.
5.2 Related work
Recommender systems can be roughly divided into two large categories:
content-based and collaborative filtering (Goldberg et al., 1992).
The large-scale system for generating personalized news recommen-
dations proposed by Das et al. (2007) falls in the second category. The au-
thors exploit a linear combination of three different content-agnostic ap-
proaches based only on click-through data: clustering of users based on
minhashing, probabilistic latent semantic indexing of user profiles and
news, and news co-visitation counts. Even though the system can update
its recommendations immediately after a new click is observed, we have
seen that click information arrives with a significant delay, and therefore
the system may fail to detect emerging topics early.
Using Twitter to provide fresh recommendations about emerging top-
ics has received a great deal of attention recently. A number of studies
attempt to understand the Twitter social network, the information-
spreading process, and how to discover emerging topics of interest over
such a network (Bakshy et al., 2011; Cataldi et al., 2010; Java et al., 2007).
Chen et al. (2010) propose a recommender system for URLs posted
in Twitter. A user study shows that both the content-based user profile
and the user's social neighborhood play a role, but that the most im-
portant factor in the recommendation performance is the social ranking,
i.e., the number of times a URL is mentioned in the neighborhood of a
given user. The work presented by Garcia Esparza et al. (2010) exploits
data from a micro-blogging movie review service similar to Twitter. The
user profile is built on the basis of the user's posts. Similarly, a movie
profile is generated from the posts associated to the movie. The proposed
prototype resembles a content-based recommender system, where users
are matched against recommended items.
A small user study by Teevan et al. (2011) reports that 49% of users
search Twitter while looking for information related to news, to topics
gaining popularity, and in general to keep up with events. The analysis
of a larger crawl of Twitter shows that about 85% of Twitter posts
are about headlines or persistent news (Kwak et al., 2010). As a result,
this abundance of news-related content makes Twitter the ideal source of
information for the news recommendation task.
Akcora et al. (2010) use Twitter to detect abrupt opinion changes.
Based on an emotion-word corpus, the proposed algorithm detects opinion
changes, which can be related to the publication of some interesting news,
although no actual automatic linking to news is produced. The system
proposed by Phelan et al. (2011) is a content-based approach that uses
tweets to rank news via tf-idf. A given set of tweets, either public or of a
friend, is used to build a user profile, which is matched against the news
coming from a set of user-defined RSS feeds. McCreadie et al. (2010)
show that, even if useful, URL links present in blog posts may not be suf-
ficient to identify interesting news due to their sparsity. Their approach
exploits user posts as if they were votes for the news of a given day. The
association between news and posts is estimated by using a divergence-
from-randomness model. They show that a Gaussian weighting scheme
can profitably be used to predict the importance of a news article on a
given day, given the posts of a few previous days.
Our approach pursues the same direction, with the goal of exploit-
ing Twitter posts to predict news of interest. To this end, rather than
analyzing the raw text of a tweet, we choose to extract the entities dis-
cussed in tweets. Indeed, Twitter highlights in its Web interface the
so-called trending topics, i.e., sets of words occurring frequently in recent
tweets. Asur et al. (2011) crawled all the tweets containing the keywords
identifying a Twitter trending topic in the 20 minutes before the topic was
detected by Twitter. The authors find that the popularity of a topic can
be described as a multiplicative growth process with noise, and therefore
the cumulative number of tweets related to a trending topic increases
linearly with time.
Kwak et al. (2010) conduct an interesting study on a large crawl of
Twitter, and analyze a number of features and phenomena across the
social network. We highlight a few interesting findings. After the initial
break-out, the cumulative number of tweets of a trending topic increases
linearly, as suggested by Asur et al. (2011), independently of the
number of users. Almost 80% of the trending topics have a single activity
period, i.e., they occur in a single burst, and 93% of such activity periods
last less than 10 days. Also, once a tweet is published, half of its re-tweets
occur within an hour and 75% within one day. Finally, once re-tweeted,
tweets quickly spread four hops away. These studies confirm the fast
information spreading that occurs on Twitter.
A basic approach for building topic-based user profiles from tweets
is proposed by Michelson and Macskassy (2010). Each capitalized non-
stopword is considered an entity. The entity is used to query Wikipedia,
and the categories of the retrieved page are used to update the list of
topics of interest for the user who authored the tweet.
Abel et al. (2011) propose a more sophisticated user model to support
news recommendation for Twitter users. They explore different ways of
modeling user profiles by using hashtags, topics, or entities, and conclude
that entity-based modeling gives the best results. They employ a simple
recommender algorithm that uses cosine similarity between user profiles
and tweets. The authors recommend tweets containing news URLs, and
re-tweets are used as ground truth.
Based on these results, we choose to use Wikipedia to build an entity-
based user model. We use the SPECTRUM system (Paranjpe, 2009) to map
every single tweet to a bag of entities, each corresponding to a Wikipedia
page. We apply the same entity-extraction process to the stream of news
with the goal of overcoming the vocabulary-mismatch problem. Our pro-
posed model borrows from the aforementioned works by exploiting a
blend of content-based and social-based profile enrichment. In addition,
it is able to discover emerging trends on Twitter by measuring entity
popularity and taking aging effects into account. Finally, we propose to
learn the ranking function directly from available data.
5.3 Problem definition and model
Our goal is to harness the information present in tweets posted by users
and by their social circles in order to make relevant and timely recom-
mendations of news articles. We proceed by introducing our notation and
defining formally the problem that we consider in this work. For quick
reference, our notation is summarized in Table 5.1.
Table 5.1: Table of symbols.

Symbol                Definition
𝒩 = n_0, n_1, ...     Stream of news
n_i                   i-th news article
𝒯 = t_0, t_1, ...     Stream of tweets
t_i                   i-th tweet
𝒰 = u_0, u_1, ...     Set of users
𝒵 = z_0, z_1, ...     Set of entities
τ(n_i)                Timestamp of the i-th news article n_i
τ(t_i)                Timestamp of the i-th tweet t_i
τ_c(n_i)              Timestamp of the click on n_i
S                     Social network matrix
The social network is captured by the matrix S̃, defined as the |𝒰| × |𝒰| matrix

    S̃ = Σ_{i=1}^{d} λ^i S^i,

where S is the row-normalized adjacency matrix of the social network, d is the
maximum hop distance up to which users may influence their neighbors, and
λ is a damping factor.
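As a minimal sketch, the matrix S̃ can be computed directly from its definition. The toy network and the values λ = 0.5, d = 2 below are illustrative only, not those used in our experiments:

```python
import numpy as np

def social_influence(S, d=2, lam=0.5):
    """Compute S~ = sum_{i=1}^{d} lam^i * S^i, where S is the
    row-normalized adjacency matrix, d the maximum hop distance,
    and lam the damping factor."""
    S_tilde = np.zeros_like(S)
    S_power = np.eye(S.shape[0])
    for i in range(1, d + 1):
        S_power = S_power @ S              # S^i
        S_tilde += (lam ** i) * S_power
    return S_tilde

# Toy 3-user network: u0 follows u1, u1 follows u2.
S = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
S_tilde = social_influence(S, d=2, lam=0.5)
```

One hop contributes λ·S, two hops λ²·S², so u2 influences u0 with weight 0.25 even though u0 does not follow u2 directly.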
Next we model the profile of a user based on the content that the user
has generated. We first define a binary authorship matrix A to capture
the relationship between users and the tweets they produce.
Definition 5 (Tweet authorship A). Let A be a |𝒰| × |𝒯| matrix where A(i, j)
is 1 if u_i is the author of t_j, and 0 otherwise.
The matrix A can be extended to deal with different types of relation-
ships between users and posts, e.g., to weigh re-tweets or likes differently.
In this work, we limit the concept of authorship to the posts actually
written by the user.
It is worth noting that the tweet stream of a user is composed of tweets
authored by the user and by people in the social neighborhood of the
user. This is a generalization of the home timeline as known in Twitter.
We observe that news and tweets often happen to deal with the same
topic. Sometimes a given topic is pushed onto Twitter by a news source,
and then it spreads throughout the social network. At other times a given
topic is first discussed by Twitter users, and it is later reflected in
the news, which may or may not be published back on Twitter. In both
cases, it is important to discover the current trending topics
in order to promptly recommend news of interest. We model the rela-
tionship between tweets and news by introducing an intermediate layer
between the two streams. This layer is populated by what we call entities.
Definition 6 (Tweets-to-news model M). Let 𝒩 be a
stream of news, 𝒯 a stream of tweets, and 𝒵 = z_0, z_1, . . . a set of entities.
We model the relationship between tweets and news as a |𝒯| × |𝒩| matrix M,
where M(i, j) is the relatedness of tweet t_i to news n_j, and it is computed as

    M = T · N,

where
T is a |𝒯| × |𝒵| row-wise normalized matrix with T(i, j) representing the re-
latedness of tweet t_i to entity z_j;
N is a |𝒵| × |𝒩| column-wise normalized matrix with N(i, j) representing the
relatedness of entity z_i to news n_j.
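The construction of M can be sketched as follows; the raw relatedness scores below are toy values, and the two normalizations follow the definition:

```python
import numpy as np

# Toy dimensions: 2 tweets, 3 entities, 2 news articles.
T = np.array([[2.0, 1.0, 0.0],     # raw tweet-entity relatedness
              [0.0, 0.0, 3.0]])
N = np.array([[1.0, 0.0],          # raw entity-news relatedness
              [1.0, 2.0],
              [0.0, 1.0]])

T = T / T.sum(axis=1, keepdims=True)   # row-wise normalization
N = N / N.sum(axis=0, keepdims=True)   # column-wise normalization

M = T @ N   # M(i, j): relatedness of tweet t_i to news n_j
```

Tweet t_0 mentions entities z_0 and z_1, both related to news n_0, so M(0, 0) is the largest entry in the first row.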
The set of entities 𝒵 introduces a middle layer between the stream of
news and the stream of tweets that allows us to generalize our analysis.
First, this layer allows us to overcome any vocabulary-mismatch problem
between the two streams, since both streams are mapped onto the entity
space. Second, rather than monitoring the relevance of a specific news
article or tweet, we propose to measure the relevance of an entity.
A number of techniques can be used to extract entities from news
and tweets. A naïve approach is to let each term in the dictionary play
the role of an entity. In this case T(t, z) can be estimated as the number
of occurrences of the term z in the tweet t, or as a tf-idf score, and
similarly for N. Alternatively, probabilistic latent semantic indexing can
map tweets and news onto a set of latent topics (Hofmann, 1999).
In this work we follow a third approach: we use an existing entity-
extraction system. In particular we use the SPECTRUM system, which
was proposed by Paranjpe (2009). Given any fragment of text, the SPEC-
TRUM system identifies entities related to Wikipedia articles. Therefore,
we assume that 𝒵 consists of the set of all titles of Wikipedia articles. This
choice has some interesting advantages. First, once an entity is detected,
it is easy to propose a meaningful label to the user. Second, it allows us
to include additional external knowledge into the ranking, such as geo-
graphic position, categorization, or the number of recent edits by Wikipedia
users. Although we choose to use SPECTRUM, we note that our model is
independent of the specific entity-extraction technique employed.
Example. Consider the following tweet by user KimAKelly: "Miss Lib-
erty is closed until further notice." The words "Miss Liberty" are mapped
to the entity/Wikipedia page Statue of Liberty, which is an interesting
topic due to the just-announced renovation. This makes it possible to rank
highly news regarding the Statue of Liberty, e.g., "Statue of Liberty to Close
for One Year after 125th Anniversary" by Fox news. Potentially, since the
Wikipedia page is geo-referenced, it is possible to boost the ranking of
the news for users living nearby, or interested in the entity New York.
Example. Consider the following tweet by user NASH55GARFIELD:
"We don't know how we're gonna pay social security benefits to the el-
derly & we're spending 27.25 mill $ to renovate the Statue of Liberty!!!".
The following entities/Wikipedia pages are extracted: welfare (from "so-
cial benefits"), social security, and Statue of Liberty. This entity extraction
suggests that news regarding the announced Statue of Liberty renova-
tion, or the planned U.S. medicare cuts, are of interest to the user.
The three matrices S̃, A, and M can be combined to obtain the social-based
relatedness

    Σ = S̃ · A · M,

where Σ is a |𝒰| × |𝒩| matrix and Σ(u_i, n_j) is the relevance of news n_j for
user u_i.
According to the definition of social-based relatedness Σ, the rele-
vance of a news article is computed by taking into account the tweets
authored by neighboring users.
The matrices Φ and Σ measure content-based similarity, with no ref-
erence to popularity or freshness of tweets, news articles, and entities.
In order to provide timely recommendations and catch up with trending
news, we introduce a popularity component, which combines the hot-
ness of entities in the news stream and the tweet stream.
Definition 9 (Entity-based news popularity Π). Given a stream of news 𝒩,
a set of entities 𝒵, and their relatedness matrix N, the popularity of 𝒩 is defined
as

    Π = Z · N,

where Z is a row-wise vector of length |𝒵| and Z(i) is a measure of the popular-
ity of entity z_i. The resulting Π is a row-wise vector of length |𝒩|, where Π(j)
measures the popularity of the news article n_j.
The vector Z holds the popularity of each entity z_i. The counts of
the popularity vector Z need to be updated as new entities of interest
arise in the news stream 𝒩 and the tweet stream 𝒯. An important aspect
in updating the popularity counts is to take recency into account: new
entities of interest should dominate the popularity counts of older enti-
ties. In this work, we choose to update the popularity counts using an
exponential decay rule. We discuss the details in Section 5.3.1. However,
note that the popularity update is independent of our recommendation
model, and any other decaying function can be used.
Finally, we propose a ranking function for recommending news arti-
cles to users. The ranking function is a linear combination of the scoring
components described above. We plan to investigate the effect of non-
linear combinations in the future.
Definition 10 (Recommendation ranking R). Given the social-based related-
ness Σ, the content-based relatedness Φ, and the news popularity Π, the recom-
mendation ranking is

    R(u, n) = α · Σ(u, n) + β · Φ(u, n) + γ · Π(n),

where α, β, γ are coefficients that specify the relative weight of the components.
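A sketch of the ranking computation; the scores and the coefficients α, β, γ below are toy placeholders, not values learned from click data:

```python
import numpy as np

def rank_news(Sigma, Phi, Pi, alpha, beta, gamma):
    """R(u, n) = alpha*Sigma(u, n) + beta*Phi(u, n) + gamma*Pi(n).
    Sigma, Phi: |U| x |N| matrices; Pi: length-|N| vector,
    broadcast across users."""
    return alpha * Sigma + beta * Phi + gamma * Pi

# Toy scores for 2 users and 3 candidate news articles.
Sigma = np.array([[0.1, 0.0, 0.2],
                  [0.0, 0.3, 0.0]])
Phi   = np.array([[0.4, 0.1, 0.0],
                  [0.0, 0.0, 0.5]])
Pi    = np.array([0.2, 0.2, 0.6])

R = rank_news(Sigma, Phi, Pi, alpha=0.5, beta=0.3, gamma=0.2)
top = R.argsort(axis=1)[:, ::-1]   # per-user ranking, best first
```

Note how the popularity term Π lifts the globally hot article n_2 for both users, while the personal terms break ties differently per user.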
At any given time, the recommender system produces a set of news
recommendations by ranking a set of candidate news articles, e.g., the most
recent ones, according to the ranking function R. To motivate the pro-
posed ranking function we note similarities with popular recommenda-
tion techniques. When β = γ = 0, the ranking function R resembles
collaborative filtering, where user similarity is computed on the basis
of social circles. When α = γ = 0, the function R implements a
content-based recommender system, where a user is profiled by the bag
of entities occurring in the tweets of the user. Finally, when α = β = 0,
the most popular items are recommended, regardless of the user profile.
Note that Σ, Φ, and R are all time dependent. At any given time
the social network and the set of authored tweets vary, thus affecting Σ
and Φ. More importantly, some entities may abruptly become popular,
hence of interest to many users. This dependency is captured by Π. While
the changes in Σ and Φ derive directly from the tweet stream 𝒯 and the
social network S, the update of Π is non-trivial, and it plays a fundamental
role in the recommendation system that we describe in the next section.
5.3.1 Entity popularity
We complete the description of our model by discussing the update of
the entity popularity counts. We motivate our approach by empirically
observing how user interest for particular entities decays over time.
[Plot omitted: cumulative normalized counts of news, tweets, and clicks between May 1 and May 3, 2011.]
Figure 5.3: Cumulative Osama Bin Laden trends (news, Twitter and clicks).
Figure 5.3 shows the cumulative distribution of occurrences for the
same entity as Figure 5.1. The cumulative distribution makes it more
evident that the entity appears much earlier in Twitter than in the news.
If we consider the number of clicks on news as a surrogate of
user interest, we notice that, with a delay of about one day, the entity re-
ceives a great deal of attention. This delay is probably due to the fact
that the users have not been informed about the event yet. On the other
hand, we observe that after two days the number of clicks drops, possibly
because the interest of users has been saturated or diverted to other events.
[Scatter plot omitted: news-click delay in minutes, log-log scale.]
Figure 5.4: News-click delay distribution.
We note that the example above is not an exception; rather, it describes
the typical behavior of users with respect to reading news articles online.
Figure 5.4 shows a scatter plot of the distribution of the delay between the
time a news article is published and when it is clicked by users.
The figure considers all the news articles and all the clicks in our dataset,
as described in Section 5.6.1.
Only a very small number of news articles are clicked within the first hour
of their publication. Nonetheless, 76.7% of the clicks happen within
one day (1440 minutes) and 90.1% within two days (2880 minutes). Anal-
ogous observations can be made by looking at Figure 5.5, which shows
the cumulative distribution of the delay and resembles a typical Pareto
distribution. The slope of the curve is very steep up to about 2000 min-
utes (33 hours), but it flattens out quite rapidly soon after.
[Plot omitted: cumulative fraction of clicks as a function of news-click delay, 0 to 10000 minutes.]
Figure 5.5: Cumulative news-click delay distribution.
From these empirical observations on the delay between news pub-
lishing and user clicks we can draw the following conclusions:
1. The increase in the number of tweets and news related to a given
entity can be used to predict the increase of interest of the users.
2. News become stale after two days.
The first observation motivates us to inject a popularity component into
our recommendation model. The second observation suggests updating
popularity scores using a decaying function. We choose an exponentially
decaying function, which has the advantage of allowing frequent items
in the data stream model to be counted efficiently. This choice makes it
possible to monitor entity popularity in high-speed data streams by using
only a limited amount of memory (Cormode et al., 2008).
Definition 11 (Popularity-update rule). Given a stream of news 𝒩 and a
stream of tweets 𝒯, at time τ the popularity vector Z is computed at the end of
every time window of fixed width as follows:

    Z_τ = λ · Z_{τ-1} + w_T · H_T + w_N · H_N.

The vectors H_T and H_N are estimates of the expected number of mentions of
each entity occurring in tweets and news, respectively, during the latest time
window. They are also called hotness coefficients. The weights w_T and w_N
measure the relative impact of tweets and news on the popularity count, and
λ < 1 is an exponential forgetting factor.
The popularity-update rule has the effect of promptly detecting en-
tities that suddenly become popular, and of spreading this popularity
over the subsequent time windows. Following our experimental evidence,
we fix λ so that a signal becomes negligible after two days. We set both
coefficients w_T and w_N to 0.5 to give equal weight to tweets and news.
Asur et al. (2011) show that the number of tweets for a trending topic
grows linearly over time. We therefore compute the expected counts
H_T and H_N of an entity in tweets and news by using the first-order
derivative of their cumulative counts, measured at the last time window τ - 1.
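The update rule, with the hotness coefficients estimated as first-order derivatives of the cumulative counts, can be sketched as follows. The value λ = 0.9 and the counts are illustrative; in the system, λ is fixed so that a signal fades after two days:

```python
import numpy as np

def update_popularity(Z_prev, cum_tweets, cum_news,
                      lam=0.9, w_t=0.5, w_n=0.5):
    """Z_tau = lam * Z_{tau-1} + w_T * H_T + w_N * H_N.
    H_T and H_N are estimated as the first-order derivative of the
    cumulative mention counts, i.e. the counts observed in the
    latest window: cum[tau] - cum[tau-1]."""
    H_t = cum_tweets[-1] - cum_tweets[-2]
    H_n = cum_news[-1] - cum_news[-2]
    return lam * Z_prev + w_t * H_t + w_n * H_n

# Cumulative mention counts for one entity over three windows.
cum_tweets = np.array([10.0, 25.0, 45.0])
cum_news   = np.array([2.0, 4.0, 10.0])
Z = 0.0
Z = update_popularity(Z, cum_tweets[:2], cum_news[:2])
Z = update_popularity(Z, cum_tweets, cum_news)
```

Old windows are discounted by λ at each step, so an entity whose mentions stop growing sees its popularity decay geometrically.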
5.4 System overview
Figure 5.6 shows a general overview of the T.REX system. T.REX main-
tains a user model for each user, composed of a content-relatedness
and a social-relatedness component. Conceptually, both components
process tweets at high speed to extract and count entities. This operation
can be done efficiently and in parallel. Efficiently maintaining
frequency moments in a stream is a classical problem in the literature on
streaming algorithms (Alon et al., 1999).
The social component also holds truncated PageRank values for the
followees of each user. These values are used to weigh the importance
of a followee's profile when computing recommendation scores for the
user. Truncated PageRank values can be updated periodically, or whenever
the social graph has changed more than a given threshold.
[Diagram omitted: the user's tweets, the followees' tweets, and the news stream feed the T.REX user model, which produces a personalized ranked list of news articles.]
Figure 5.6: Overview of the T.REX system.
PageRank is a typical batch operation and can easily be implemented
in MR. Given that we use a truncated version of PageRank, this compu-
tation is even cheaper, because the values can spread only d hops away,
and therefore convergence is faster. The parameters of the ranking function
can likewise be computed offline and updated in batch fashion. The
users can possibly be clustered into groups, and different parameters can
be learned for each group. We leave the study of the effects of clustering
for future work. Learning the parameters and the potential clustering
are best seen as offline processes, given that the general preferences of
a user are not supposed to change too rapidly.
However, processing high-speed streams is a poor fit for MR. High-
speed counting is necessary both for the user model and for entity popu-
larity. Even though MR has been extended to deal with incremental and
continuous computations (Condie et al., 2009), there are more natural
matches for computing on streams. One reason is that the fault-tolerance
guarantees of MR are too strong in this case. We would rather lose some
tweets than wait for them to be reprocessed upon a crash, as long as the
system is able to keep running. Freshness of data is a key property to
preserve when mining streams, even at the cost of some accuracy.
Therefore, we propose to use the actor model to parallelize our sys-
tem (Agha, 1986). In particular, we describe how to implement it on S4.
We use two different types of PEs to update the model: one for the con-
tent component and one for entity popularity. The social component can
be implemented by aggregating the values from the various content compo-
nents with pre-computed weights, so it does not need a separate PE. The
weights can be updated in batches and computed in MR.
A content PE receives tweets keyed on the author, extracts entities
from the tweets, and updates the counts for the entities seen so far. Then it
sends messages keyed on the entity to popularity PEs in order to update
their entity popularity counts.
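A minimal sketch of such a content PE follows. The class, the emit callback, and the toy entity extractor are illustrative stand-ins, not the S4 API or the Spectrum extractor used in the thesis.

```python
from collections import Counter

def extract_entities(text):
    # Stand-in for the Spectrum entity extractor: for illustration,
    # treat capitalized words as entities.
    return [w for w in text.split() if w[:1].isupper()]

class ContentPE:
    """Keyed on the tweet author; maintains per-user entity counts."""
    def __init__(self, author, emit):
        self.author = author      # PE key
        self.counts = Counter()   # user model: entity -> count
        self.emit = emit          # sends keyed messages downstream

    def on_tweet(self, text):
        for entity in extract_entities(text):
            self.counts[entity] += 1
            # Forward keyed on the entity, to reach a popularity PE.
            self.emit(entity, 1)

sent = []
pe = ContentPE("alice", emit=lambda key, value: sent.append((key, value)))
pe.on_tweet("reading about Fukushima and Japan")
```

Keying on the author keeps each user model local to one PE, while the emitted messages are re-keyed on the entity so that all mentions of the same entity converge on the same popularity PE.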
The entity extraction operation is also performed on incoming news
by a keyless PE. This PE just performs entity extraction on the incoming
news articles and forwards the mention counts downstream. The
news articles with their extracted entities are kept in a pool until they
become stale, after which they are discarded.
A popularity PE receives entity count updates keyed on the entity. It
updates its own counts by using the timestamp of the mention and an
efficient procedure for exponentially decayed counters (Cormode et al.,
2008). As popularity is updated at fixed intervals, caching and batching
can be used to reduce the network load of the system. For instance,
updates can be sent only at time window boundaries.
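The decayed counter can be maintained lazily, rescaling the stored value only when it is read or updated. The sketch below is a simplification of the procedure of Cormode et al. (2008), and the half-life value is an illustrative assumption.

```python
import math

class DecayedCounter:
    """Exponentially decayed mention count, updated lazily from
    mention timestamps."""
    def __init__(self, half_life=3600.0):
        self.rate = math.log(2) / half_life  # decay rate lambda
        self.value = 0.0
        self.last = 0.0  # timestamp of the last update

    def add(self, timestamp, weight=1.0):
        # Decay the stored value up to `timestamp`, then add the mention.
        self.value *= math.exp(-self.rate * (timestamp - self.last))
        self.value += weight
        self.last = timestamp

    def get(self, timestamp):
        # Read without mutating: decay the stored value to `timestamp`.
        return self.value * math.exp(-self.rate * (timestamp - self.last))

c = DecayedCounter(half_life=3600)
c.add(0); c.add(0)
remaining = c.get(3600)  # one half-life later
```

Because the stored value is rescaled only on access, the counter needs constant space and constant time per mention, which is what makes it suitable for high-speed streams.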
Figure 5.7 depicts an overview of the ranking process at query time.
When the user logs into the system, the pool of news articles is ranked
according to the user model and the entity popularity information.
The search for relevant news can be sped up by employing an
inverted index of the entities in each news article; indeed, the process is
very similar to answering a query in a Web search engine.
All the standard query-processing techniques from the information
retrieval literature can be used. For example, it is possible to
employ early-exit optimizations (Cambazoglu et al., 2010). Ranking of
news can proceed in parallel at different PEs, and finally only the top-k
news articles from each PE need to be aggregated at a single place.
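The final aggregation step amounts to keeping the global top-k out of the per-PE result lists. A minimal sketch, with hypothetical article identifiers and scores:

```python
import heapq
from itertools import chain

def merge_topk(partial_rankings, k):
    """Keep only the global top-k from the (score, article) lists
    produced by the different ranking PEs."""
    return heapq.nlargest(k, chain.from_iterable(partial_rankings),
                          key=lambda pair: pair[0])

pe1 = [(0.9, "n1"), (0.4, "n3")]
pe2 = [(0.7, "n2"), (0.1, "n4")]
top = merge_topk([pe1, pe2], k=2)
```

Since each PE ships only its local top-k, the aggregation point examines at most k items per PE rather than the whole candidate pool.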
Figure 5.7: T.REX news ranking dataflow.
5.5 Learning algorithm
The next step is to estimate the parameters of the relevance model that
we developed in the previous section. The parameters consist of the
coefficients α, β, γ, used to adjust the relative weight of the components
of the ranking function, subject to ordering constraints of the form

    R_τ(u, n_i) > R_τ(u, n_j).    (5.1)

The click stream from the Yahoo! toolbar identifies a large number of
constraints of the form of Equation (5.1) that the optimal ranking function
must satisfy. As in the learning-to-rank problem (Joachims, 2002), finding
the optimal ranking function is an NP-hard problem. Additionally,
considering that some of the constraints could be contradictory, a feasible
solution may not exist. As usual, the learning problem is translated
to a Ranking-SVM optimization problem.
Problem 4 (Recommendation Ranking Optimization).

Minimize:
    V(α, β, γ, ξ) = ½ ‖ω‖² + C Σ_{ij} ξ_{ij}

Subject to:
    R_τ(u, n_i) > R_τ(u, n_j) + 1 − ξ_{ij}
        for all τ, n_i ∈ A, n_j ∈ A such that
        τ(n_i) ≤ τ, τ(n_j) ≤ τ, c(n_i) < c(n_j)
    ξ_{ij} ≥ 0

where ω = (α, β, γ) is the vector of ranking parameters and the ξ_{ij}
are slack variables.
As shown by Joachims (2002), this optimization problem can be solved
via a classification SVM. In the following sections, we show how to generate
the training set of the SVM classifier so as to keep a reasonably low
number of training instances, which speeds up convergence and supports
the scalability of the solution.
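The reduction of Joachims (2002) turns each ordering constraint into a binary classification instance on the difference of the two feature vectors. A minimal sketch, with hypothetical per-article feature vectors:

```python
def pairwise_instances(constraints, features):
    """Turn ranking constraints (i preferred over j) into binary
    classification instances on feature-vector differences."""
    X, y = [], []
    for i, j in constraints:
        diff = [a - b for a, b in zip(features[i], features[j])]
        X.append(diff)
        y.append(1)                    # i should outrank j
        X.append([-d for d in diff])
        y.append(-1)                   # mirrored instance
    return X, y

# Hypothetical (content, social, popularity) scores per article.
feats = {"n1": [0.9, 0.2, 0.5], "n2": [0.1, 0.4, 0.3]}
X, y = pairwise_instances([("n1", "n2")], feats)
```

A linear classifier trained on these difference vectors yields a weight vector whose components play the role of α, β, γ in the ranking function.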
5.5.1 Constraint selection
The formulation of Problem 4 includes a potentially huge number of
constraints. Every click occurring at time τ on a news article n generates
a new constraint involving n and every other non-clicked news article
published before time τ. Such a large number of constraints also includes
relationships on stale news articles, e.g., a non-clicked news article
published weeks or months before τ. Clicks are the signals driving the
learning process, so it is important to select pairs of clicked news articles
in a way that eliminates biases such as the one caused by stale news
articles. Clearly, the more constraints are taken into consideration during
the learning process, the more robust the final model is. On the other
hand, increasing the number of constraints affects the complexity of the
minimization algorithm. We propose the following strategy to select only
the most interesting constraints and thus simplify the optimization problem.
First, we evaluate the ranking function only at the time instants when
a click happens. This selection does not actually change the set of
constraints of Problem 4, but it helps in the generation of the constraints
by focusing on specific time instants. If the user u clicks on the news
article n_i at time c(n_i), then the news article n_i must be the most
relevant at that time, so the following condition must hold:

    R_{c(n_i)}(u, n_i) > R_{c(n_i)}(u, n_j) + 1 − ξ_{ij,c(n_i)}
        for all n_j ∈ A such that τ(n_j) ≤ c(n_i).    (5.2)

Whenever a click occurs at time c(n_i), we add a set of constraints to
the ranking function such that the clicked news article gets the largest
score among all news articles published before time c(n_i).
Second, we restrict the number of news articles to be compared with
the clicked one. If a news article was published a long time before the
click, it can safely be filtered out, as it would not help the learning
process. As we have shown in Section 5.3.1, users lose interest in news
articles after a time interval δ of two days. We can incorporate this
threshold into Equation (5.2) as follows:

    R_{c(n_i)}(u, n_i) > R_{c(n_i)}(u, n_j) + 1 − ξ_{ij,c(n_i)}
        for all n_j ∈ A[c(n_i) − δ, c(n_i)],    (5.3)

where A[c(n_i) − δ, c(n_i)] is the set of news articles published between
time c(n_i) − δ and c(n_i).
Finally, we further restrict A[c(n_i) − δ, c(n_i)] by considering only the
articles that are relevant according to at least one of the three score
components. Let Top(k, σ, τ_a, τ_b) be the set of k news articles
with largest rank in the set A[τ_a, τ_b] according to the score component σ.
We include in the optimization Problem 4 only the constraints:

    R_{c(n_i)}(u, n_i) > R_{c(n_i)}(u, n_j) + 1 − ξ_{ij,c(n_i)}
        for all n_j such that n_j ∈ Top(k, σ, c(n_i) − δ, c(n_i))
        for at least one of the three score components σ.    (5.4)
By setting k = 10 we are able to reduce the number of constraints from
more than 25 million to approximately 250 thousand, thus significantly
reducing the training time of the learning algorithm.
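The selection step above can be sketched as follows. The article layout and the per-component score functions are hypothetical; the thesis uses the content, social, and popularity components of the ranking function.

```python
import heapq

def select_candidates(articles, components, click_time, delta, k):
    """Keep only articles published in [click_time - delta, click_time]
    that rank in the top-k by at least one score component."""
    window = [a for a in articles
              if click_time - delta <= a["published"] <= click_time]
    keep = set()
    for score in components.values():
        top = heapq.nlargest(k, window, key=score)
        keep.update(a["id"] for a in top)
    return keep

arts = [{"id": "n1", "published": 5}, {"id": "n2", "published": 9},
        {"id": "n3", "published": 1}]
comps = {"content": lambda a: a["published"],
         "social": lambda a: -a["published"],
         "popularity": lambda a: 0.0}
chosen = select_candidates(arts, comps, click_time=10, delta=6, k=1)
```

Each click then contributes at most 3k comparison articles instead of the whole recency window, which is where the reduction from millions of constraints to hundreds of thousands comes from.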
5.5.2 Additional features
The system obtained by learning the model parameters with the SVM-Rank
method is named T.REX, and forms our main recommender. Additionally,
we attempt to improve the accuracy of T.REX by incorporating
more features. To build this improved recommender we use exactly
the same learning framework: we collect more features from the same
training dataset and learn their importance using the SVM-Rank
algorithm. We choose three additional features, age, hotness, and click count,
which we describe below.
The age of a news article n is the time elapsed between the current
time τ and the time n was published, that is, τ − τ(n).
The hotness of a news article is a set of features extracted from the
vectors H_T and H_N, which keep the popularity counts for the entities in
our model. For each news article n we compute the average and standard
deviation of the vectors H_T and H_N over all entities extracted from
article n. This process gives us four hotness features per news article.
Finally, the click count of a news article is simply the number of times
that the article has been clicked by any user in the system up to time τ.
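Computing these extra features for one article can be sketched as below. The field names and the H_T/H_N lookups are illustrative assumptions about the data layout.

```python
import statistics

def extra_features(article, now, h_t, h_n, clicks):
    """Age, four hotness features (mean and population std of H_T and
    H_N over the article's entities), and click count for one article."""
    ents = article["entities"]
    feats = {"age": now - article["published"],
             "clicks": clicks.get(article["id"], 0)}
    for name, vec in (("H_T", h_t), ("H_N", h_n)):
        vals = [vec.get(e, 0.0) for e in ents]  # missing entity -> 0
        feats[name + "_mean"] = statistics.mean(vals)
        feats[name + "_std"] = statistics.pstdev(vals)
    return feats

art = {"id": "n1", "published": 10, "entities": ["a", "b"]}
feats = extra_features(art, now=15, h_t={"a": 2.0, "b": 4.0},
                       h_n={"a": 1.0}, clicks={"n1": 3})
```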
The system that we obtain by training our ranking function on these
additional features is called T.REX+.
5.6 Experimental evaluation
5.6.1 Datasets
Figure 5.8: Distribution of entities in Twitter.
To build our recommendation system we need the following sources
of information: the Twitter stream, the news stream, the social network of
users, and click-through data. We extract this information from three
different data sources: Twitter, Yahoo! news, and Yahoo! toolbar,
respectively.

Figure 5.9: Distribution of entities in news.
Twitter: We obtain tweets from Twitter's public API by crawling user
timelines. We collect all tweets posted during May 2011 by our 3,214
target users (identified as described below). We also collect a random
sample of tweets to track entity popularity across all of Twitter. We extract
entities from tweets using the Spectrum method described by Paranjpe
(2009). Overall we obtain about 1 million tweets in English for which we
are able to extract entities. Figure 5.8 shows the distribution of entities in
the Twitter dataset; the curve has a truncated power-law shape.
Yahoo! news: We collect all news articles aggregated on the English site of
Yahoo! news during May 2011. From the news articles we extract entities,
again using the Spectrum algorithm. Overall we have about 28.5
million news articles, from which we keep only the articles that contain
at least one entity appearing in one of the tweets. In total we obtain about
40 thousand news articles. Figure 5.9 shows the distribution of entities
in the Yahoo! news dataset. The shape of the distribution is similar to that
of Twitter, a truncated power law.
Yahoo! toolbar: We collect click information from the Yahoo! toolbar
logs. The Yahoo! toolbar is a browser add-on that provides a set of
browsing functionalities. The logs contain information about the browsing
behavior of the users who have installed the toolbar, such as user cookie
id, URL, referral URL, event type, and so on. We collect all toolbar data
from May 2011, and use the first 80% of the toolbar clicks, in chronological
order, to train our system.
Using a simple heuristic, we identify a small set of users for whom we
can link their toolbar cookie id with their Twitter user id. The heuristic
is to identify which Twitter account a user visits most often, discarding
celebrity accounts and non-bijective mappings. The underlying
assumption is that users visit their own accounts most often. In total we
identify a set U_0 of 3,214 test users. The dataset used in this work is the
projection of all the data sources on the users of the set U_0; that is, for all
the users in U_0 we collect all their tweets and all their clicks on Yahoo!
news. Additionally, by employing snowball sampling on Twitter we obtain
the set of users U_1 who are followed by the users in U_0. Then, by
collecting all the tweets of the users in U_1, we form the social component
for our set of users of interest U_0.
5.6.2 Test set
Evaluating a recommendation strategy is a complex task. The ideal
evaluation method is to deploy a live system and gather click-through
statistics. While such a deployment gives the best accuracy, it is also very
expensive and not always feasible.

User studies are a commonly used alternative evaluation method.
However, getting judgements from human experts does not scale well
to a large number of recommendations. Furthermore, these judgements
are often biased because of the small sample sizes. Finally, user studies
cannot be automated, and they are impractical if more than one strategy
or many parameters need to be tested.
For these reasons, we propose an automated method to evaluate our
recommendation algorithm. The proposed evaluation method exploits
the available click data collected by the Yahoo! toolbar, and it is similar
in spirit to the learning process. Given a stream of news and a stream
of tweets for the user, we identify an event in which a user clicks on a
news article n_i at time c(n_i). Suppose the user just logged into the
system at time τ. Then the recommendation strategy should rank the
news article n_i as high as possible.

To create a test instance, we collect all news articles that have been
published within the time interval [τ − δ, τ], and we then examine
what is the ranking of the article n_i in the list.

We use the last 20%, in chronological order, of the Yahoo! toolbar log
to test the T.REX and T.REX+ algorithms, as well as all the baselines.
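The core of each test instance is finding the position of the clicked article in the ranked candidate list. A minimal sketch; the ranker interface, field names, and δ value are illustrative assumptions.

```python
def rank_of_clicked(ranker, pool, clicked_id, click_time, delta):
    """Rank the articles published in [click_time - delta, click_time]
    and return the 1-based position of the clicked article."""
    candidates = [a for a in pool
                  if click_time - delta <= a["published"] <= click_time]
    ordered = sorted(candidates, key=ranker, reverse=True)
    return 1 + [a["id"] for a in ordered].index(clicked_id)

pool = [{"id": "n1", "published": 1, "score": 0.2},
        {"id": "n2", "published": 4, "score": 0.9},
        {"id": "n3", "published": 5, "score": 0.5}]
pos = rank_of_clicked(lambda a: a["score"], pool, "n3",
                      click_time=6, delta=5)
```

Averaging a function of this position over all test instances yields a single quality score per strategy, which is what makes the evaluation fully automatic.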
5.6.3 Evaluation measures
Evaluating a ranking of items is a standard problem in information
retrieval. In our setting we have only one correct answer per ranking:
the news article n_i clicked by the user. We therefore measure the mean
reciprocal rank (MRR) of the clicked article (Voorhees, 1999):

    MRR = (1/|T|) Σ_{i=1}^{|T|} 1/r(n_i),

where r(n_i) is the rank of the clicked article n_i in the list produced for
the i-th test instance, and T is the set of test instances.

Topics of Interest on Twitter: A First Look. In AND 10: 4th Workshop on Analytics for
Noisy Unstructured Text Data, pages 73–80, 2010. 106
Leonardo Neumeyer, Bruce Robbins, A. Nair, and A. Kesari. S4: Distributed
Stream Computing Platform. In ICDMW 10: 10th International Conference on
Data Mining Workshops, pages 170–177. IEEE, 2010. 15, 23, 31

Michael G. Noll and Christoph Meinel. Building a Scalable Collaborative Web
Filter with Free and Open Source Software. In SITIS 08: 4th IEEE International
Conference on Signal Image Technology and Internet Based Systems, pages 563–571.
IEEE Computer Society, November 2008. 36

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew
Tomkins. Pig Latin: A not-so-foreign language for data processing. In
SIGMOD 08: 34th International Conference on Management of Data, pages 1099–1110.
ACM, June 2008. 15, 24

Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and
Christina Lioma. Terrier: A High Performance and Scalable Information Retrieval
Platform. In OSIR 06: 2nd International Workshop on Open Source
Information Retrieval, pages 18–25, 2006. 34

Rasmus Pagh and Charalampos E. Tsourakakis. Colorful Triangle Counting and
a MapReduce Implementation. Arxiv, pages 1–9, 2011. 35

Alessandro Panconesi and Mauro Sozio. Fast primal-dual distributed algorithms
for scheduling and matching problems. Distributed Computing, 22(4):269–283,
March 2010. ISSN 0178-2770. 69, 70, 75, 81
Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu
Vyas. Web-Scale Distributional Similarity and Entity Set Expansion. In EMNLP
09: 2009 Conference on Empirical Methods in Natural Language Processing, page
938. Association for Computational Linguistics, 2009. ISBN 9781932432626. 34

Spiros Papadimitriou and Jimeng Sun. DisCo: Distributed Co-clustering with
Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. In
ICDM 08: 8th International Conference on Data Mining, pages 512–521. IEEE
Computer Society, December 2008. 35

Deepa Paranjpe. Learning document aboutness from implicit user feedback and
document structure. In CIKM 09: 18th Conference on Information and Knowledge
Mining, pages 365–374, New York, New York, USA, 2009. ACM Press.
ISBN 9781605585123. 106, 110, 124

M. Penn and M. Tennenholtz. Constrained multi-object auctions and b-matching.
Information Processing Letters, 75(1-2):29–34, 2000. 71

Owen Phelan, K. McCarthy, Mike Bennett, and Barry Smyth. Terms of a Feather:
Content-Based News Recommendation and Discovery Using Twitter. Advances
in Information Retrieval, 6611(07):448–459, 2011. 105

Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the
data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298,
October 2005. 23

Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets. Stanford
University, 2010. 24, 35

Benjamin Reed and Flavio P. Junqueira. A simple totally ordered broadcast
protocol. In LADIS 08: 2nd Workshop on Large-Scale Distributed Systems and
Middleware, pages 2:1–2:6, New York, NY, USA, September 2008. ACM. 21

Jennifer Rowley. The wisdom hierarchy: representations of the DIKW hierarchy.
Journal of Information Science, 33(2):163–180, April 2007. 4

Sunita Sarawagi and Alok Kirpal. Efficient set joins on similarity predicates. In
SIGMOD 04: 30th International Conference on Management of Data, pages 743–754.
ACM, 2004. 42

M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce.
Bioinformatics, 25(11):1363, June 2009. 36

Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim,
and Seungryoul Maeng. HAMA: An Efficient Matrix Computation with the
MapReduce Framework. In CloudCom 10: 2nd International Conference on Cloud
Computing Technology and Science, pages 721–726. IEEE, November 2010. ISBN
978-1-4244-9405-7. 23
Y. Shi. Reevaluating Amdahl's Law and Gustafson's Law. October 1996. URL
http://www.cis.temple.edu/~shi/docs/amdahl/amdahl.html. 12

M. Stonebraker, C. Bear, U. Cetintemel, M. Cherniack, T. Ge, N. Hachem, S.
Harizopoulos, J. Lifter, J. Rogers, and S. Zdonik. One size fits all? Part 2:
Benchmarking results. In CIDR 07: 3rd Conference on Innovative Data Systems
Research, January 2007a. 10

M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland.
The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB
07: 33rd International Conference on Very Large Data Bases, pages 1150–1160.
ACM, September 2007b. 10

Michael Stonebraker. The Case for Shared Nothing. IEEE Data Engineering
Bulletin, 9(1):4–9, March 1986. 11

Michael Stonebraker and Uğur Çetintemel. "One Size Fits All": An Idea Whose
Time Has Come and Gone. In ICDE 05: 21st International Conference on Data
Engineering, pages 2–11. IEEE Computer Society, April 2005. ISBN 0-7695-2285-8. 10

Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson,
Andrew Pavlo, and Alexander Rasin. MapReduce and Parallel DBMSs:
Friends or Foes? Communications of the ACM, 53(1):64–71, January 2010. 13

Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the
last reducer. In WWW 11: 20th International Conference on World Wide Web,
pages 607–614. ACM, 2011. ISBN 9781450306324. 35

Jaime Teevan, Daniel Ramage, and Meredith Ringel Morris. #TwitterSearch:
A Comparison of Microblog Search and Web Search. In WSDM 11: 4th
International Conference on Web Search and Data Mining, pages 35–44, 2011. 104

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A
Warehousing Solution Over a Map-Reduce Framework. VLDB Endowment, 2
(2):1626–1629, August 2009. 24

Alvin Toffler. Future shock. Random House Publishing Group, 1984. 99

Charalampos E. Tsourakakis, U Kang, Gary L. Miller, and Christos Faloutsos.
DOULION: Counting Triangles in Massive Graphs with a Coin. In KDD 09:
15th International Conference on Knowledge Discovery and Data Mining, pages
837–846. ACM, April 2009. 35
Leslie G. Valiant. A bridging model for parallel computation. Communications of
the ACM, 33(8):103–111, August 1990. ISSN 0001-0782. 23

Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins
using MapReduce. In SIGMOD 10: 36th International Conference on Management
of Data, pages 495–506, New York, New York, USA, 2010. ACM Press.
ISBN 9781450300322. 39, 44

Werner Vogels. Eventually Consistent. ACM Queue, 6(6):14–19, October 2008.
ISSN 1542-7730. 22

Ellen M. Voorhees. The TREC-8 Question Answering Track Report. In TREC 99:
8th Text REtrieval Conference, 1999. 126

Mirjam Wattenhofer and Roger Wattenhofer. Distributed Weighted Matching. In
Distributed Computing, pages 335–348, 2004. 71

Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins
for near duplicate detection. In WWW 08: 17th International Conference on
World Wide Web, pages 131–140. ACM, 2008. 42, 45

Chuan Xiao, Wei Wang, Xuemin Lin, and Haichuan Shang. Top-k Set Similarity
Joins. In ICDE 09: 25th International Conference on Data Engineering, pages
916–927. IEEE Computer Society, 2009. 40

Yahoo! and Facebook. Yahoo! Research Small World Experiment, 2011. URL
http://smallworld.sandbox.yahoo.com/. 7

H. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified
relational data processing on large clusters. In SIGMOD 07: 33rd International
Conference on Management of Data, pages 1029–1040. ACM, June 2007. 28

J. H. Yoon and S. R. Kim. Improved Sampling for Triangle Counting with
MapReduce. In ICHIT 11: 5th International Conference on Convergence and Hybrid
Information Technology, pages 685–689. Springer, 2011. 35

Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey.
DryadLINQ: A system for general-purpose distributed data-parallel computing
using a high-level language. In OSDI 08: 8th Symposium on Operating
System Design and Implementation, December 2008. 24

Bin Zhou, Daxin Jiang, Jian Pei, and Hang Li. OLAP on search logs: an infrastructure
supporting data-driven applications in search engines. In KDD 09: 15th
International Conference on Knowledge Discovery and Data Mining, pages
1395–1404. ACM, June 2009. 34
Unless otherwise expressly stated, all original material of whatever
nature created by Gianmarco De Francisci Morales and included in
this thesis is licensed under a Creative Commons Attribution
Noncommercial Share Alike 3.0 License.

See creativecommons.org/licenses/by-nc-sa/3.0 for the legal code
of the full license.

Ask the author about other uses.