Distributed data mining

School of something
Computing
FACULTY OF ENGINEERING
OTHER

Distributed Data Mining for User Sensemaking
in Online Collaborative Spaces

Submitted to:
DicoSyn2012 Workshop @ CSCW’12

Presented By: Ahmad Ammari
RF in User & Community Modelling

OUTLINE
• The Big Data “Problem” in Online Collaborative Spaces
• What is User Sensemaking and How is Big Data Affecting It?
• Can Distributed Data Mining Help?
• What is Hadoop & Map / Reduce?
• What is Mahout?
• Proposed Approach to support User Sensemaking in OCS
• Content Pre-Processing
• Content Clustering
• Topic Modelling
• Case Study: Making Sense of Online Forums
• How are Discussions currently Organized? Clusters vs. Categories
• Which Content to Mine? Mining the Right Discussion Parts
• How Can This Help Sensemaking? Some Usage Scenarios

How “Big” is Big Data?
• Emails
• 90 Trillion – The Number of Emails Sent on the Internet in 2009
• 107 Trillion – The Number of Emails Sent on the Internet in 2010
• Websites
• 234 Million – The Number of Websites by Dec 2009
• 255 Million – The Number of Websites by Dec 2010
• Social Media
• 152 Million – The Number of Blogs on the Internet in 2010
• 25 Billion – The Number of sent Tweets on Twitter in 2010
• Multi Media
• 5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
• 2 Billion – The Number of Videos Watched per Day on YouTube


What about Online CS?

They are Big Too!
Top 10 biggest Internet forums


What about Online CS?

They are Big Too!
Stack Exchange Family of Forums


Why is it a Problem?

Where should I post my
programming question to
get relevant replies?



Where to find a solution
to my MS Outlook Problem?



What are the actual discussions
really about?

I cannot make sense
of Big Content!


Why is Making Sense of Big Data
neither Easy nor Fast?
• Because it’s Big and still increasing!
• Because it’s Diverse!
• Stack Exchange Suite of Forums has more than
50 Different Technical Discussion Forums
• WebProWorld Technical Forums has more than
40 Discussion Categories
• Because it’s Dynamic!
• 294 Billion – The Average Number of Email Messages per Day
• 21.4 Million – The Number of Added Websites in 2010
• 96,101 New Blogs in the last 24 hours (8th Dec 2011)
• 190 Million – The Number of Tweets per day
in June 2011
• Because it’s Noisy!
• 200 billion – The number of spam emails per day in 2009
• 262 billion – The number of spam emails per day in 2010

But What is “Sensemaking”?!
• Creating a representation of a collection of information [Russell et al, 1993]
• Focused on the context of understanding large document collections. [Paul et al, 2011]
• Transforming Information into Knowledge [Pirolli & Card, 2005]
• Seeking, filtering, searching for relations, extracting, schematizing
• Understanding connections among people, places, and events [Klein et al, 2006]


Our Solution!

Large-Scale Data Processing
Quick Data Processing
Scalable Data Processing
Robust Data Processing

Knowledge Discovery in Big Content
Analysis of Unstructured Data
Machine Intelligence to Support Humans


What is Hadoop?
• A framework for storing and processing big data on lots of commodity
machines
• Up to 4,000 machines in a cluster
• Up to 20 PB in a cluster
• Open Source Apache project
• Implemented in Java
• Contains Many Sub-Projects:
• Map/Reduce – Software Framework for Distributed Processing of Large Datasets
• HDFS – Hadoop Distributed File System
• Hadoop Common – Provides Access to the File Systems Supported by Hadoop
• Chukwa – Data Collection System for Managing Large Distributed Systems
• HBase – Scalable, Distributed Database that Supports Structured Data Storage
• Hive – Data Warehouse Infrastructure that provides Data Summarization & Ad Hoc Querying
• Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
• ZooKeeper – High-Performance Coordination Service for Distributed Applications

We focused on distributed
computation with Map/Reduce

Why Do They Use Hadoop?

Hadoop Map/Reduce

• Simply: A parallel programming model and an associated
implementation
• Abstract model: hides many system-level details from the
programmer
• Move-code-to-data philosophy: computation on a piece of data takes
place on the same machine where that piece resides
• Map/Reduce Job runs in Phases, each Phase runs in Parallel
across all Nodes in the Hadoop Cluster
• Main Phases: Mapping, Reducing
• Are there Other Phases? Yes!
• Shuffling & Sorting, Combining, Partitioning
• But the programmer writes only the “Mapper” and “Reducer” functions!


Hadoop Map/Reduce

More formally,
• Map(k1, v1) → list(k2, v2)
• Shuffle & Sort(list(k2, v2)) → (k2, list(v2))
• Reduce(k2, list(v2)) → list(k3, v3)
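
The three signatures above can be illustrated with a small in-process word count. This is a sketch only: it runs on one machine in plain Python rather than on a Hadoop cluster, and the names `map_fn`, `shuffle_and_sort`, `reduce_fn`, `run_job` and the sample posts are hypothetical, not part of the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle_and_sort(pairs):
    """Group all values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts for one word."""
    return [(word, sum(counts))]


def run_job(documents):
    """Run the three phases one after another; a real cluster runs each in parallel."""
    mapped = chain.from_iterable(map_fn(k, v) for k, v in documents.items())
    grouped = shuffle_and_sort(mapped)
    return list(chain.from_iterable(reduce_fn(k, vs) for k, vs in grouped))


# Hypothetical input: two short forum posts keyed by an invented id.
posts = {
    "post-1": "Hadoop runs map tasks",
    "post-2": "Mahout runs on Hadoop",
}
word_counts = run_job(posts)
```

Only `map_fn` and `reduce_fn` carry application logic here; the grouping step stands in for what the framework does between the two phases.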



What is Mahout?
• Open source machine learning library from Apache
• Began life in 2008 as a subproject of Apache’s Lucene Search Engine
• In 2009 absorbed the Taste open source collaborative filtering project
• In 2010 became a stand-alone Project
• Written in Java
• ML algorithms mainly for
• Recommender Engines (CF-based)
• Clustering
• Classification
• Pre-Processing algorithms for Unstructured Data
• Scalability is achieved by Map/Reduce Implementations of ML Algorithms

We focused on Mahout Clustering
and Pre-Processing
Implementations in Map/Reduce

Sensemaking-Support with DDM

• INPUT: Collaboration Content (Discussions)
• Content Pre-Processing: Prepare Content for Mining
• Content Clustering: Derive Groups of Similar Content
• Topic Modelling: Identify Fine-Grained Topics and Generate Topic Clouds
• OUTPUT: Topic Clouds

Content Pre-Processing

• Apache Lucene Text Analysis
• Tokenization, Non-Letter Removal, Lowercasing, Stop Word Removal
• TF-IDF Weighting: Computing Numerical Weights for Content Terms
• n-gram Collocations
• Multi-Term Phrases having high probability of occurring together
• Examples: “Social Media”, “Data Mining”, “Machine Learning”
• Normalization
• decreasing the magnitude of large document vectors & increasing the magnitude
of small ones
• p-norm
• p depends on similarity measure used
• With Text Content, the best similarity measures are Euclidean & Cosine → p = 2
• Example: the 2-norm of a 3-dimensional vector, [x, y, z], is $\sqrt{x^2 + y^2 + z^2}$
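
A minimal sketch of the pre-processing steps above in plain Python. It does not use Lucene or Mahout. The stop-word list, the textbook weight tf × log(N / df), and the function names are assumptions made for illustration, and the weighting that Lucene and Mahout actually apply may differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "in", "is", "my", "of", "the", "to"}


def analyze(text):
    """Tokenize on non-letters, lowercase, and drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [analyze(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return vectors


def p_normalize(vector, p=2):
    """Divide every weight by the vector's p-norm (p = 2 gives unit Euclidean length)."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vector)  # an all-zero vector is returned unchanged
    return {t: w / norm for t, w in vector.items()}


# Hypothetical discussion titles.
titles = [
    "Outlook file is corrupted",
    "Backup of the hard disk",
    "Hard disk error in Outlook",
]
normalized = [p_normalize(v) for v in tfidf_vectors(titles)]
```

With this weighting, a term that occurs in every document gets weight 0, so it drops out of any distance computation between documents.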

Content Clustering

Discovering Clusters of “similar” Points

Figure: EM algorithm fitting a 2-component Gaussian mixture model
to the Old Faithful Geyser dataset
http://bit.ly/oldfaithful


K-Means Clustering

Map/Reduce Implementation in Mahout
1. Start with three random points as centroids
2. Map stage: assign each point to the cluster nearest to it
3. Reduce stage: the associated points are averaged to produce the new location of the centroid
4. After each iteration, the final configuration is fed back into the same loop until the centroids come to rest at their final positions
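
The four steps can be sketched in a few lines of single-machine Python. This is not Mahout's implementation: the function names, the fixed starting centroids (used in place of random ones so the run is repeatable), and the stopping tolerance are assumptions for illustration.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_stage(points, centroids):
    """Emit (cluster index, point) pairs: each point goes to its nearest centroid."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_stage(assignments, centroids):
    """Average the points of each cluster to get the new centroid locations."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    new_centroids = list(centroids)  # a centroid with no points stays where it is
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = reduce_stage(map_stage(points, centroids), centroids)
        moved = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if moved <= tol:
            break
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (1.0, 9.0), (2.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]  # fixed here instead of random
final_centroids = kmeans(points, initial)
```

In a Map/Reduce setting each iteration would be one job, with the centroids from the previous job passed in as input to the next.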

Canopy Clustering
• Fast approximate clustering technique
• Divides the input set of points into overlapping clusters known as canopies
• In Mahout, it is used to estimate the approximate cluster centroids (or canopy
centroids) using two distance thresholds, T1 and T2, with T1 > T2
1. Start with a point and mark it as part of a canopy
2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies
3. The points within the outer circle (distance T1) are also put in the same canopy, but they’re allowed to be part of other canopies. The assignment process is done in a single pass on a mapper.
4. The reducer computes the average of the centroid and merges close canopies
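
A single-machine sketch of the canopy assignment described above, assuming Euclidean distance. The function names, sample points, and threshold values are hypothetical. The reducer step of merging close canopies is left out, so this shows only the assignment pass and the averaging of each canopy.

```python
import math


def canopy_clusters(points, t1, t2):
    """Group points into overlapping canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # step 1: pick a point to start a canopy
        members = [center]
        still_available = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)  # step 3: inside the outer circle
            if distance >= t2:
                still_available.append(point)  # step 2: only points beyond t2 stay
        remaining = still_available
        canopies.append(members)
    return canopies


def centroid(members):
    """Average the member points of one canopy."""
    return tuple(sum(c) / len(members) for c in zip(*members))


points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
canopies = canopy_clusters(points, t1=3.0, t2=1.0)
canopy_centroids = [centroid(m) for m in canopies]
```

In this sample the point (2.0, 0.0) lies between T2 and T1 of the first canopy, so it joins that canopy and also starts one of its own, which is the overlap the slide describes.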

Sensemaking in Online Forums

• Illustration of the Approach to support user sensemaking in Online Forums
• Content Collection from WebProWorld Technical Forums
• Large Forum (1000s of Discussion Threads)
• Discussions are Organized into Categories (Subforums) Defined by Forum
Designers
• Four subforums were chosen for the experiment:
• Two subforums representing fairly specialized categories – SEO (Search
Engine Optimization) and e-Commerce
• Two subforums representing broad categories – IT and Computer
Assistance
• Objectives for the experiment
• Investigate the extent of sensemaking support needed for the public
technical forum
• Determine which content representation for clustering is more appropriate
to derive topic clouds for the sensemaker
• Illustrate how the output of the approach could provide sensemaking
support

Clusters vs Categories

Figure (left): Distribution of Four Categories in Four Mahout-based Clusters by Title
Figure (right): Distribution of Four Categories in Four Mahout-based Clusters by Title and First Post


Content Representation

• The smaller the average DBI, the better the model is for achieving a coherent set of similar discussions.
• Clustering models having item distribution values closer to 1.0 will derive minor distinct clusters with topic-specific discussions.
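
To make the first point concrete, here is a sketch of one common cluster validity score. It assumes that DBI stands for the Davies-Bouldin Index, which the slide does not spell out, and it uses Euclidean distance on toy 2-D points. The function names and sample clusters are hypothetical, and the code needs at least two clusters with distinct centroids.

```python
import math


def centroid(points):
    """Average the points of one cluster."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin_index(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation_ij ratio."""
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its own cluster centroid.
    scatter = [
        sum(math.dist(p, centre) for p in members) / len(members)
        for members, centre in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / k


# Two toy clusterings: compact and far apart versus spread out and close together.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 6.0)], [(4.0, 0.0), (4.0, 6.0)]]
dbi_tight = davies_bouldin_index(tight)
dbi_loose = davies_bouldin_index(loose)
```

The compact, well-separated clustering scores lower than the spread-out one, which matches the reading that a smaller value indicates more coherent clusters.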


Example Topic Clouds

Enabled Discovery of Topic-Specific Discussions not
Obvious in Category Names:
• Disk & Keyboard Problems
• Security Issues
• Hard Disk Backup
• MS Outlook File Problems
• Certificates and Skills in Web Design
• Photo features in social networks (Facebook)
• Optimizing Search Engines for Blog Search
• Design of Data Warehousing Systems


Cross Validated Statistics Forum


Conclusion

• Big Data creates a Big Challenge to sensemaking in Online
Collaborative Spaces
• Distributed Data Mining with Hadoop Map/Reduce and Mahout is
exploited to support user sensemaking by summarizing the huge
content found in Large-scale Discussion Forums
• Cluster Analysis shows that Different User-created Categories may
contain similar Collaborative Content, making it difficult for users
to find the content that addresses their problems / interests
• Clustering of content represented by titles produces more coherent
clusters with more ability to uncover fine-grained discussions that are
buried in the huge amount of content
• Mahout is not currently perfect!
• Lack of Clustering Validity Measures
• Lack of Dimension Reduction Algorithms (e.g. LSI) important to
improve clustering results
• Lack of GUI Support


Thank You

Ahmad Ammari
A.Ammari@leeds.ac.uk
