This document outlines a proposed approach to use distributed data mining techniques to help users make sense of large amounts of content in online collaborative spaces. It discusses how "big data" is affecting users' ability to understand discussions. The approach involves preprocessing content, clustering it using Hadoop and Mahout, and generating topic clouds. A case study clusters content from technical forums and finds topic-specific discussions not obvious from category names. The conclusion is that distributed data mining can help summarize huge online discussions and uncover buried topics to support user sensemaking.
Distributed data mining
1. School of Computing
FACULTY OF ENGINEERING
Distributed Data Mining for User Sensemaking
in Online Collaborative Spaces
Submitted to:
DicoSyn2012 Workshop @ CSCW’12
Presented By: Ahmad Ammari
RF in User & Community Modelling
2. OUTLINE
• The Big Data “Problem” in Online Collaborative Spaces
• What is User Sensemaking and How is Big Data Affecting it?
• Can Distributed Data Mining Help?
• What is Hadoop & Map / Reduce?
• What is Mahout?
• Proposed Approach to support User Sensemaking in OCS
• Content Pre-Processing
• Content Clustering
• Topic Modelling
• Case Study: Making Sense of Online Forums
• How are Discussions currently Organized? Clusters vs. Categories
• Which Content to Mine? Mining the Right Discussion Parts
• How Can This Help Sensemaking? Some Usage Scenarios
3. How “Big” is Big Data?
• Emails
• 90 Trillion – The Number of Emails Sent on the Internet in 2009
• 107 Trillion – The Number of Emails Sent on the Internet in 2010
• Websites
• 234 Million – The Number of Websites by Dec 2009
• 255 Million – The Number of Websites by Dec 2010
• Social Media
• 152 Million – The Number of Blogs on the Internet in 2010
• 25 Billion – The Number of Tweets Sent on Twitter in 2010
• Multimedia
• 5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
• 2 Billion – The Number of Videos Watched per Day on YouTube
4. What about Online Collaborative Spaces?
They are Big Too!
Top 10 biggest Internet forums
5. What about Online Collaborative Spaces?
They are Big Too!
Stack Exchange Family of Forums
6. Why is it a Problem?
Where should I post my programming question to get relevant replies?
7. Why is it a Problem?
Where can I find a solution to my MS Outlook problem?
8. Why is it a Problem?
What are the actual discussions really about?
I cannot make sense of Big Content!
9. Why is Making Sense of Big Data neither Easy nor Fast?
• Because it’s Big and still increasing!
• Because it’s Diverse!
• Stack Exchange Suite of Forums has more than 50 Different Technical Discussion Forums
• WebProWorld Technical Forums has more than 40 Discussion Categories
• Because it’s Dynamic!
• 294 Billion – The Average Number of Email Messages per Day
• 21.4 Million – The Number of Added Websites in 2010
• 96,101 New Blogs in last 24 hours (8th Dec 2011)
• 190 Million – The Number of Tweets per day in June 2011
• Because it’s Noisy!
• 200 billion – The number of spam emails per day in 2009
• 262 billion – The number of spam emails per day in 2010
10. But What is “Sensemaking”?!
• Creating a representation of a collection of information [Russell et al, 1993]
• Focused on the context of understanding large document collections. [Paul et al, 2011]
• Transforming Information into Knowledge [Pirolli & Card, 2005]
• Seeking, filtering, searching for relations, extracting, schematizing
• Understanding connections among people, places, and events [Klein et al, 2006]
11. Our Solution!
• Large-Scale Data Processing
• Quick Data Processing
• Scalable Data Processing
• Robust Data Processing
• Knowledge Discovery in Big Content
• Analysis of Unstructured Data
• Machine Intelligence to Support Humans
12. What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines
• Up to 4,000 machines in a cluster
• Up to 20 PB in a cluster
• Open Source Apache project
• Implemented in Java
• Contains Many Sub-Projects:
• Map/Reduce – Software Framework for Distributed Processing of Large Datasets
• HDFS – Hadoop Distributed File System
• Hadoop Common – Provides Access to the File Systems Supported by Hadoop
• Chukwa – Data Collection System for Managing Large Distributed Systems
• HBase – Scalable, Distributed Database that Supports Structured Data Storage
• Hive – Data Warehouse Infrastructure that provides Data Summarization & Ad Hoc Querying
• Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
• ZooKeeper – High-Performance Coordination Service for Distributed Applications
We focused on distributed computation with Map/Reduce
15. Hadoop Map/Reduce
• Simply: A parallel programming model and an associated implementation
• Abstract model: hides many system-level details from the programmer
• Move-code-to-data philosophy: computation on a piece of data takes place on the same machine where that piece resides
• A Map/Reduce Job runs in Phases; each Phase runs in Parallel across all Nodes in the Hadoop Cluster
• Main Phases: Mapping, Reducing
• Are there Other Phases? Yes!
• Shuffling & Sorting, Combining, Partitioning
• But the Programmer writes only the “Mapper” and “Reducer” functions!
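The split between programmer-written functions and framework-run phases can be illustrated with a small single-process simulation in Python. This is only a sketch of the programming model, not Hadoop's Java API: the names `mapper`, `reducer`, and `run_job` and the two sample posts are invented for illustration, and the shuffle/sort step that Hadoop performs across nodes is imitated here with a dictionary.

```python
from collections import defaultdict

def mapper(doc_id, text):
    """Map phase (user-written): emit a (term, 1) pair for every token."""
    for token in text.lower().split():
        yield token, 1

def reducer(term, counts):
    """Reduce phase (user-written): sum the counts grouped under one term."""
    return term, sum(counts)

def run_job(documents):
    """Stand-in for the framework: map, then shuffle/sort, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)  # shuffle: group values by key
    # sort the keys, then hand each key and its values to the reducer
    return dict(reducer(key, grouped[key]) for key in sorted(grouped))

# Hypothetical forum posts, used only to exercise the sketch.
posts = {
    "post-1": "outlook file problem",
    "post-2": "outlook backup problem",
}
term_counts = run_job(posts)
```

In a real cluster the mapper calls would run in parallel on the nodes holding each data block, and only the grouped intermediate pairs would travel over the network.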
20. What is Mahout?
• Open source machine learning library from Apache
• Began life in 2008 as a subproject of Apache’s Lucene Search Engine
• In 2009 absorbed the Taste open source collaborative filtering project
• In April 2010 became a stand-alone Project
• Written in Java
• ML algorithms mainly for
• Recommender Engines (CF-based)
• Clustering
• Classification
• Pre-Processing algorithms for Unstructured Data
• Scalability is achieved by Map/Reduce Implementations of ML Algorithms
We focused on Mahout Clustering and Pre-Processing Implementations in Map/Reduce
26. Content Pre-Processing
• Apache Lucene Text Analysis
• Tokenization, Non-Letter Removal, Lower Case Filtration, Stop Word Removal
• TF-IDF Weighting: Computing Numerical Weights for Content Terms
• n-gram Collocations
• Multi-Term Phrases having high probability of occurring together
• Examples: “Social Media”, “Data Mining”, “Machine Learning”
• Normalization
• Decreasing the magnitude of large document vectors & increasing the magnitude of small ones
• p-norm
• p depends on the similarity measure used
• With Text Content, the best similarity measures are Euclidean & Cosine, so p = 2
• Example: the 2-norm of a 3-dimensional vector [x, y, z] is $\sqrt{x^2 + y^2 + z^2}$
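The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Lucene or Mahout implementation: the stop-word list is a tiny invented sample, the weighting is the textbook tf × log(N / df) form (the exact formula used by Mahout may differ), and the helper names `analyze`, `tfidf_vectors`, `p_norm`, and `normalize` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "my", "for"}

def analyze(text):
    """Lower-case, split on non-letters, and drop stop words."""
    tokens = re.split(r"[^a-z]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [analyze(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def p_norm(values, p=2):
    """p-norm of a sequence of weights; p = 2 gives the Euclidean length."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def normalize(vector, p=2):
    """Scale a term vector to unit p-norm (all-zero vectors are returned as-is)."""
    norm = p_norm(vector.values(), p)
    if norm == 0:
        return dict(vector)
    return {term: weight / norm for term, weight in vector.items()}

# Hypothetical thread titles, used only to exercise the sketch.
titles = [
    "Outlook file problem",
    "Backup the hard disk",
    "Hard disk problem in Outlook",
]
unit_vectors = [normalize(v) for v in tfidf_vectors(titles)]
```

After normalization every non-empty title has length 1 under the 2-norm, so a long first post and a short title contribute comparably when Euclidean or cosine distance is computed.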
27. Content Clustering
Discovering Clusters of “similar” Points
Figure: the EM algorithm applied to a 2-component Gaussian mixture model on the Old Faithful Geyser dataset
http://bit.ly/oldfaithful
28. K-Means Clustering
Map/Reduce Implementation in Mahout
1. Start with three random points as centroids.
2. Map stage: assigns each point to the cluster nearest to it.
3. Reduce stage: the associated points are averaged out to produce the new location of the centroid.
4. After each iteration, the final configuration is fed back into the same loop until the centroids come to rest at their final positions.
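The four steps can be sketched as a single-process loop in Python. This is an illustrative approximation, not Mahout's Map/Reduce code: the function names, the `tolerance` and `max_iterations` parameters, and the sample points are assumptions, and the starting centroids are passed in explicitly (rather than drawn at random) so that the run is repeatable.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    """One pass: the 'map' step assigns points, the 'reduce' step averages them."""
    assigned = {}
    for p in points:  # map stage: emit (cluster index, point)
        assigned.setdefault(nearest(p, centroids), []).append(p)
    new_centroids = []
    for i, old in enumerate(centroids):  # reduce stage: one average per cluster
        members = assigned.get(i)
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
        else:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, centroids, tolerance=1e-6, max_iterations=100):
    """Repeat iterations until no centroid moves by more than `tolerance`."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

In a distributed run each iteration would be one Map/Reduce job, with the centroids from the previous job shipped to every mapper.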
29. Canopy Clustering
• Fast approximate clustering technique
• Divide the input set of points into overlapping clusters known as canopies
• In Mahout, it is used to estimate the approximate cluster centroids (or canopy centroids) using two distance thresholds, T1 and T2, with T1 > T2
1. Start with a point and mark it as part of a canopy.
2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies.
3. The points within the outer circle (distance T1) are also put in the same canopy, but they’re allowed to be part of other canopies. The assignment process is done in a single pass on a mapper.
4. The reducer computes the average of the centroid and merges close canopies.
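A sequential Python sketch of the canopy assignment may make the role of the two thresholds clearer. It covers steps 1 to 3 only and runs in one process; the mapper/reducer split and the centroid averaging of step 4 are left out. The function name `canopy_clustering` and the sample points are assumptions made for illustration.

```python
import math

def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # step 1: pick a point to start a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # step 3: inside the outer circle, joins the canopy
            if d >= t2:
                still_remaining.append(p)  # only points outside T2 may seed new canopies
        remaining = still_remaining  # step 2: points within T2 are dropped
        canopies.append((center, members))
    return canopies

# Hypothetical points on a line, chosen so that canopies overlap.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
canopies = canopy_clustering(points, t1=4.0, t2=2.0)
```

Here the point (3.0, 0.0) falls between T2 and T1 of the first center, so it belongs to the first canopy and also seeds a canopy of its own, which is the overlap the slide describes.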
30. Sensemaking in Online Forums
• Illustration of the Approach to support user sensemaking in Online Forums
• Content Collection from WebProWorld Technical Forums
• Large Forum (1000s of Discussion Threads)
• Discussions are Organized into Categories (Subforums) Defined by Forum Designers
• Four subforums were chosen for the experiment:
• Two subforums representing fairly specialized categories – SEO (Search Engine Optimization) and e-Commerce
• Two subforums representing broad categories – IT and Computer Assistance
• Objectives for the experiment
• Investigate the extent of sensemaking support needed for the public technical forum
• Determine which content representation for clustering is more appropriate to derive topic clouds for the sensemaker
• Illustrate how the output of the approach could provide sensemaking support
31. Clusters vs Categories
Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title
Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title and First Post
32. Content Representation
The smaller the average DBI, the better the model is for achieving a coherent set of similar discussions.
Clustering models having item distribution values closer to 1.0 will derive minor distinct clusters with topic-specific discussions.
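Assuming that DBI here stands for the Davies-Bouldin index, the following Python sketch shows why smaller values indicate more coherent clusters. It uses the standard definition (for each cluster, the worst ratio of combined within-cluster scatter to between-centroid distance, averaged over clusters); the function names and sample clusters are invented for illustration, and how the slides average the index across models is not specified in the source.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def scatter(points, center):
    """Average Euclidean distance from the cluster's points to its centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin(clusters):
    """Davies-Bouldin index; needs at least two clusters with distinct centroids."""
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, ctr) for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case similarity between cluster i and any other cluster
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k

# Hypothetical clusterings: same scatter, different separation.
well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
poorly_separated = [[(0.0, 0.0), (0.0, 2.0)], [(2.0, 0.0), (2.0, 2.0)]]
dbi_good = davies_bouldin(well_separated)
dbi_poor = davies_bouldin(poorly_separated)
```

Both clusterings have the same within-cluster scatter, but the well-separated one scores lower because its centroids are farther apart.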
33. Example Topic Clouds
Enabled Discovery of Topic-Specific Discussions not Obvious in Category Names:
• Disk & Keyboard Problems
• Security Issues
• Hard Disk Backup
• MS Outlook File Problems
• Certificates and Skills in Web Design
• Photo features in social networks (Facebook)
• Optimizing Search Engines for Blog Search
• Design of Data Warehousing Systems
35. Conclusion
• Big Data creates a Big Challenge to sensemaking in Online Collaborative Spaces
• Distributed Data Mining with Hadoop Map/Reduce and Mahout is exploited to support user sensemaking by summarizing the huge content found in Large-scale Discussion Forums
• Cluster Analysis shows that Different User-created Categories may contain similar Collaborative Content, making it difficult for users to find the content that addresses their problems / interests
• Clustering of content represented by titles produces more coherent clusters with more ability to uncover fine-grained discussions that are buried in the huge amount of content
• Mahout is not currently perfect!
• Lack of Clustering Validity Measures
• Lack of Dimension Reduction Algorithms (e.g. LSI) important to improve clustering results
• Lack of GUI Support
36. Thank You
Ahmad Ammari
A.Ammari@leeds.ac.uk