School of Computing
FACULTY OF ENGINEERING




Distributed Data Mining for User Sensemaking
        in Online Collaborative Spaces


                             Submitted to:
               DicoSyn2012 Workshop @ CSCW’12


Presented By: Ahmad Ammari
RF in User & Community Modelling
OUTLINE
• The Big Data “Problem” in Online Collaborative Spaces
• What is User Sensemaking, and How is Big Data Affecting it?
• Can Distributed Data Mining Help?
  • What are Hadoop & Map/Reduce?
  • What is Mahout?
• Proposed Approach to Support User Sensemaking in OCS
  • Content Pre-Processing
  • Content Clustering
  • Topic Modelling
• Case Study: Making Sense of Online Forums
  • How are Discussions Currently Organized? Clusters vs. Categories
  • Which Content to Mine? Mining the Right Discussion Parts
  • How Can This Help Sensemaking? Some Usage Scenarios
How “Big” is Big Data?
• Emails
  • 90 Trillion – The Number of Emails Sent on the Internet in 2009
  • 107 Trillion – The Number of Emails Sent on the Internet in 2010
• Websites
  • 234 Million – The Number of Websites by Dec 2009
  • 255 Million – The Number of Websites by Dec 2010
• Social Media
  • 152 Million – The Number of Blogs on the Internet in 2010
  • 25 Billion – The Number of Tweets Sent on Twitter in 2010
• Multimedia
  • 5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
  • 2 Billion – The Number of Videos Watched per Day on YouTube
What about Online CS?

                They are Big Too!
 Top 10 biggest Internet forums




What about Online CS?

               They are Big Too!
 Stack Exchange Family of Forums




Why is it a Problem?




“Where should I post my programming question to get relevant replies?”




Why is it a Problem?




“Where to find a solution to my MS Outlook problem?”




Why is it a Problem?




“What are the discussions really about?”




“I cannot make sense of Big Content!”

Why is Making Sense of Big Data Neither Easy nor Fast?
• Because it’s Big, and still increasing!
• Because it’s Diverse!
  • The Stack Exchange Suite of Forums has more than 50 Different Technical Discussion Forums
  • The WebProWorld Technical Forums have more than 40 Discussion Categories
• Because it’s Dynamic!
  • 294 Billion – The Average Number of Email Messages per Day
  • 21.4 Million – The Number of Websites Added in 2010
  • 96,101 – The Number of New Blogs in the Last 24 Hours (8th Dec 2011)
  • 190 Million – The Number of Tweets per Day in June 2011
• Because it’s Noisy!
  • 200 Billion – The Number of Spam Emails per Day in 2009
  • 262 Billion – The Number of Spam Emails per Day in 2010
But What is “Sensemaking”?!
• Creating a representation of a collection of information [Russell et al, 1993]
  • Focused on the context of understanding large document collections [Paul et al, 2011]
• Transforming Information into Knowledge [Pirolli & Card, 2005]
  • Seeking, filtering, searching for relations, extracting, schematizing
• Understanding connections among people, places, and events [Klein et al, 2006]
Our Solution!

Large-Scale Data Processing:
• Quick Data Processing
• Scalable Data Processing
• Robust Data Processing

Knowledge Discovery in Big Content:
• Analysis of Unstructured Data
• Machine Intelligence to Support Humans
What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines
  • Up to 4,000 machines in a cluster
  • Up to 20 PB in a cluster
• Open Source Apache project
• Implemented in Java
• Contains Many Sub-Projects:
  • Map/Reduce – Software Framework for Distributed Processing of Large Datasets
  • HDFS – Hadoop Distributed File System
  • Hadoop Common – Provides Access to the File Systems Supported by Hadoop
  • Chukwa – Data Collection System for Managing Large Distributed Systems
  • HBase – Scalable, Distributed Database that Supports Structured Data Storage
  • Hive – Data Warehouse Infrastructure that Provides Data Summarization & Ad Hoc Querying
  • Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
  • ZooKeeper – High-Performance Coordination Service for Distributed Applications

Note: We focused on distributed computation with Map/Reduce.
Who Uses Hadoop?




Why Do They Use Hadoop?




Hadoop Map/Reduce

• Simply: a parallel programming model and an associated implementation
• Abstract model: hides many system-level details from the programmer
• Move-code-to-data philosophy: computation on a piece of data takes place on the same machine where that piece resides
• A Map/Reduce Job runs in Phases, and each Phase runs in Parallel across all Nodes in the Hadoop Cluster
• Main Phases: Mapping, Reducing
• Are there Other Phases? Yes!
  • Shuffling & Sorting, Combining, Partitioning
  • But the programmer writes only the “Mapper” and “Reducer” functions (see the sketch below)!
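To make the division of labour concrete, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. It is not taken from the presentation: the class names, input format, and whitespace tokenization are illustrative assumptions.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: runs where its input split resides and emits (term, 1) per token.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE); // shuffled & sorted by the framework
        }
      }
    }

    // Reducer: receives (term, [1, 1, ...]) after the shuffle and sums the counts.
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(term, new IntWritable(sum));
      }
    }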
Hadoop Map/Reduce

More formally,
• Map(k1, v1) → list(k2, v2)
  • Shuffle & Sort(list(k2, v2)) → (k2, list(v2))
• Reduce(k2, list(v2)) → list(k3, v3)
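Tracing a word count (as in the sketch above) through those signatures, with illustrative input:

    Map("doc1", "to be or not to be") → [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
    Shuffle & Sort → ("be", [1, 1]), ("not", [1]), ("or", [1]), ("to", [1, 1])
    Reduce("be", [1, 1]) → ("be", 2), and likewise for each distinct term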
Our Solution!

Large-Scale Data Processing:
• Quick Data Processing
• Scalable Data Processing
• Robust Data Processing

Knowledge Discovery in Big Content:
• Analysis of Unstructured Data
• Machine Intelligence to Support Humans
What is Mahout?
• Open source machine learning library from Apache
• Began life in 2008 as a subproject of Apache’s Lucene Search Engine
• In 2009 absorbed the Taste open source collaborative filtering project
• In 2010 became a stand-alone Apache Project
• Written in Java
• ML algorithms mainly for:
  • Recommender Engines (CF-based)
  • Clustering
  • Classification
• Pre-Processing algorithms for Unstructured Data
• Scalability is achieved by Map/Reduce Implementations of the ML Algorithms

Note: We focused on the Mahout Clustering and Pre-Processing Implementations in Map/Reduce (a usage sketch follows).
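As a rough idea of what “Map/Reduce implementations of ML algorithms” look like to client code, here is a minimal k-means invocation sketched against the Mahout 0.5-era Java API. The KMeansDriver.run signature changed between releases, so treat it, and the hypothetical paths, as assumptions to verify against your version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;

    public class ClusterDiscussions {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path vectors = new Path("forum/tfidf-vectors");   // hypothetical: TF-IDF vectors from pre-processing
        Path seeds = new Path("forum/initial-clusters");  // hypothetical: initial centroids
        Path output = new Path("forum/kmeans-clusters");
        // Convergence delta 0.01, at most 10 iterations, run the final
        // clustering pass (true), distributed rather than sequential (false).
        KMeansDriver.run(conf, vectors, seeds, output,
            new CosineDistanceMeasure(), 0.01, 10, true, false);
      }
    }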
Sensemaking-Support with DDM




INPUT: Collaboration Content (Discussions)

Sensemaking-Support with DDM




Content Pre-Processing: Prepare Content for Mining

Sensemaking-Support with DDM




Content Clustering: Derive Groups of Similar Content

Sensemaking-Support with DDM




Topic Modelling: Identify Fine-Grained Topics and Generate Topic Clouds
Sensemaking-Support with DDM




            OUTPUT: Topic Clouds

Content Pre-Processing

• Apache Lucene Text Analysis
  • Tokenization, Non-Letter Removal, Lower-Case Filtration, Stop-Word Removal
• TF-IDF Weighting: Computing Numerical Weights for Content Terms
• n-gram Collocations
  • Multi-Term Phrases having a high probability of occurring together
  • Examples: “Social Media”, “Data Mining”, “Machine Learning”
• Normalization
  • Decreasing the magnitude of large document vectors & increasing the magnitude of small ones
  • p-norm; p depends on the similarity measure used
  • With text content, the best similarity measures are Euclidean & Cosine → p = 2
  • Example: the 2-norm of a 3-dimensional vector [x, y, z] is √(x² + y² + z²) (a short sketch using Mahout’s math library follows)
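A small sketch of the p = 2 normalization step using Mahout’s math library; the vector values are made up:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class NormDemo {
      public static void main(String[] args) {
        // A toy 3-dimensional weight vector [x, y, z] = [3, 4, 0].
        Vector doc = new DenseVector(new double[] {3.0, 4.0, 0.0});
        System.out.println(doc.norm(2));   // √(3² + 4² + 0²) = 5.0
        Vector unit = doc.normalize(2);    // each weight divided by the 2-norm
        System.out.println(unit.norm(2));  // 1.0 – the normalized vector has unit length
      }
    }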
Content Clustering

Discovering Clusters of “similar” Points

[Figure: the EM algorithm fitting a 2-component Gaussian mixture model to the Old Faithful Geyser dataset – http://bit.ly/oldfaithful]
K-Means Clustering

Map/Reduce Implementation in Mahout (a plain-Java sketch of the per-iteration logic follows):
1. Start with three random points as centroids.
2. Map stage: assign each point to the cluster nearest to it.
3. Reduce stage: the associated points are averaged to produce the new location of each centroid.
4. After each iteration, the resulting configuration is fed back into the same loop until the centroids come to rest at their final positions.
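A plain-Java sketch of the logic the two stages implement in one iteration. This is not Mahout’s actual code; 2-D points and squared Euclidean distance are simplifying assumptions.

    public class KMeansIteration {
      // "Map" logic: the index of the centroid nearest to a point.
      static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
          double dx = p[0] - centroids[i][0], dy = p[1] - centroids[i][1];
          double d = dx * dx + dy * dy; // squared Euclidean distance
          if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
      }

      // "Reduce" logic: average the points assigned to each centroid.
      static double[][] recompute(double[][] points, double[][] centroids) {
        double[][] sums = new double[centroids.length][2];
        int[] counts = new int[centroids.length];
        for (double[] p : points) {
          int c = nearest(p, centroids); // emitted as (c, p) by the mapper
          sums[c][0] += p[0];
          sums[c][1] += p[1];
          counts[c]++;
        }
        for (int i = 0; i < centroids.length; i++) {
          if (counts[i] > 0) {
            sums[i][0] /= counts[i];
            sums[i][1] /= counts[i];
          } else {
            sums[i] = centroids[i]; // keep an empty cluster's centroid in place
          }
        }
        return sums; // fed back as the next iteration's centroids
      }
    }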
Canopy Clustering
• Fast, approximate clustering technique
• Divides the input set of points into overlapping clusters known as canopies
• In Mahout, it is used to estimate the approximate cluster centroids (or canopy centroids) using two distance thresholds, T1 and T2, with T1 > T2 (a plain-Java sketch of the assignment pass follows the steps):
  1. Start with a point and mark it as part of a canopy.
  2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies.
  3. The points within the outer circle (distance T1) are also put in the same canopy, but they’re allowed to be part of other canopies. The assignment process is done in a single pass on a mapper.
  4. The reducer computes the average of the centroids and merges close canopies.
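The single-pass assignment step in plain Java, under the same simplifying assumptions as the k-means sketch (this mirrors the steps described above, not Mahout’s internal classes):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;

    public class CanopyAssignment {
      static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
      }

      // Each remaining point seeds a canopy; points within T1 join it, and
      // points within T2 are bound to it and cannot seed or join new canopies.
      static List<List<double[]>> buildCanopies(List<double[]> points, double t1, double t2) {
        List<List<double[]>> canopies = new ArrayList<>();
        List<double[]> remaining = new LinkedList<>(points);
        while (!remaining.isEmpty()) {
          double[] seed = remaining.remove(0);
          List<double[]> canopy = new ArrayList<>();
          canopy.add(seed);
          Iterator<double[]> it = remaining.iterator();
          while (it.hasNext()) {
            double[] p = it.next();
            double d = dist(seed, p);
            if (d < t1) canopy.add(p); // outer circle: joins this canopy
            if (d < t2) it.remove();   // inner circle: removed from further consideration
          }
          canopies.add(canopy);
        }
        return canopies;
      }
    }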
Sensemaking in Online Forums

• Illustration of the approach to supporting user sensemaking in Online Forums
• Content collected from the WebProWorld Technical Forums
• A large forum (1000s of Discussion Threads)
• Discussions are organized into Categories (Subforums) defined by the Forum Designers
• Four subforums were chosen for the experiment:
  • Two subforums representing fairly specialized categories – SEO (Search Engine Optimization) and e-Commerce
  • Two subforums representing broad categories – IT and Computer Assistance
• Objectives for the experiment:
  • Investigate the extent of sensemaking support needed for the public technical forum
  • Determine which content representation for clustering is more appropriate to derive topic clouds for the sensemaker
  • Illustrate how the output of the approach could provide sensemaking support
Clusters vs Categories

[Figure, left: Distribution of the Four Categories in the Four Mahout-based Clusters, by Title]
[Figure, right: Distribution of the Four Categories in the Four Mahout-based Clusters, by Title and First Post]
Content Representation

The smaller the average DBI, the better the model is at achieving a coherent set of similar discussions. Clustering models with item-distribution values closer to 1.0 will derive minor, distinct clusters with topic-specific discussions.
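For reference (the slide does not spell it out; this assumes the standard Davies–Bouldin Index definition): for k clusters with centroids cᵢ and average within-cluster scatter σᵢ,

    DBI = (1/k) Σᵢ max(j≠i) (σᵢ + σⱼ) / d(cᵢ, cⱼ)

Tight, well-separated clusters drive the index down, which is why a smaller average DBI indicates a more coherent clustering.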
Example Topic Clouds

Enabled Discovery of Topic-Specific Discussions not Obvious in the Category Names:
• Disk & Keyboard Problems
• Security Issues
• Hard Disk Backup
• MS Outlook File Problems
• Certificates and Skills in Web Design
• Photo Features in Social Networks (Facebook)
• Optimizing Search Engines for Blog Search
• Design of Data Warehousing Systems
Cross Validated Statistics Forum




Conclusion

• Big Data creates a Big Challenge for sensemaking in Online Collaborative Spaces
• Distributed Data Mining with Hadoop Map/Reduce and Mahout is exploited to support user sensemaking by summarizing the huge content found in Large-Scale Discussion Forums
• Cluster Analysis shows that Different User-Created Categories may contain similar Collaborative Content, making it difficult for users to find the content that addresses their problems / interests
• Clustering content represented by titles produces more coherent clusters, better able to uncover fine-grained discussions that are buried in the huge amount of content
• Mahout is not currently perfect!
  • Lack of Clustering Validity Measures
  • Lack of Dimension Reduction Algorithms (e.g. LSI) important to improve clustering results
  • Lack of GUI Support
School of Computing
FACULTY OF ENGINEERING




                          Thank You

                            Ahmad Ammari
                         A.Ammari@leeds.ac.uk
