TunUp: A Distributed Cloud-based
Genetic Evolutionary Tuning for Data
Clustering

Gianmario Spacagna
gm.spacagna@gmail.com

March 2013



AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights into customers, products and companies.

Can someone else know your customers better than you do?
Do you have the domain knowledge and a proper computing infrastructure?
Big Data as a Service (BDaaS)
Problem Description

[Figure: customers plotted by income and cost, grouped into clusters]
Tuning of Clustering Algorithms
We need tuning when:
➢ A new algorithm or a new version is released
➢ We want to improve accuracy and/or performance
➢ A new customer arrives and the system must be adapted to the new dataset and requirements
TunUp
Java framework integrating JavaML and Watchmaker.

Main features:
➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Validation of evaluation techniques (Pearson correlation t-test)
➢ Full search space tuning
➢ Genetic algorithm tuning (local and parallel implementations)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup
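
As a minimal sketch of the data-loading step, using the underlying JavaML API that TunUp wraps (the file name and column layout here are assumptions, not taken from the slides):

    import java.io.File;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.tools.data.FileHandler;

    public class LoadExample {
        public static void main(String[] args) throws Exception {
            // Load a labelled dataset: we assume a comma-separated file whose
            // third column (index 2) carries the ground-truth class label.
            Dataset data = FileHandler.loadDataset(new File("data/D31.csv"), 2, ",");
            System.out.println("Loaded " + data.size() + " instances");
        }
    }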
k-means
Geometric hard-assignment clustering algorithm: it partitions n data points into k clusters such that each point belongs to the cluster with the nearest centroid.

Given clusters S = {S_1, ..., S_k}, where x_j denotes a point assigned to cluster S_i and \mu_i the mean (centroid) of that cluster, the goal of k-means is minimizing the Within-Cluster Sum of Squares:

\[ \underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \]

Algorithm:
1. Initialization: generate a set of k random centroids
2. Assignment: assign each point to the closest centroid
3. Update: recompute each centroid as the mean of its new cluster
4. Repeat from step 2 until convergence (the centroids are stable and no longer change)
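
A minimal run of this algorithm through JavaML, which TunUp builds on, might look like the following; data is a Dataset loaded as in the earlier sketch:

    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.distance.EuclideanDistance;

    // k = 31 clusters, at most 20 iterations, Euclidean distance:
    // one candidate configuration of the tuning problem described next.
    KMeans kmeans = new KMeans(31, 20, new EuclideanDistance());
    Dataset[] clusters = kmeans.cluster(data);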
k-means tuning
Input parameters required:
1. k = 2, ..., 40
2. Distance measure, one of: Angular, Chebyshev, Cosine, Euclidean, Jaccard Index, Manhattan, Pearson Correlation Coefficient, Radial Basis Function Kernel, Spearman Footrule
3. Max iterations = 20 (fixed)

The grid therefore holds 39 values of k × 9 distance measures = 351 candidate configurations.

Different input parameters, very different outcomes!
Clustering Evaluation
Definition of cluster: "A group of the same or similar elements gathered or occurring closely together."

How do we evaluate whether a set of clusters is good or not?

"Clustering is in the eye of the beholder" [E. Castro, 2002]

Two main categories:
➢ Internal criterion: based only on the clustered data itself
➢ External criterion: based on benchmarks of pre-classified items
Internal Evaluation
The common goal is to assign better scores when:
➢ Intra-cluster similarity is high
➢ Inter-cluster similarity is low

The choice of evaluation technique depends on the nature of the data and on the cluster model of the algorithm.

Cluster models:
➢ Distance-based (k-means)
➢ Density-based (DBSCAN)
➢ Distribution-based (EM clustering)
➢ Connectivity-based (linkage clustering)
Proposed techniques
AIC (Akaike Information Criterion): measures the relative amount of information lost by a statistical model; the clustering algorithm is modelled as a Gaussian mixture process. (Inverted function: lower is better.)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (Natural function: higher is better.)

Davies-Bouldin: average similarity between each cluster and its most similar one. (Inverted function.)

Silhouette: measures how well each point lies within its cluster, indicating whether the object is correctly clustered or would fit better in a neighbouring cluster. (Natural function.)
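
The slides give these definitions only in words; for reference, the standard formulations of the Dunn and Davies-Bouldin indices are:

\[ D = \frac{\min_{i \neq j} \delta(S_i, S_j)}{\max_{l} \Delta(S_l)} \qquad DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \]

where \(\delta(S_i, S_j)\) is the distance between clusters i and j, \(\Delta(S_l)\) the diameter of cluster l, \(c_i\) the centroid of cluster i, and \(\sigma_i\) the average distance of its points from \(c_i\).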
External criterion: AdjustedRand
Given a set of n elements S = {o_1, ..., o_n} and two partitions to compare, X = {X_1, ..., X_r} and Y = {Y_1, ..., Y_s}:

\[ \mathrm{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}} \]

\[ \mathrm{AdjustedRandIndex} = \frac{\mathrm{RandIndex} - \mathrm{ExpectedIndex}}{\mathrm{MaxIndex} - \mathrm{ExpectedIndex}} \]

We use AdjustedRand as the reference for the best clustering evaluation, and validate the internal criteria against it.
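
In pair-counting form (the standard Hubert-Arabie formulation, which the slide leaves implicit), with \(n_{ij} = |X_i \cap Y_j|\), \(a_i = \sum_j n_{ij}\) and \(b_j = \sum_i n_{ij}\):

\[ \mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}} \]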
Correlation t-test
Pearson correlation between each internal criterion and AdjustedRand, over the evaluations of 120 random k-means configurations.

Average correlations:
AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49

AIC correlates best with the external reference, which is why it is later chosen as the fitness function for tuning.
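
The correlations themselves can be computed with any statistics library; a sketch using Apache Commons Math (an assumption: the slides do not name the library used, and loadScores() is a hypothetical helper):

    import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

    // aicScores[i] and aRandScores[i] hold the AIC and AdjustedRand
    // evaluations of the i-th of the 120 random k-means configurations.
    double[] aicScores = loadScores("aic");     // hypothetical helper
    double[] aRandScores = loadScores("arand"); // hypothetical helper
    double r = new PearsonsCorrelation().correlation(aicScores, aRandScores);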
Dataset
D31: 3100 vectors, 2 dimensions, 31 clusters
S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean

[Plots: distributions of the AdjustedRand and AIC scores across the 200 runs]

Because the initial centroids are random, repeated runs of the same configuration yield widely spread scores. We can consider the median value!
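
A sketch of this median-over-runs evaluation; evaluate() is a stand-in for whichever scoring function (AIC or AdjustedRand) is in use:

    import java.util.Arrays;
    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.distance.EuclideanDistance;

    // Re-run the same configuration many times; each run draws new random
    // initial centroids, so the scores spread out. Report the median.
    double[] scores = new double[200];
    for (int i = 0; i < scores.length; i++) {
        Dataset[] clusters = new KMeans(31, 20, new EuclideanDistance()).cluster(data);
        scores[i] = evaluate(clusters); // hypothetical scoring helper
    }
    Arrays.sort(scores);
    double median = scores[scores.length / 2];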
Full space evaluation
Each configuration's score is averaged over N = 20 executions.

[Plot: averaged evaluation scores over the full parameter grid]

The global optimum is:
k = 36
DistanceMeasure = Euclidean
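
A sketch of the exhaustive search under these settings; averageScore() is a hypothetical helper that reruns a configuration and averages its scores, and AIC is inverted, so lower is better:

    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.distance.DistanceMeasure;
    import net.sf.javaml.distance.EuclideanDistance;
    import net.sf.javaml.distance.ManhattanDistance;

    DistanceMeasure[] measures = {
        new EuclideanDistance(), new ManhattanDistance()
        // ... remaining candidate measures
    };
    double bestScore = Double.POSITIVE_INFINITY;
    int bestK = -1;
    DistanceMeasure bestDm = null;
    // Every k in [2, 40] crossed with every distance measure,
    // each configuration averaged over 20 executions.
    for (DistanceMeasure dm : measures) {
        for (int k = 2; k <= 40; k++) {
            double score = averageScore(new KMeans(k, 20, dm), data, 20);
            if (score < bestScore) { bestScore = score; bestK = k; bestDm = dm; }
        }
    }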
Genetic Algorithm Tuning
Selection: elitism + roulette wheel.

Crossover (single-point) on two parent parameter vectors:

[x_1, x_2, x_3, x_4, ..., x_m]
[y_1, y_2, y_3, y_4, ..., y_m]
            ↓
[x_1, x_2, x_3, y_4, ..., y_m]
[y_1, y_2, y_3, x_4, ..., x_m]

Mutation:

\[ \Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\mathrm{distance}(k_i, k_j)} \qquad \Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{\mathrm{dist}} - 1} \]

That is, a mutation of k favours values close to the current one, while the distance measure mutates uniformly among the other N_dist - 1 candidates.
Tuning parameters (wired up in the sketch below):
Fitness evaluation: AIC
Mutation probability: 0.5
Crossover probability: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
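
Wired through Watchmaker, these settings translate roughly as follows; KMeansConfig, configFactory, operators and fitnessEvaluator are TunUp-specific stand-ins, named here only for illustration:

    import java.util.Random;
    import org.uncommons.watchmaker.framework.EvolutionEngine;
    import org.uncommons.watchmaker.framework.GenerationalEvolutionEngine;
    import org.uncommons.watchmaker.framework.operators.EvolutionPipeline;
    import org.uncommons.watchmaker.framework.selection.RouletteWheelSelection;
    import org.uncommons.watchmaker.framework.termination.Stagnation;

    EvolutionEngine<KMeansConfig> engine = new GenerationalEvolutionEngine<KMeansConfig>(
            configFactory,                                  // random (k, distance) candidates
            new EvolutionPipeline<KMeansConfig>(operators), // crossover p=0.9, mutation p=0.5
            fitnessEvaluator,                               // AIC, averaged over repeated runs
            new RouletteWheelSelection(),
            new Random());
    // Population of 6 with 1 elite; stop after 5 stagnant generations.
    // AIC is an inverted (non-natural) fitness, hence the 'false'.
    KMeansConfig best = engine.evolve(6, 1, new Stagnation(5, false));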




Relevant results:
➢ The best fitness value always decreases (improves; AIC is inverted)
➢ The mean fitness value trends downward
➢ High standard deviation in one population often produces a better mean in the next one
Results

Test 1: k = 39, Distance Measure = Manhattan
Test 2: k = 33, Distance Measure = RBF Kernel
Test 3: k = 36, Distance Measure = Euclidean

Different results due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Simulation: 10 evolutions, POP_SIZE = 5, no elitism.
Platform: Amazon Elastic Compute Cloud (EC2), 10 micro instances.

Optimal number of servers = POP_SIZE - ELITISM

E[T single evolution] ≤
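
The master/slave layout can be sketched locally with a thread pool standing in for the EC2 workers (KMeansConfig, population and fitnessEvaluator as in the earlier sketch); with one slave per candidate, POP_SIZE - ELITISM servers keep every evaluation in flight at once:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    ExecutorService slaves = Executors.newFixedThreadPool(POP_SIZE - ELITISM);
    List<Future<Double>> fitnesses = new ArrayList<Future<Double>>();
    for (KMeansConfig c : population) {
        // Each candidate's (expensive) fitness evaluation runs on its own slave.
        fitnesses.add(slaves.submit(() -> fitnessEvaluator.getFitness(c, population)));
    }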
Conclusions
We developed, tested and analysed TunUp, an open-source solution for the evaluation, validation and tuning of data clustering algorithms.

Future applications:
➢ Tuning of existing algorithms
➢ Supporting the design of new algorithms
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master/slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!
