This document summarizes a distributed cloud-based genetic algorithm framework called TunUp for tuning the parameters of data clustering algorithms. TunUp integrates existing machine learning libraries and implements genetic algorithm techniques to tune parameters like K (number of clusters) and distance measures for K-means clustering. It evaluates internal clustering quality metrics on sample datasets and tunes parameters to optimize a chosen metric like AIC. The document outlines TunUp's features, describes how it implements genetic algorithms and parallelization, and concludes it is an open solution for clustering algorithm evaluation, validation and tuning.
1 of 25
Download to read offline
More Related Content
TunUp final presentation
1. TunUp: A Distributed Cloud-based
Genetic Evolutionary Tuning for Data
Clustering
Gianmario Spacagna
gm.spacagna@gmail.com
March 2013
AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
2. Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
4. Business Intelligence
Why ? Where? What? How?
Insights of customers, products and companies
Can someone else know your customer better than you?
Do you have the domain knowledge and proper computation
infrastructure?
7. Tuning of Clustering
Algorithms
We need tuning when:
➢
New algorithm or version is released
➢
We want to improve accuracy and/or performance
➢
New customer comes and the system must be adapted for the new
dataset and requirements
9
8. TunUp
Java framework integrating JavaML and Watchmaker
Main features:
➢
Data manipulation (loading, labelling and normalization)
➢
Clustering algorithms (k-means)
➢
Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢
Evaluation techniques validation (Pearson Correlation t-test)
➢
Full search space tuning
➢
Genetic Algorithm tuning (local and parallel implementation)
➢
RESTful API for web ser vice deployment (tomcat in Amazon EC2)
Open-source: http://github.com/gm-spacagna/tunup
9. k-means
Geometric hard-assigning Clustering algorithm:
It partitions n data points into k clusters in which each point belongs to
the cluster with the nearest mean centroid.
If we have k clusters in the set S = S1,....,Sk where xj and μ represents the jth point in the specified
cluster, the goal of k-means is minimizing the Within-Cluster Sum of Squares:
Algorithm:
1. Initialization : a set of k random centroids are generated
2. Assignment: each point is assigned to the closest centroid
3. Update: the new centroids are calculated as the mean of the new clusters
4. Go to 2 until the convergence (centroids are stable and do not change)
10. k-means tuning
Input parameters required: 0. Angular
2. Chebyshev
1. K = (2,...,40) 3. Cosine
4. Euclidean
2. Distance measure 5. Jaccard Index
6. Manhattan
3. Max iterations = 20 (fixed) 7. Pearson Correlation Coefficient
8. Radial Basis Function Kernel
9. Spearman Footrule
Different input parameters
Ver y different outcomes!!!
11. Clustering Evaluation
Definition of cluster:
“A group of the same or similar elements gathered or occurring closely
together”
How do we evaluate if a set of clusters is good or not?
“Clustering is in the eye of the beholder” [E. Castro, 2002]
Two main categories:
➢
Internal criterion : only based on the clustered data itself
➢
External criterion : based on benchmarks of pre-classified items
12. Internal Evaluation
Common goal is assigning better scores when:
➢
High intra-cluster similarity
➢
Low inter-cluster similarity
The choice of the evaluation technique depends on the
nature of the data and the cluster model of the algorithm.
Cluster models:
➢
Distance-based (k-means)
➢
Density-based (EM-clustering)
➢
Distribution-based (DBSCAN)
➢
Connectivity-based (linkage clustering)
13. Proposed techniques
AIC: measure of the relative quantity of lost information of a statistical
model. The clustering algorithm is modelled as a Gaussian Mixture Process.
(inverted function)
Dunn: ratio between the minimum inter-clusters similarity and maximum
cluster diameter. (natural fn.)
Davies-Bouldin : average similarity between each cluster and its most
similar one. (inverted fn.)
Silhouette: measure of how well each point lies within its cluster. Indicates
if the object is correctly clustered or if it would be more appropriate into the
neighbouring cluster. (natural fn.)
14. External criterion:
AdjustedRand
Given a a set of n elements S = {o1,...,on} and two partitions to compare:
X={X1,...,Xr} and Y={Y1,...,Ys}
number of agreements between X and Y
RandIndex =
total number of possible pair combinations
RandIndex−ExpectedIndex
AdjustedRandIndex=
MaxIndex−ExpectedIndex
We can use AdjustedRand as reference of the best clustering evaluation and
use it as validation for the internal criterion.
15. Correlation t-test
Pearson correlation over a set of 120
random k-means configuration
evaluations:
Average correlations:
AIC : 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
17. Initial Centroids issue
N. observations = 200
Input Configuration: k = 31 , Distance Measure = Eclidean
AdjustedRand AIC
We can consider the median value!
18. Full space evaluation
N executions averaged = 20
Global optimal is for:
K = 36
DistanceMeasure = Euclidean
19. Genetic Algorithm Tuning
Crossovering:
[x1,x2,x3,x4,...,xm]
[y1,y2,y3,y4,...,ym]
Elitism
+
Roulette wheel
[x1,x2,x3,y4,...,ym]
[y1,y2,y3,x4,...,xm]
Mutation:
1
Pr (mutate k i →k j )∝
distance ( k i , k j )
1
Pr (mutate d i →d j )=
N dist −1
20. Tuning parameters:
Fitness Evaluation : AIC
Prob. mutation: 0.5
Prob. Crossovering: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
Relevant results:
➢
Best fitness value always decreasing
➢
Mean fitness value trend decreasing
➢
High standard deviation in the previous
population often generates a better mean
population in the next one
21. Results
Test1:
k = 39, Distance Measure = Manhattan
Test2:
k = 33, Distance Measure = RBF Kernel
Test3:
k = 36, Distance Measure = Euclidean
Different results due to:
1. Early convergence
2. Random initial centroids
22. Parallel GA
Simulation: Amazon Elastic Compute Cloud EC2
10 evolutions, POP_SIZE = 5, no elitism 10 x Micro instances
Optimal n. of ser vers = POP_SIZE – ELITISM
E[T single evolution] ≤
23. Conclusions
We developed, tested and analysed TunUp, an open-solution for:
Evaluation, Validation , Tuning of Data Clustering Algorithms
Future applications :
➢
Tuning of existing algorithms
➢
Supporting new algorithms design
➢
Evaluation and comparison of different algorithms
Limitations:
➢
Single distance measure
➢
Equal normalization
➢
Master / slave parallel execution
➢
Random initial centroids