TunUp: A Distributed Cloud-based
Genetic Evolutionary Tuning for Data
Clustering

Gianmario Spacagna
gm.spacagna@gmail.com

March 2013



AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights into customers, products and companies.

Can someone else know your customers better than you do?
Do you have the domain knowledge and a proper computing infrastructure?
Big Data as a Service (BDaaS)
Problem Description

[Figure: customers plotted by income and cost, grouped into clusters]
Tuning of Clustering Algorithms
We need tuning when:
➢ A new algorithm or a new version is released
➢ We want to improve accuracy and/or performance
➢ A new customer arrives and the system must be adapted to the new dataset and requirements
TunUp
Java framework integrating JavaML and Watchmaker.

Main features:
➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Validation of evaluation techniques (Pearson correlation t-test)
➢ Full search space tuning
➢ Genetic algorithm tuning (local and parallel implementations)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup
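
As a minimal sketch of the data-loading step, using the underlying JavaML API that TunUp wraps (the file name and column layout here are assumptions, not taken from the slides):

    import java.io.File;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.tools.data.FileHandler;

    public class LoadExample {
        public static void main(String[] args) throws Exception {
            // Load a labelled dataset: we assume a comma-separated file whose
            // third column (index 2) carries the ground-truth class label.
            Dataset data = FileHandler.loadDataset(new File("data/D31.csv"), 2, ",");
            System.out.println("Loaded " + data.size() + " instances");
        }
    }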
k-means
Geometric hard-assignment clustering algorithm: it partitions n data points into k clusters such that each point belongs to the cluster with the nearest centroid.

Given clusters S = {S_1, ..., S_k}, where x_j denotes a point assigned to cluster S_i and \mu_i the mean (centroid) of that cluster, the goal of k-means is minimizing the Within-Cluster Sum of Squares:

\[ \underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \]

Algorithm:
1. Initialization: generate a set of k random centroids
2. Assignment: assign each point to the closest centroid
3. Update: recompute each centroid as the mean of its new cluster
4. Repeat from step 2 until convergence (the centroids are stable and no longer change)
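
A minimal run of this algorithm through JavaML, which TunUp builds on, might look like the following; data is a Dataset loaded as in the earlier sketch:

    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.distance.EuclideanDistance;

    // k = 31 clusters, at most 20 iterations, Euclidean distance:
    // one candidate configuration of the tuning problem described next.
    KMeans kmeans = new KMeans(31, 20, new EuclideanDistance());
    Dataset[] clusters = kmeans.cluster(data);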
k-means tuning
Input parameters required:
1. k = 2, ..., 40
2. Distance measure, one of: Angular, Chebyshev, Cosine, Euclidean, Jaccard Index, Manhattan, Pearson Correlation Coefficient, Radial Basis Function Kernel, Spearman Footrule
3. Max iterations = 20 (fixed)

The grid therefore holds 39 values of k × 9 distance measures = 351 candidate configurations.

Different input parameters, very different outcomes!
Clustering Evaluation
Definition of cluster: "A group of the same or similar elements gathered or occurring closely together."

How do we evaluate whether a set of clusters is good or not?

"Clustering is in the eye of the beholder" [E. Castro, 2002]

Two main categories:
➢ Internal criterion: based only on the clustered data itself
➢ External criterion: based on benchmarks of pre-classified items
Internal Evaluation
The common goal is to assign better scores when:
➢ Intra-cluster similarity is high
➢ Inter-cluster similarity is low

The choice of evaluation technique depends on the nature of the data and on the cluster model of the algorithm.

Cluster models:
➢ Distance-based (k-means)
➢ Density-based (DBSCAN)
➢ Distribution-based (EM clustering)
➢ Connectivity-based (linkage clustering)
Proposed techniques
AIC (Akaike Information Criterion): measures the relative amount of information lost by a statistical model; the clustering algorithm is modelled as a Gaussian mixture process. (Inverted function: lower is better.)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (Natural function: higher is better.)

Davies-Bouldin: average similarity between each cluster and its most similar one. (Inverted function.)

Silhouette: measures how well each point lies within its cluster, indicating whether the object is correctly clustered or would fit better in a neighbouring cluster. (Natural function.)
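
The slides give these definitions only in words; for reference, the standard formulations of the Dunn and Davies-Bouldin indices are:

\[ D = \frac{\min_{i \neq j} \delta(S_i, S_j)}{\max_{l} \Delta(S_l)} \qquad DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \]

where \(\delta(S_i, S_j)\) is the distance between clusters i and j, \(\Delta(S_l)\) the diameter of cluster l, \(c_i\) the centroid of cluster i, and \(\sigma_i\) the average distance of its points from \(c_i\).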
External criterion: AdjustedRand
Given a set of n elements S = {o_1, ..., o_n} and two partitions to compare, X = {X_1, ..., X_r} and Y = {Y_1, ..., Y_s}:

\[ \mathrm{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}} \]

\[ \mathrm{AdjustedRandIndex} = \frac{\mathrm{RandIndex} - \mathrm{ExpectedIndex}}{\mathrm{MaxIndex} - \mathrm{ExpectedIndex}} \]

We use AdjustedRand as the reference for the best clustering evaluation, and validate the internal criteria against it.
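
In pair-counting form (the standard Hubert-Arabie formulation, which the slide leaves implicit), with \(n_{ij} = |X_i \cap Y_j|\), \(a_i = \sum_j n_{ij}\) and \(b_j = \sum_i n_{ij}\):

\[ \mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}} \]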
Correlation t-test
Pearson correlation between each internal criterion and AdjustedRand, over the evaluations of 120 random k-means configurations.

Average correlations:
AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49

AIC correlates best with the external reference, which is why it is later chosen as the fitness function for tuning.
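
The correlations themselves can be computed with any statistics library; a sketch using Apache Commons Math (an assumption: the slides do not name the library used, and loadScores() is a hypothetical helper):

    import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;

    // aicScores[i] and aRandScores[i] hold the AIC and AdjustedRand
    // evaluations of the i-th of the 120 random k-means configurations.
    double[] aicScores = loadScores("aic");     // hypothetical helper
    double[] aRandScores = loadScores("arand"); // hypothetical helper
    double r = new PearsonsCorrelation().correlation(aicScores, aRandScores);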
Dataset
D31: 3100 vectors, 2 dimensions, 31 clusters
S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean

[Plots: distributions of the AdjustedRand and AIC scores across the 200 runs]

Because the initial centroids are random, repeated runs of the same configuration yield widely spread scores. We can consider the median value!
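
A sketch of this median-over-runs evaluation; evaluate() is a stand-in for whichever scoring function (AIC or AdjustedRand) is in use:

    import java.util.Arrays;
    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.distance.EuclideanDistance;

    // Re-run the same configuration many times; each run draws new random
    // initial centroids, so the scores spread out. Report the median.
    double[] scores = new double[200];
    for (int i = 0; i < scores.length; i++) {
        Dataset[] clusters = new KMeans(31, 20, new EuclideanDistance()).cluster(data);
        scores[i] = evaluate(clusters); // hypothetical scoring helper
    }
    Arrays.sort(scores);
    double median = scores[scores.length / 2];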
Full space evaluation
Each configuration's score is averaged over N = 20 executions.

[Plot: averaged evaluation scores over the full parameter grid]

The global optimum is:
k = 36
DistanceMeasure = Euclidean
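
A sketch of the exhaustive search under these settings; averageScore() is a hypothetical helper that reruns a configuration and averages its scores, and AIC is inverted, so lower is better:

    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.distance.DistanceMeasure;
    import net.sf.javaml.distance.EuclideanDistance;
    import net.sf.javaml.distance.ManhattanDistance;

    DistanceMeasure[] measures = {
        new EuclideanDistance(), new ManhattanDistance()
        // ... remaining candidate measures
    };
    double bestScore = Double.POSITIVE_INFINITY;
    int bestK = -1;
    DistanceMeasure bestDm = null;
    // Every k in [2, 40] crossed with every distance measure,
    // each configuration averaged over 20 executions.
    for (DistanceMeasure dm : measures) {
        for (int k = 2; k <= 40; k++) {
            double score = averageScore(new KMeans(k, 20, dm), data, 20);
            if (score < bestScore) { bestScore = score; bestK = k; bestDm = dm; }
        }
    }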
Genetic Algorithm Tuning
Selection: elitism + roulette wheel.

Crossover (single-point) on two parent parameter vectors:

[x_1, x_2, x_3, x_4, ..., x_m]
[y_1, y_2, y_3, y_4, ..., y_m]
            ↓
[x_1, x_2, x_3, y_4, ..., y_m]
[y_1, y_2, y_3, x_4, ..., x_m]

Mutation:

\[ \Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\mathrm{distance}(k_i, k_j)} \qquad \Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{\mathrm{dist}} - 1} \]

That is, a mutation of k favours values close to the current one, while the distance measure mutates uniformly among the other N_dist - 1 candidates.
Tuning parameters (wired up in the sketch below):
Fitness evaluation: AIC
Mutation probability: 0.5
Crossover probability: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
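
Wired through Watchmaker, these settings translate roughly as follows; KMeansConfig, configFactory, operators and fitnessEvaluator are TunUp-specific stand-ins, named here only for illustration:

    import java.util.Random;
    import org.uncommons.watchmaker.framework.EvolutionEngine;
    import org.uncommons.watchmaker.framework.GenerationalEvolutionEngine;
    import org.uncommons.watchmaker.framework.operators.EvolutionPipeline;
    import org.uncommons.watchmaker.framework.selection.RouletteWheelSelection;
    import org.uncommons.watchmaker.framework.termination.Stagnation;

    EvolutionEngine<KMeansConfig> engine = new GenerationalEvolutionEngine<KMeansConfig>(
            configFactory,                                  // random (k, distance) candidates
            new EvolutionPipeline<KMeansConfig>(operators), // crossover p=0.9, mutation p=0.5
            fitnessEvaluator,                               // AIC, averaged over repeated runs
            new RouletteWheelSelection(),
            new Random());
    // Population of 6 with 1 elite; stop after 5 stagnant generations.
    // AIC is an inverted (non-natural) fitness, hence the 'false'.
    KMeansConfig best = engine.evolve(6, 1, new Stagnation(5, false));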




Relevant results:
➢ The best fitness value always decreases (improves; AIC is inverted)
➢ The mean fitness value trends downward
➢ High standard deviation in one population often produces a better mean in the next one
Results

Test 1: k = 39, Distance Measure = Manhattan
Test 2: k = 33, Distance Measure = RBF Kernel
Test 3: k = 36, Distance Measure = Euclidean

Different results due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Simulation: 10 evolutions, POP_SIZE = 5, no elitism.
Platform: Amazon Elastic Compute Cloud (EC2), 10 micro instances.

Optimal number of servers = POP_SIZE - ELITISM

E[T single evolution] ≤
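
The master/slave layout can be sketched locally with a thread pool standing in for the EC2 workers (KMeansConfig, population and fitnessEvaluator as in the earlier sketch); with one slave per candidate, POP_SIZE - ELITISM servers keep every evaluation in flight at once:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    ExecutorService slaves = Executors.newFixedThreadPool(POP_SIZE - ELITISM);
    List<Future<Double>> fitnesses = new ArrayList<Future<Double>>();
    for (KMeansConfig c : population) {
        // Each candidate's (expensive) fitness evaluation runs on its own slave.
        fitnesses.add(slaves.submit(() -> fitnessEvaluator.getFitness(c, population)));
    }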
Conclusions
We developed, tested and analysed TunUp, an open-source solution for the evaluation, validation and tuning of data clustering algorithms.

Future applications:
➢ Tuning of existing algorithms
➢ Supporting the design of new algorithms
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master/slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!
