Clustering I
Vasileios Megalooikonomou
(based on notes by Jiawei Han and Micheline Kamber)
Agenda
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Examples of Clustering Applications
Image Processing
Economic Science (especially market research), e.g., marketing, insurance
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Types of Data in Cluster Analysis
Data Structures
Data matrix (two modes): n objects, p variables

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (one mode): stores d(i,j), the measured dissimilarity between objects i and j

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Interval-valued Variables
Continuous measurements on a roughly linear scale (e.g., weight, height, temperature, etc.)
Standardize data (to avoid dependence on the measurement units):
Calculate the mean absolute deviation of a variable f with n measurements:
\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
\]
where
\[
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
\]
Calculate the standardized measurement (z-score):
\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]
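The standardization above can be sketched in a few lines of Python (the function name and sample data are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation.

    The mean absolute deviation s_f is less sensitive to outliers
    than the standard deviation, which is why it is preferred here.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores z_if

heights = [160.0, 170.0, 180.0]
standardize(heights)   # approximately [-1.5, 0.0, 1.5]
```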
Minkowski distance:
\[
d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}
\]
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
\[
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
\]
Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i) (symmetry)
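The Minkowski and Manhattan distances can be sketched in Python (helper names are mine):

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

def manhattan(i, j):
    """q = 1: sum of absolute coordinate differences."""
    return minkowski(i, j, 1)

x, y = (1, 2, 3), (4, 6, 3)
manhattan(x, y)      # |1-4| + |2-6| + |3-3| = 7.0
minkowski(x, y, 2)   # q = 2 gives the Euclidean distance: (9 + 16 + 0) ** 0.5 = 5.0
```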
Binary Variables
A contingency table for binary data:

                 Object j
                  1       0      sum
  Object i   1    a       b      a+b
             0    c       d      c+d
           sum   a+c     b+d      p

Simple matching coefficient (for symmetric binary variables):
\[
d(i,j) = \frac{b + c}{a + b + c + d}
\]
Jaccard coefficient (for asymmetric binary variables):
\[
d(i,j) = \frac{b + c}{a + b + c}
\]
Example (three patients):

  Attribute   Patient 1   Patient 2   Patient 3
  Gender         M           F           M
  Fever          Y           Y           Y
  Cough          N           N           P
  Test-1         P           P           N
  Test-2         N           N           N
  Test-3         N           P           N
  Test-4         N           N           N
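Using the asymmetric (Jaccard) dissimilarity d(i,j) = (b+c)/(a+b+c) on the table above, with Y and P coded as 1, N as 0, and the symmetric attribute Gender left out, the pairwise dissimilarities work out as below (a minimal sketch; the variable names are mine):

```python
def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors,
    where a = 1/1 matches, b = 1/0, c = 0/1 (0/0 matches are ignored)."""
    a = sum(1 for x, y in zip(i, j) if (x, y) == (1, 1))
    b = sum(1 for x, y in zip(i, j) if (x, y) == (1, 0))
    c = sum(1 for x, y in zip(i, j) if (x, y) == (0, 1))
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4 coded with Y/P -> 1 and N -> 0;
# Gender is symmetric, so it is left out of the asymmetric measure
p1 = [1, 0, 1, 0, 0, 0]   # Patient 1
p2 = [1, 0, 1, 0, 1, 0]   # Patient 2
p3 = [1, 1, 0, 0, 0, 0]   # Patient 3
round(jaccard_dissimilarity(p1, p2), 2)   # 0.33
round(jaccard_dissimilarity(p1, p3), 2)   # 0.67
round(jaccard_dissimilarity(p2, p3), 2)   # 0.75
```

Patients 1 and 2 are the most similar pair, since they disagree only on Test-3.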
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches (# of variables for which i and j are in the same state), p: total # of variables
\[
d(i,j) = \frac{p - m}{p}
\]
Ordinal Variables
An ordinal variable can be discrete or continuous
Resembles nominal var but order is important, e.g., rank
Can be treated like interval-scaled:
replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}, where ordinal variable f has M_f states
map the range of each variable onto [0, 1] by replacing the rank with
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]
compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt} where A and B are positive constants (e.g., decay of radioactive elements)
Methods:
treat them like interval-scaled variables (not a good choice! why?)
apply a logarithmic transformation y_{if} = log(x_{if}) and treat the result as interval-valued
treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
A database may contain variables of all the preceding types; combine them with a weighted formula:
\[
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
\]
If f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
If f is interval-based: use the normalized distance
If f is ordinal or ratio-scaled: compute ranks r_{if} and z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled
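A minimal sketch of the mixed-type formula, assuming only nominal and interval (or rank-mapped ordinal) variables and illustrative data (all names are mine):

```python
def mixed_dissimilarity(i, j, var_types, spreads):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f.

    var_types[f] is 'nominal' or 'interval'; spreads[f] is the range
    max_h(x_hf) - min_h(x_hf) used to normalize interval variables.
    delta_f is 0 when a value is missing (None here), 1 otherwise.
    """
    num = den = 0.0
    for xf, yf, t, r in zip(i, j, var_types, spreads):
        if xf is None or yf is None:        # delta_ij^(f) = 0: skip variable
            continue
        if t == 'nominal':
            d_f = 0.0 if xf == yf else 1.0  # match / mismatch
        else:
            d_f = abs(xf - yf) / r          # normalized interval distance
        num += d_f
        den += 1.0
    return num / den

# one nominal variable, one interval variable with range 50, and one
# ordinal variable already mapped to [0, 1] via z = (r - 1) / (M - 1)
i = ('red', 70.0, 1.0)
j = ('blue', 60.0, 0.5)
mixed_dissimilarity(i, j,
                    ('nominal', 'interval', 'interval'),
                    (1.0, 50.0, 1.0))
# (1 + 10/50 + 0.5) / 3, approximately 0.567
```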
Clustering Approaches
Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion (k-means, k-medoids)
Hierarchical algorithms: Create a hierarchical decomposition
(agglomerative or divisive) of the set of data (or objects) using some
criterion (CURE, Chameleon, BIRCH)
Density-based: based on connectivity and density functions; grow a cluster as long as the density in the neighborhood exceeds a threshold (DBSCAN, CLIQUE)
Grid-based: based on a multiple-level grid structure (i.e., quantized
space) (STING, CLIQUE)
Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model (EM)
Partitioning Methods
Example
[Figure: a k-means run on a 2-D scatter plot (axes 0 to 10); points are assigned to the nearest cluster mean and the means are recomputed until the assignments stabilize]
Weaknesses of k-means
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
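For concreteness, a plain k-means sketch (not the slides' code; function names and data are illustrative):

```python
import random

def kmeans(points, k, iters=20):
    """Plain k-means sketch: repeatedly assign each point to the
    nearest mean, then recompute each cluster mean. 2-D tuples assumed."""
    means = random.sample(points, k)           # arbitrary initial means
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            c = min(range(k),
                    key=lambda c: (p[0] - means[c][0]) ** 2
                                + (p[1] - means[c][1]) ** 2)
            clusters[c].append(p)
        for c, members in enumerate(clusters):
            if members:                        # an empty cluster keeps its mean
                means[c] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
    return means

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
random.seed(0)
sorted(kmeans(pts, 2))   # one mean near (1.3, 1.3), one near (8.3, 8.3)
```

Note how the weaknesses above show up directly: k is an argument, and the mean computation assumes numeric coordinates.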
[Figure: a k-medoids (PAM) swapping example on 2-D scatter plots (axes 0 to 10), illustrating the cost C_jih of replacing a current medoid i with a candidate h for each object j]
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: agglomerative clustering (AGNES) proceeds bottom-up from Step 0 to Step 4, merging a and b into ab, d and e into de, then cde, and finally abcde; divisive clustering (DIANA) runs the same steps top-down, splitting abcde back into single objects]
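A minimal agglomerative (AGNES-style) sketch using single linkage and Euclidean distance (names and data are illustrative):

```python
def agglomerative(points, target_k):
    """Bottom-up clustering sketch with single linkage: start with every
    point in its own cluster and repeatedly merge the two closest
    clusters until only target_k remain (the termination condition)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))    # merge the closest pair
    return clusters

agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2)
# -> [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```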
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD '96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
Clustering feature: CF = (N, LS, SS), where N is the number of data points, LS = Σ_{i=1}^{N} X_i is their linear sum, and SS = Σ_{i=1}^{N} X_i² is their square sum
Clustering features are additive
[Figure: five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on axes 0 to 10; their clustering feature is CF = (5, (16,30), (54,190))]
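The clustering feature of the five points above, and the additivity property, can be checked in Python (helper names are mine):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, and
    per-dimension square sum of a set of 2-D points."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return n, ls, ss

def merge(cf1, cf2):
    """Additivity: the CF of a union of disjoint sets is the
    component-wise sum of the two CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
clustering_feature(pts)   # (5, (16, 30), (54, 190))
# merging the CFs of any split gives the CF of the whole set
merge(clustering_feature(pts[:2]), clustering_feature(pts[2:]))
```

This additivity is what lets BIRCH absorb points into CF entries incrementally without revisiting the raw data.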
CF Tree (branching factor B = 7, maximum leaf entries L = 6)
[Figure: the root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer; non-leaf nodes hold entries CF1 ... CF5 with child pointers; leaf nodes hold up to L CF entries and are chained together by prev/next pointers]
ROCK: Robust Clustering using Links
Computational complexity: O(n² + n·m_m·m_a + n²·log n), where m_a and m_m are the average and maximum number of neighbors
Basic ideas:
Similarity function and neighbors:
Let T1 = {1, 2, 3}, T2 = {3, 4, 5}
\[
\mathrm{Sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2
\]
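The set-based similarity used here is the Jaccard coefficient of two transactions; a short Python check (the function name is mine):

```python
def jaccard(t1, t2):
    """Sim(T1, T2) = |T1 intersect T2| / |T1 union T2| for two sets."""
    return len(t1 & t2) / len(t1 | t2)

jaccard({1, 2, 3}, {3, 4, 5})   # |{3}| / |{1,2,3,4,5}| = 0.2
```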
CHAMELEON
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han, and V. Kumar '99 (O(n²))
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
[Figure: the CHAMELEON framework: construct a sparse graph from the data set, partition the graph, then merge the partitions into the final clusters]