(PML ITS - Week 10) - Clustering
Clustering
https://pubdata.tistory.com/141
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function, typically metric: d(i, j)
• The definitions of distance functions are usually rather different for interval-scaled,
boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables based on applications and data
semantics
• Quality of clustering:
• There is usually a separate “quality” function that measures the “goodness” of a cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
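To make the metric idea concrete, here is a small Python sketch (not from the slides; the attribute values and the weights are made up for illustration) of a few of the dissimilarity measures above: Euclidean distance for interval-scaled variables, simple matching for categorical variables, and a weighted variant.

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Interval-scaled (numeric) variables: Euclidean distance between objects i and j
i = np.array([0.7, 0.0])
j = np.array([1.0, 0.4])
print(euclidean(i, j))                               # sqrt(0.3**2 + 0.4**2) = 0.5

# Categorical variables: fraction of mismatching attributes (simple matching dissimilarity)
a = ["red", "small", "round"]
b = ["red", "large", "round"]
print(sum(x != y for x, y in zip(a, b)) / len(a))    # 1/3

# Weighted Euclidean distance: weights encode the importance of each variable
w = np.array([2.0, 1.0])
print(np.sqrt((w * (i - j) ** 2).sum()))
```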
The use of Clustering: Clustering as a Preprocessing Tool
• Summarization:
• Preprocessing for regression, PCA, classification,
and association analysis
• Compression:
• Image processing: vector quantization
• Finding K-nearest Neighbors
• Localizing search to one or a small number of
clusters
• Outlier detection
• Outliers are often viewed as those “far away” from
any cluster
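As a rough illustration of the outlier-detection use above, the following sketch (assuming scikit-learn and numpy; the synthetic data and the 3-standard-deviation cutoff are arbitrary choices for the example) flags points that lie far from every k-means centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.1, size=(50, 2)),
               rng.normal([1, 1], 0.1, size=(50, 2)),
               [[3.0, 3.0]]])                       # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of every point to the centroid of its own cluster
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points that are unusually far from any cluster center
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
print(np.where(dist_to_centroid > threshold)[0])    # should flag the last point (index 100)
```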
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-
level hierarchical partitioning is desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one
region) vs. non-exclusive (e.g., one document may
belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean, road network, vector)
vs. connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces
(often in high-dimensional clustering)
Requirements and Challenges
• Scalability
• Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture
of these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
Major Clustering Approaches (I)
• Partitioning approach:
• Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data
(or objects) using some criterion
• Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
• based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
• Model-based:
• A model is hypothesized for each of the clusters, and the method tries to
find the best fit of the data to the given model
• Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
• Based on the analysis of frequent patterns
• Typical methods: p-Cluster
• User-guided or constraint-based:
• Clustering by considering user-specified or application-
specific constraints
• Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
• Objects are often linked together in various ways
• Massive links can be used to cluster objects: SimRank,
LinkClus
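The sketch below (assuming scikit-learn; the toy data reuses the five points from the later worked example) shows how several of these approaches map onto off-the-shelf estimators: partitioning (KMeans), hierarchical (AgglomerativeClustering), density-based (DBSCAN), and model-based EM (GaussianMixture). Grid-based and link-based methods have no direct scikit-learn counterpart.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

# Partitioning approach
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
# Hierarchical approach (agglomerative, single link)
print(AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X))
# Density-based approach
print(DBSCAN(eps=0.5, min_samples=2).fit_predict(X))
# Model-based approach (EM for a Gaussian mixture)
print(GaussianMixture(n_components=2, random_state=0).fit_predict(X))
```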
Learning Check
• Cluster analysis finds similarities between data according to the characteristics found in
the data and groups similar data objects into clusters. (True / False)
• What is a good clustering?
• What are the considerations for cluster analysis?
• Mention two of the major clustering approaches!
Cluster Analysis
Simple Example
Iteration 1

point  x1   x2
1      0.7  0.0
2      1.0  0.4
3      0.0  0.7
4      0.3  0.6
5      0.4  1.0

[Scatter plot of the five points on 0 to 1 axes]

Centroid 1 [(1, 2, 4)]:
= [(0.7 + 1.0 + 0.3) / 3, (0.0 + 0.4 + 0.6) / 3]
= [2/3, 1/3]

Centroid 2 [(3, 5)]:
= [(0.0 + 0.4) / 2, (0.7 + 1.0) / 2]
= [0.2, 0.85]

E = Σ_{i=1..k} Σ_{p ∈ C_i} (p – c_i)²

Cluster 1:
= (0.7 – 2/3)² + (1.0 – 2/3)² + (0.3 – 2/3)² + (0.0 – 1/3)² + (0.4 – 1/3)² + (0.6 – 1/3)²
= 0.0016 + 0.1089 + 0.1296 + 0.1089 + 0.0049 + 0.0729 = 0.4268

Cluster 2:
= (0.0 – 0.2)² + (0.4 – 0.2)² + (0.7 – 0.85)² + (1.0 – 0.85)²
= 0.04 + 0.04 + 0.0225 + 0.0225 = 0.125

Total: 0.4268 + 0.125 = 0.5518
Simple Example
Iteration 2

point  x1   x2
1      0.7  0.0
2      1.0  0.4
3      0.0  0.7
4      0.3  0.6
5      0.4  1.0

[Scatter plot of the five points on 0 to 1 axes with the new assignment]

Centroid 1 [(1, 2)]:
= [(0.7 + 1.0) / 2, (0.0 + 0.4) / 2]
= [0.85, 0.2]

Centroid 2 [(3, 4, 5)]:
= [(0.0 + 0.3 + 0.4) / 3, (0.7 + 0.6 + 1.0) / 3]
= [0.23, 0.76]

E = Σ_{i=1..k} Σ_{p ∈ C_i} (p – c_i)²

Cluster 1:
= (0.7 – 0.85)² + (1.0 – 0.85)² + (0.0 – 0.2)² + (0.4 – 0.2)²
= 0.0225 + 0.0225 + 0.04 + 0.04 = 0.125

Cluster 2:
= (0.0 – 0.23)² + (0.3 – 0.23)² + (0.4 – 0.23)² + (0.7 – 0.76)² + (0.6 – 0.76)² + (1.0 – 0.76)²
= 0.0529 + 0.0049 + 0.0289 + 0.0036 + 0.0256 + 0.0576 = 0.1735

Total: 0.125 + 0.1735 = 0.2985

Since this value is lower than in the previous iteration (0.5518), we select this clustering (select the minimum).
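As a cross-check of the hand calculation, here is a minimal numpy sketch (not part of the slides) that computes the SSE of both assignments directly; it uses the exact centroids, so the totals differ slightly from the rounded values above.

```python
import numpy as np

points = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

def sse(assignment):
    """Sum of squared distances of every point to its own cluster centroid."""
    total = 0.0
    for cluster in assignment:
        members = points[list(cluster)]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Iteration 1: clusters {1, 2, 4} and {3, 5} (1-based point ids, 0-based indices below)
print(sse([(0, 1, 3), (2, 4)]))   # ~0.558
# Iteration 2: clusters {1, 2} and {3, 4, 5}
print(sse([(0, 1), (2, 3, 4)]))   # ~0.298 -> lower, so keep this assignment
```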
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: Often terminates at a local optimum.
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance (there are ways to automatically determine
the best k; see Hastie et al., 2009, and the sketch below)
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
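One practical way to choose k, sketched below under the assumption that scikit-learn is available (the synthetic three-blob data is only for illustration): fit k-means for several values of k and keep the k with the highest average silhouette score. This is also the criterion used in the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.1, size=(30, 2)) for loc in ([0, 0], [1, 1], [0, 1])])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k =", best_k)   # expected to peak at k = 3 for this data
```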
What Is the Problem of the K-Means Method?
[Figure: two side-by-side scatter plots on 0 to 10 axes illustrating the problem]
Simple Example

point  dimension1  dimension2
1      0.7         0.0
2      1.0         0.4
3      0.0         0.7
4      0.3         0.6
5      0.4         1.0

Let n = 5, and we will cluster with k = 2.

[Three scatter plots of the five points on 0 to 1 axes, showing successive medoid choices]

Iterate until we reach the minimum total distance from a medoid for a particular k.
PAM: A Typical K-Medoids Algorithm
• Arbitrarily choose k objects as the initial medoids
• Assign each remaining object to the nearest medoid
• Do loop, until no change:
• Compute the total cost of swapping a medoid O with a randomly selected non-medoid object O_random
• Swap O and O_random if the quality is improved
[Figure: a sequence of scatter plots on 0 to 10 axes illustrating these steps (Total Cost = 20)]
The K-Medoid Clustering Method
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the
non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large data sets (due to
the computational complexity)
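A rough PAM-style sketch in plain numpy follows (not an optimized or library implementation; the initial medoids are an arbitrary choice): start from initial medoids and keep applying a medoid/non-medoid swap as long as it lowers the total distance, as described above.

```python
import numpy as np
from itertools import product

# The five points from the simple example
X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances

def total_cost(medoids):
    """Sum over all objects of the distance to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

medoids = [0, 1]             # arbitrary initial medoids (points 1 and 2)
improved = True
while improved:
    improved = False
    non_medoids = [i for i in range(len(X)) if i not in medoids]
    # Try medoid / non-medoid swaps; accept the first one that lowers the cost.
    for pos, o in product(range(len(medoids)), non_medoids):
        candidate = medoids.copy()
        candidate[pos] = o
        if total_cost(candidate) < total_cost(medoids):
            medoids, improved = candidate, True
            break

print("medoids:", [m + 1 for m in medoids], "total cost:", round(total_cost(medoids), 3))
```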
Learning Check
• What is the difference between k-means and k-medoids?
• What is the weakness of k-means?
• What is the weakness of k-medoids?
Cluster Analysis
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., S-Plus
• Use the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
Dendrogram: Shows How Clusters are Merged
Decomposes data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram. A clustering of the data objects is obtained by
cutting the dendrogram at the desired level: each connected component then
forms a cluster.
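A minimal sketch (assuming scipy and matplotlib) that builds a single-link AGNES-style hierarchy for the five points of the simple example and draws the dendrogram; fcluster cuts the tree at the desired level to obtain a flat clustering.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# The five points from the simple example
X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

# Single-link agglomerative hierarchy: repeatedly merge the least dissimilar clusters
Z = linkage(X, method="single", metric="euclidean")

print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters

dendrogram(Z, labels=[1, 2, 3, 4, 5])           # tree of nested clusters
plt.ylabel("dissimilarity")
plt.show()
```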
DIANA (Divisive Analysis)
• The divisive counterpart of AGNES: starts with all objects in one cluster and
splits it step by step until each object eventually forms its own cluster
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e.,
dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e.,
dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki,
Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
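The linkage choices above can be compared directly; the sketch below (assuming scipy, which offers no medoid linkage) cuts each hierarchy into two clusters so the effect of the linkage criterion is visible.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # hierarchy under this linkage
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(f"{method:>9}: {labels}")
```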
Simple Example
point  dimension1  dimension2
1      0.7         0.0
2      1.0         0.4
3      0.0         0.7
4      0.3         0.6
5      0.4         1.0

[Scatter plot of the five points on 0 to 1 axes]

Next, (3, 5) and 4 are grouped:

            (1, 2)   (3, 4, 5)   6
(1, 2)        0        10        8
(3, 4, 5)               0        8.5
6                                0

Next, (1, 2) and 6 are grouped:

            (1, 2, 6)   (3, 4, 5)
(1, 2, 6)       0          8.5
(3, 4, 5)                   0

[Dendrogram with leaf order 1, 2, 6, 4, 5, 3 and merge heights on a 0 to 8 axis]
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
• Do not scale well: time complexity of at least O(n²), where n is the total number of objects
• BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
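A minimal BIRCH sketch (assuming scikit-learn; the threshold, chunk sizes, and synthetic blobs are arbitrary choices): the CF-tree is grown incrementally via partial_fit, which is what allows BIRCH to scale beyond memory-resident data.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(threshold=0.2, n_clusters=3)

# Feed the data in chunks: the CF-tree is built and refined incrementally,
# so the whole data set never has to be held in memory at once.
for _ in range(10):
    chunk = np.vstack([rng.normal(loc, 0.1, size=(20, 2))
                       for loc in ([0, 0], [1, 1], [0, 1])])
    model.partial_fit(chunk)

# New points near (0, 0) should all receive the same cluster label
print(model.predict(rng.normal([0, 0], 0.1, size=(5, 2))))
```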
Learning Check
• Explain briefly the distances between clusters used in hierarchical
clustering!
• What does this dendrogram mean?
Learning Check
• Use Iris data for the example!
Assignment
• Select “Auto MPG”
https://archive.ics.uci.edu/ml/datasets/auto+mpg
Question
• Select the numerical features
• 1. Use k-means
• a. Select the best k! (based on the Silhouette Score)
• b. Explain the meaning of the clustering!
• 2. Use Hierarchical Clustering
• a. Using the Euclidean Distance!
• b. Select the best linkage approach for forming the clusters! (Single, Complete, Average, etc.)
• c. Explain the meaning of the clustering!
• 3. Categorize mpg values higher than the average as “High”, otherwise
“Low”.
• a. Do Classification (Decision Tree)!
• b. Explain the Tree!
• 4. (Self-Supervised Learning) Using the result of k-means, do Classification!
• a. Select the proper features!
• b. Do Decision Tree!
• c. Do you think the tree is the same as Problem 3 (b)?
Summary
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary