(PML ITS - Week 10) - Clustering

Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters based on similarities. There are several major approaches to cluster analysis including partitioning methods which construct clusters by optimizing cluster quality measures, hierarchical methods which create hierarchical decompositions of clusters, and density-based methods which identify clusters based on density connections between data points. Good clustering produces high intra-cluster similarity and low inter-cluster similarity. Evaluation measures, data type considerations, and scalability to large datasets are important factors to consider for cluster analysis.


Predictive Analytics and Machine Learning

Clustering

Bernardo Nugroho Yahya


Email: bernardo (at) hufs.ac.kr
Cluster Analysis

• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
What is Cluster Analysis?
• Cluster: A collection of data objects
  • similar (or related) to one another within the same group
  • dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
  • Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
What is Cluster Analysis?
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
• Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms
Pseudo-label
• Self-supervised learning is similar to unsupervised learning.
• It starts by pseudo-labeling (creating labels based on similar properties) and then follows a supervised learning approach.
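To make the idea concrete, here is a minimal illustrative sketch (not from the slides): cluster unlabeled data with k-means, treat the cluster assignments as pseudo-labels, and then train an ordinary supervised classifier on them. The synthetic data and all parameter choices are assumptions for the example.

```python
# Minimal pseudo-labeling sketch: cluster first, then learn a supervised model
# on the cluster assignments. Data and parameters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # unlabeled group A
               rng.normal(5, 1, (50, 2))])     # unlabeled group B

# Step 1: unsupervised clustering produces pseudo-labels
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised learning on the pseudo-labels
clf = DecisionTreeClassifier(random_state=0).fit(X, pseudo_labels)
print(clf.predict([[0.5, 0.5], [5.2, 4.8]]))   # predicted pseudo-classes
```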
Data Collection: My_Collection

Commonly, clustering can solve a problem as unsupervised learning. We can store features in a database without any labels.

For example:
- Phone ID
- Battery: hours the battery lasts
- Camera: camera resolution (pixels)

Phone ID   Battery   Camera
1          12        8
2          26        16
3          9         9
4          8         7
5          22        12
6          10        9
7          24        15
8          11        8.5
9          23        17
10         21        14
Using Distance Measure
We can measure the possible clusters based on the distance.

[Scatter plot of Camera (vertical axis) vs. Battery (horizontal axis): phones 1, 3, 4, 6, 8 fall in Cluster 1 and phones 2, 5, 7, 9, 10 fall in Cluster 2.]
A New Class Defined by the Distance

Using the stored features in the database (My_Collection), we can measure the distances and create the clusters.

The clustering problem can now be expressed as:
• Given the data in the database (My_Collection), create the possible class labels according to the similarity (distances) of the features.

Phone ID   Battery   Camera   Class
1          12        4        Cluster 1
2          26        16       Cluster 2
3          9         9        Cluster 1
4          8         7        Cluster 1
5          22        12       Cluster 2
6          10        6        Cluster 1
7          24        15       Cluster 2
8          8         5        Cluster 1
9          23        17       Cluster 2
10         21        14       Cluster 2
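A brief sketch of how the table above could be clustered in practice with scikit-learn; the feature values are copied from the table and k = 2 mirrors the example, while the remaining settings are illustrative assumptions.

```python
# Cluster the My_Collection phones into two groups by battery life and camera.
import numpy as np
from sklearn.cluster import KMeans

# [battery (hours), camera] for phones 1..10, as in the table above
X = np.array([[12, 4], [26, 16], [9, 9], [8, 7], [22, 12],
              [10, 6], [24, 15], [8, 5], [23, 17], [21, 14]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for phone_id, label in enumerate(km.labels_, start=1):
    # Note: KMeans label numbering (0/1) is arbitrary and may not match the
    # slide's "Cluster 1"/"Cluster 2" names.
    print(f"Phone {phone_id}: Cluster {label + 1}")
print("Centroids:", km.cluster_centers_)
```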
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters:
  • high intra-class similarity: cohesive within clusters
  • low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
  • the similarity measure used by the method,
  • its implementation, and
  • its ability to discover some or all of the hidden patterns
https://pubdata.tistory.com/141
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
  • Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  • The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  • Weights should be associated with different variables based on applications and data semantics
• Quality of clustering:
  • There is usually a separate "quality" function that measures the "goodness" of a cluster.
  • It is hard to define "similar enough" or "good enough"
  • The answer is typically highly subjective

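One commonly used "quality" function of this kind is the silhouette coefficient. A hedged sketch follows; the synthetic blobs and the range of k values are assumptions for illustration.

```python
# Score candidate clusterings with the silhouette coefficient (higher is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three illustrative, well-separated groups of 2-D points.
X = np.vstack([rng.normal(loc, 0.5, (30, 2)) for loc in (0, 4, 8)])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```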
The Use of Clustering: Clustering as a Preprocessing Tool
• Summarization:
  • Preprocessing for regression, PCA, classification, and association analysis
• Compression:
  • Image processing: vector quantization
• Finding K-nearest neighbors:
  • Localizing search to one or a small number of clusters
• Outlier detection:
  • Outliers are often viewed as those "far away" from any cluster

Considerations for Cluster Analysis
• Partitioning criteria
  • Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
  • Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
  • Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
  • Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
• Scalability
  • Clustering all the data instead of only samples
• Ability to deal with different types of attributes
  • Numerical, binary, categorical, ordinal, linked, and mixtures of these
• Constraint-based clustering
  • Users may give inputs on constraints
  • Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
  • Discovery of clusters with arbitrary shape
  • Ability to deal with noisy data
  • Incremental clustering and insensitivity to input order
  • High dimensionality
Major Clustering Approaches (I)
• Partitioning approach:
  • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  • Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  • Create a hierarchical decomposition of the set of data (or objects) using some criterion
  • Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  • Based on connectivity and density functions
  • Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
  • Based on a multiple-level granularity structure
  • Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
• Model-based:
  • A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
  • Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
  • Based on the analysis of frequent patterns
  • Typical methods: p-Cluster
• User-guided or constraint-based:
  • Clustering by considering user-specified or application-specific constraints
  • Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
  • Objects are often linked together in various ways
  • Massive links can be used to cluster objects: SimRank, LinkClus
Learning Check
• Cluster analysis finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters. (True / False)
• What is a good clustering?
• What are the considerations for cluster analysis?
• Mention two of the major clustering approaches!
Cluster Analysis

• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci):

  E = Σ_{i=1}^{k} Σ_{p ∈ Ci} (p − ci)²

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimal: exhaustively enumerate all partitions
  • Heuristic methods: k-means and k-medoids algorithms
    • k-means (MacQueen '67, Lloyd '57/'82): Each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
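The four steps map almost directly to code. Below is a compact, illustrative NumPy sketch (not the lecture's implementation) that follows them literally: random initial partition, centroid computation, reassignment to the nearest centroid, and repetition until the assignments stop changing.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 1: partition objects into k nonempty subsets (random initial labels)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute seed points as centroids of the current partitions
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment does not change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Example call on the five points used in the simple example below.
labels, centroids = kmeans(np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7],
                                     [0.3, 0.6], [0.4, 1.0]]), k=2)
```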
Simple Example

     x1    x2
1    0.7   0.0
2    1.0   0.4
3    0.0   0.7
4    0.3   0.6
5    0.4   1.0

[Scatter plot of the five points, x1 on the horizontal axis and x2 on the vertical axis]

Obtained from Carnegie Mellon University


Simple Example: Iteration 1

Let n = 5, and we will cluster with k = 2.

     x1    x2
1    0.7   0.0
2    1.0   0.4
3    0.0   0.7
4    0.3   0.6
5    0.4   1.0

Centroid 1 [(1, 2, 4)]:
= [(0.7 + 1.0 + 0.3) / 3, (0.0 + 0.4 + 0.6) / 3]
= [2/3, 1/3]

Centroid 2 [(3, 5)]:
= [(0.0 + 0.4) / 2, (0.7 + 1.0) / 2]
= [0.2, 0.85]

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} (p − ci)²

Cluster 1: (0.7 − 2/3)² + (1.0 − 2/3)² + (0.3 − 2/3)² + (0.0 − 1/3)² + (0.4 − 1/3)² + (0.6 − 1/3)²
           = 0.0016 + 0.1089 + 0.1296 + 0.1089 + 0.0049 + 0.0729 = 0.4268
Cluster 2: (0.0 − 0.2)² + (0.4 − 0.2)² + (0.7 − 0.85)² + (1.0 − 0.85)²
           = 0.04 + 0.04 + 0.0225 + 0.0225 = 0.125
Total E:   0.4268 + 0.125 = 0.5518
Simple Example: Iteration 2

Centroid 1 [(1, 2)]:
= [(0.7 + 1.0) / 2, (0.0 + 0.4) / 2]
= [0.85, 0.2]

Centroid 2 [(3, 4, 5)]:
= [(0.0 + 0.3 + 0.4) / 3, (0.7 + 0.6 + 1.0) / 3]
= [0.23, 0.76]

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} (p − ci)²

Cluster 1: (0.7 − 0.85)² + (1.0 − 0.85)² + (0.0 − 0.2)² + (0.4 − 0.2)²
           = 0.0225 + 0.0225 + 0.04 + 0.04 = 0.125
Cluster 2: (0.0 − 0.23)² + (0.3 − 0.23)² + (0.4 − 0.23)² + (0.7 − 0.76)² + (0.6 − 0.76)² + (1.0 − 0.76)²
           = 0.0529 + 0.0049 + 0.0289 + 0.0036 + 0.0256 + 0.0576 = 0.1735
Total E:   0.125 + 0.1735 = 0.2985

Since this value is lower than in the previous iteration, we select this clustering (select the minimum). The assignments no longer change, so the algorithm stops.
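For reference, the same five-point example can be run with scikit-learn. This is a hedged sketch that initializes at the iteration-1 centroids from the slide so the run follows the same path; `inertia_` corresponds to the total sum of squared errors E.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])
init = np.array([[2/3, 1/3], [0.2, 0.85]])   # iteration-1 centroids from the slide

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # expected grouping: {1, 2} and {3, 4, 5}
print(km.cluster_centers_)   # expected ~[0.85, 0.2] and ~[0.23, 0.77]
print(km.inertia_)           # total within-cluster SSE after convergence (≈ 0.30)
```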
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  • Comparison: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: Often terminates at a local optimum.
• Weakness
  • Applicable only to objects in a continuous n-dimensional space
    • Use the k-modes method for categorical data
    • In comparison, k-medoids can be applied to a wide range of data
  • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  • Sensitive to noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes

What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
  • An object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

[Figure: two example scatter plots on a 0–10 grid]

Simple Example

     dimension1   dimension2
1    0.7          0.0
2    1.0          0.4
3    0.0          0.7
4    0.3          0.6
5    0.4          1.0

Let n = 5, and we will cluster with k = 2.

[Three scatter plots of the five points showing successive medoid choices: iterate until we reach the minimum distance from a medoid for a particular k.]
PAM: A Typical K-Medoids Algorithm (k = 2)
• Arbitrarily choose k objects as the initial medoids (Total Cost = 20 in the figure)
• Assign each remaining object to the nearest medoid
• Randomly select a non-medoid object, O_random
• Compute the total cost of swapping a medoid O with O_random (Total Cost = 26 in the figure)
• If the quality is improved, swap O and O_random
• Do loop until no change

[Figure: a sequence of scatter plots on a 0–10 grid illustrating the initial medoid choice, the assignment of objects, and the swap evaluation]

The K-Medoids Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids) (Kaufmann & Rousseeuw, 1987)
  • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  • PAM works effectively for small data sets, but does not scale well to large data sets (due to the computational complexity)
• Efficiency improvements on PAM
  • CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  • CLARANS (Ng & Han, 1994): Randomized re-sampling

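To make the medoid/total-cost idea concrete, here is a small illustrative sketch (not the lecture's code) that runs a brute-force PAM-style search for k = 2 on the five-point example: every pair of objects is tried as the medoid set and the pair with the lowest total distance wins. Exhaustive evaluation plays the role of trying all possible swaps and is only feasible for a tiny data set.

```python
import numpy as np
from itertools import combinations

# Five objects from the earlier simple example.
X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

def total_cost(medoid_idx):
    # Total distance of every object to its nearest medoid (medoids cost 0).
    medoids = X[list(medoid_idx)]
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

# Exhaustively evaluate every candidate medoid pair (illustrative only).
best = min(combinations(range(len(X)), 2), key=total_cost)
print("best medoids (1-based):", [i + 1 for i in best],
      "total cost:", round(total_cost(best), 3))
```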
Learning Check
• What is the difference between k-means and k-medoids?
• What is the weakness of k-means?
• What is the weakness of k-medoids?
Cluster Analysis

• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
Hierarchical Clustering
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., S-Plus
• Uses the single-link method and the dissimilarity matrix
• Merges nodes that have the least dissimilarity
• Goes on in a non-descending fashion
• Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters Are Merged
• Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., S-Plus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
  • Medoid: a chosen, centrally located object in the cluster

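To see how the choice of inter-cluster distance changes the result, here is an illustrative sketch (the settings are assumptions for the demonstration) using scikit-learn's AgglomerativeClustering with single, complete, and average linkage on the five-point data used in these slides.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Five points from the simple example that follows.
X = np.array([[0.7, 0.0], [1.0, 0.4], [0.0, 0.7], [0.3, 0.6], [0.4, 1.0]])

for link in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(f"{link:>8} link: {labels}")
```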
Simple Example

     dimension1   dimension2
1    0.7          0.0
2    1.0          0.4
3    0.0          0.7
4    0.3          0.6
5    0.4          1.0

Using the sum of squared distances, we can get the distance matrix:

     1      2      3      4      5
1    0      0.25   0.98   0.52   1.09
2    0.25   0      1.09   0.53   0.72
3    0.98   1.09   0      0.10   0.25
4    0.52   0.53   0.10   0      0.17
5    1.09   0.72   0.25   0.17   0

Ex. Distance between 1 and 2:
(0.7 − 1.0)² + (0.0 − 0.4)² = 0.09 + 0.16 = 0.25

Obtained from Carnegie Mellon University


Simple Example (1)
Below is a distance matrix. Using single link, select the minimum distance and group those objects together. Since the minimum distance is between 3 and 5, we group them.

      1     2     3     4     5     6
1     0     4     13    24    12    8
2           0     10    22    11    10
3                 0     7     3     9
4                       0     6     18
5                             0     8.5
6                                   0

       1     2     (3,5)  4     6
1      0     4     12     24    8
2            0     10     22    10
(3,5)              0      6     8.5
4                         0     18
6                               0

The distances to the new group are calculated by taking the minimum:
d1,(3,5) = min{d13, d15} = 12
d2,(3,5) = min{d23, d25} = 10
d4,(3,5) = min{d43, d45} = 6
d6,(3,5) = min{d63, d65} = 8.5
Simple Example (2)
Using single link, select the minimum distance and group the objects together. The minimum distance is now between 1 and 2, so we group them.

       1     2     (3,5)  4     6
1      0     4     12     24    8
2            0     10     22    10
(3,5)              0      6     8.5
4                         0     18
6                               0

The distances to the new group are calculated:
d(1,2),(3,5) = min{d13, d23, d15, d25} = 10
d(1,2),4 = min{d14, d24} = 22
d(1,2),6 = min{d16, d26} = 8

        (1,2)  (3,5)  4     6
(1,2)   0      10     22    8
(3,5)          0      6     8.5
4                     0     18
6                           0

Next, (3,5) and 4 are grouped (minimum distance 6):

          (1,2)  (3,4,5)  6
(1,2)     0      10       8
(3,4,5)          0        8.5
6                         0

Next, (1,2) and 6 are grouped (minimum distance 8), and the final merge of (1,2,6) with (3,4,5) occurs at 8.5:

            (1,2,6)  (3,4,5)
(1,2,6)     0        8.5
(3,4,5)              0

[Dendrogram: vertical axis = threshold distance; leaves ordered 1, 2, 6, 4, 5, 3; merges at distances 3 (3–5), 4 (1–2), 6 ((3,5)–4), 8 ((1,2)–6), and 8.5 (final merge)]
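The same merge sequence can be reproduced with SciPy. In this sketch (assuming SciPy is available), the condensed distance vector lists the pairwise distances from the first matrix above, and each row of the linkage output records which clusters merged and at what distance (3.0, 4.0, 6.0, 8.0, 8.5).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed pairwise distances for objects 1..6, upper triangle row by row:
# (1,2) (1,3) (1,4) (1,5) (1,6) (2,3) (2,4) (2,5) (2,6) (3,4) (3,5) (3,6) (4,5) (4,6) (5,6)
d = np.array([4, 13, 24, 12, 8, 10, 22, 11, 10, 7, 3, 9, 6, 18, 8.5])

Z = linkage(d, method="single")
print(Z)                                       # merge heights: 3.0, 4.0, 6.0, 8.0, 8.5
print(fcluster(Z, t=7, criterion="distance"))  # cut below 8: clusters {1,2}, {3,4,5}, {6}
```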
Extensions to Hierarchical Clustering
• Major weaknesses of agglomerative clustering methods
  • Can never undo what was done previously
  • Do not scale well: time complexity of at least O(n²), where n is the number of total objects
• Integration of hierarchical & distance-based clustering
  • BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  • CHAMELEON (1999): hierarchical clustering using dynamic modeling

Learning Check
• Explain briefly the distances between clusters used in hierarchical clustering!
• What does this dendrogram mean?
Learning Check
• Use Iris data for the example!
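A possible starting point for this check, hedged as one of many ways to do it: load the Iris data from scikit-learn and compare a k = 3 hierarchical clustering against the known species labels.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(iris.data)
print("Adjusted Rand index vs. species:",
      round(adjusted_rand_score(iris.target, labels), 3))
```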
Assignment
• Select "Auto MPG"
  https://archive.ics.uci.edu/ml/datasets/auto+mpg
Questions
• Select the numerical features
• 1. Use k-means
  • a. Select the best k (based on the Silhouette Score)!
  • b. Explain the meaning of the clustering!
• 2. Use hierarchical clustering
  • a. Use the Euclidean distance!
  • b. Select the best approach to produce the clusters (single, complete, average, etc.)!
  • c. Explain the meaning of the clustering!
• 3. Let mpg values higher than the average be categorized as "High", otherwise "Low".
  • a. Do classification (Decision Tree)!
  • b. Explain the tree!
• 4. (Self-Supervised Learning) Using the result of k-means, do classification!
  • a. Select the proper features!
  • b. Build a Decision Tree!
  • c. Do you think the tree is the same as in Problem 3 (b)?
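A hedged starter sketch for question 1a. The download URL, the column names, and the pandas read settings are assumptions about the UCI file layout (the car name field is dropped via the tab that precedes it), and the range of k values is only illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Assumed UCI file layout: space-separated values, '?' marks missing horsepower,
# and the quoted car name follows a tab (dropped here with comment="\t").
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
cols = ["mpg", "cylinders", "displacement", "horsepower", "weight",
        "acceleration", "model_year", "origin"]
df = pd.read_csv(url, names=cols, na_values="?", comment="\t",
                 sep=" ", skipinitialspace=True).dropna()

# Standardize the numerical features before clustering.
X = StandardScaler().fit_transform(df[["mpg", "displacement", "horsepower",
                                        "weight", "acceleration"]])

# 1a. Pick k by the Silhouette Score (higher is better).
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```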
Summary
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary
