
Module 4: Clustering
Introduction to Clustering and different methods of clustering

1. Overview
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to one another than to data points in other groups.
In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to look
at the details of each customer and devise a unique business strategy for each one of them? Definitely
not. But what you can do is cluster all of your customers into, say, 10 groups based on their
purchasing habits and use a separate strategy for the customers in each of these 10 groups. And this is
what we call clustering.

Now that we understand what clustering is, let’s take a look at the types of clustering.

2. Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:

- Hard Clustering: In hard clustering, each data point either belongs to a cluster
completely or not. For example, in the scenario above, each customer is put into exactly one
of the 10 groups.
- Soft Clustering: In soft clustering, instead of putting each data point into a single
cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For
example, in the scenario above, each customer is assigned a probability of belonging to each
of the 10 clusters of the retail store. A short sketch contrasting the two appears after this list.
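
As a minimal sketch of the two types (assuming scikit-learn is available; the one-dimensional spending data is made up for illustration), a Gaussian mixture model can produce both hard labels and soft membership probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up 1-D "spending score" data: two loose groups of customers
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Hard clustering: each point gets exactly one cluster label
print(gmm.predict(X))

# Soft clustering: each point gets a probability for every cluster
print(gmm.predict_proba(X))  # rows sum to 1
```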

3. Types of clustering algorithms


Since the task of clustering is subjective, there are many means that can be used to achieve this
goal. Every methodology follows a different set of rules for defining the ‘similarity’ among
data points. In fact, there are more than 100 known clustering algorithms, but only a few are
widely used. Let’s look at them in detail:

- Connectivity models: As the name suggests, these models are based on the notion that
data points closer together in data space exhibit more similarity to each other than data
points lying farther away. These models can follow two approaches. In the first approach,
they start by classifying all data points into separate clusters and then aggregate them as
the distance decreases. In the second approach, all data points are classified into a single
cluster and then partitioned as the distance increases. Also, the choice of distance function
is subjective. These models are very easy to interpret but lack the scalability to handle big
datasets. Examples of these models are the hierarchical clustering algorithm and its variants.

- Centroid models: These are iterative clustering algorithms in which the notion of
similarity is derived from the closeness of a data point to the centroid of the clusters. The K-
Means clustering algorithm is a popular algorithm that falls into this category. In these
models, the number of clusters required at the end has to be specified beforehand, which
makes it important to have prior knowledge of the dataset. These models run iteratively to
find a local optimum.

- Distribution models: These clustering models are based on the notion of how probable it
is that all data points in a cluster belong to the same distribution (for example, the normal,
i.e. Gaussian, distribution). These models often suffer from overfitting. A popular example of
these models is the expectation-maximization algorithm, which uses multivariate normal
distributions.

- Density models: These models search the data space for regions of varying density of data
points. They isolate the different density regions and assign the data points within the same
region to the same cluster. Popular examples of density models are DBSCAN and OPTICS
(a short DBSCAN sketch follows this list).
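
As a minimal sketch of a density model (assuming scikit-learn is available; the data and the eps/min_samples values are made up for illustration, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense blobs plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(4.0, 0.3, (20, 2)),
               rng.uniform(-2.0, 6.0, (5, 2))])

# eps is the neighborhood radius, min_samples the density threshold
labels = DBSCAN(eps=0.6, min_samples=4).fit_predict(X)

# Points in low-density regions get the label -1 (noise)
print(labels)
```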

Now I will be taking you through two of the most popular clustering algorithms in detail – K
Means clustering and Hierarchical clustering. Let’s begin.

4. K Means Clustering
K Means is an iterative clustering algorithm that converges to a local optimum (it minimizes the
within-cluster sum of squared distances). This algorithm works in the following steps:

1. Specify the desired number of clusters K: let us choose K=2 for these 5 data points in 2-
D space.
2. Randomly assign each data point to a cluster: let’s assign three points to cluster 1, shown
using red color, and two points to cluster 2, shown using grey color.

3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using
a red cross, and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: note that the data point at the
bottom, although initially in the red cluster, is closer to the centroid of the grey cluster.
Thus, we re-assign that data point to the grey cluster.

5. Re-compute cluster centroids: now, re-compute the centroids for both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: we repeat the
4th and 5th steps until the algorithm converges to a local optimum. When no data point
switches clusters between two successive iterations, the algorithm terminates (unless a
maximum number of iterations is specified explicitly).

Here is a minimal sketch of the K Means algorithm using the scikit-learn library (the 2-D data
points are made-up values for illustration):
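
```python
import numpy as np
from sklearn.cluster import KMeans

# Five made-up data points in 2-D space, as in the walkthrough above
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0]])

# Step 1: specify the desired number of clusters K
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# Steps 2-6 happen inside fit(): initialization, centroid computation,
# and re-assignment are repeated until the assignments stop changing
kmeans.fit(X)

print(kmeans.labels_)           # cluster index of each data point
print(kmeans.cluster_centers_)  # final centroids
```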

5. Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
This algorithm starts with every data point assigned to a cluster of its own. Then the two nearest
clusters are merged into one. The algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using a dendrogram, which can be
interpreted as follows: at the bottom, we start with 25 data points, each assigned to its own
cluster. The two closest clusters are then merged repeatedly until we have just one cluster at the
top. The height in the dendrogram at which two clusters are merged represents the distance
between those two clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the
dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum vertical distance without intersecting a cluster.

In the above example, the best choice for the number of clusters is 4, as the red horizontal line in
the dendrogram covers the maximum vertical distance AB.
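
As a rough sketch of how such a dendrogram can be produced (assuming SciPy and Matplotlib are available; the 25 points are made-up values):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# 25 made-up points in 2-D space, loosely forming three groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)),
               rng.normal(6.0, 1.0, (10, 2)),
               rng.normal(12.0, 1.0, (5, 2))])

# Bottom-up (agglomerative) clustering; 'ward' merges the pair of
# clusters that least increases the total within-cluster variance
Z = linkage(X, method='ward')

# Each U-shape in the plot is a merge; its height is the merge distance
dendrogram(Z)
plt.xlabel('data point index')
plt.ylabel('merge distance')
plt.show()
```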
Two important things that you should know about hierarchical clustering are:

- This algorithm has been described above using the bottom-up (agglomerative) approach. It is
also possible to follow a top-down (divisive) approach, starting with all data points assigned to
the same cluster and recursively performing splits until each data point is assigned a separate
cluster.
- The decision of merging two clusters is taken on the basis of the closeness of these clusters.
There are multiple metrics for deciding the closeness of two clusters:
  - Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
  - Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
  - Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
  - Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
  - Mahalanobis distance: $\sqrt{(a-b)^T S^{-1} (a-b)}$, where $S$ is the covariance matrix
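
As a quick illustration, here is a sketch of these metrics in NumPy (the vectors, and the identity covariance matrix, are made-up values):

```python
import numpy as np

# Made-up example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # ||a-b||_2
squared_euclidean = np.sum((a - b) ** 2)   # ||a-b||_2^2
manhattan = np.sum(np.abs(a - b))          # ||a-b||_1
maximum = np.max(np.abs(a - b))            # ||a-b||_inf

# Mahalanobis distance needs a covariance matrix S estimated from data;
# the identity matrix is used here only to keep the sketch self-contained
# (with S = I it reduces to the Euclidean distance)
S = np.eye(3)
mahalanobis = np.sqrt((a - b) @ np.linalg.inv(S) @ (a - b))

print(euclidean, squared_euclidean, manhattan, maximum, mahalanobis)
```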

6. Difference between K Means and Hierarchical clustering


- Hierarchical clustering can’t handle big data well, but K Means clustering can. This is
because the time complexity of K Means is linear in the number of points, i.e. O(n), while that
of hierarchical clustering is quadratic, i.e. O(n²).
- In K Means clustering, since we start with a random choice of clusters, the results produced
by running the algorithm multiple times might differ. In hierarchical clustering, the results
are reproducible.
- K Means is found to work well when the shape of the clusters is hyperspherical (like a
circle in 2D or a sphere in 3D).
- K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to
divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number
of clusters you find appropriate by interpreting the dendrogram (a short sketch contrasting the
two follows this list).
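
A small sketch contrasting the two (made-up data; AgglomerativeClustering is scikit-learn's bottom-up hierarchical implementation):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Made-up data: two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(5.0, 0.5, (20, 2))])

# K Means: K must be chosen up front, and results can vary with the
# random initialization (fixed here via random_state)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: deterministic for a given linkage; the number
# of clusters can be decided after inspecting the dendrogram
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

print(km_labels)
print(hc_labels)
```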

7. Applications of Clustering
Clustering has a large number of applications spread across various domains. Some of the most
popular applications of clustering are:

- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
