A Tutorial On Clustering Algorithms
K-Means Clustering
The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The procedure
follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to
define k centroids, one for each cluster. These centroids should be placed in a
cunning way, because different locations cause different results. So the better
choice is to place them as far away from each other as possible. The next step
is to take each point belonging to the given data set and associate it to the
nearest centroid. When no point is pending, the first step is completed and an
early grouping is done. At this point we need to re-calculate k new centroids
as barycenters of the clusters resulting from the previous step. After we have
these k new centroids, a new binding has to be done between the same data
set points and the nearest new centroid. A loop has been generated. As a
result of this loop we may notice that the k centroids change their location
step by step until no more changes are made; in other words, the centroids do
not move any more.
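The loop just described translates almost line for line into code. What follows
is a minimal NumPy sketch, not code from the tutorial itself; the function name
kmeans, the random initialization, the convergence test, and the empty-cluster
guard are choices made here for illustration:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Minimal k-means sketch: assign each point to the nearest centroid,
        then recompute each centroid as the barycenter of its cluster."""
        rng = np.random.default_rng(seed)
        # Initialize the centroids by picking k distinct data points at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: label each point with its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its cluster
            # (keeping the old centroid if its cluster happens to be empty).
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Stop when the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels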
Finally, this algorithm aims at minimizing an objective function, in this case a
squared error function. The objective function

    J = Σ_{j=1}^{k} Σ_{i=1}^{n} || x_i^(j) - c_j ||^2,

where || x_i^(j) - c_j ||^2 is a chosen distance measure between a data point
x_i^(j) and the cluster centre c_j, is an indicator of the distance of the n
data points from their respective cluster centres.
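For completeness, J can be computed directly from an assignment. A minimal
sketch, assuming X, labels, and centroids are NumPy arrays shaped like those
returned by the kmeans sketch above:

    import numpy as np

    def objective(X, labels, centroids):
        # J: sum of squared distances between each point and its cluster centre.
        return float(np.sum((X - centroids[labels]) ** 2))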
Although it can be proved that the procedure will always terminate, the
k-means algorithm does not necessarily find the optimal configuration
corresponding to the global minimum of the objective function. The algorithm
is also significantly sensitive to the randomly selected initial cluster
centres. The k-means algorithm can be run multiple times to reduce this effect.
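One way to carry out these restarts is to refit from several random seeds and
keep the run with the lowest objective. A minimal sketch, assuming scikit-learn
is available (its KMeans estimator exposes the final objective value as
inertia_); the toy data here are made up for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((200, 2))  # made-up toy data
    # Fit from 10 different random initializations and keep the lowest objective
    # (scikit-learn's n_init parameter automates exactly this).
    best = min(
        (KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)),
        key=lambda km: km.inertia_,
    )
    print(best.inertia_)
    print(best.cluster_centers_)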
An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same
class, and we know that they fall into k compact clusters, k < n. Let mi be the
mean of the vectors in cluster i. If the clusters are well separated, we can use a
minimum-distance classifier to separate them. That is, we can say that x is in
cluster i if || x - mi || is the minimum of all the k distances. This suggests the
following procedure for finding the k means:

    Make initial guesses for the means m1, m2, ..., mk
    Until there are no changes in any mean:
        Use the estimated means to classify the samples into clusters
        For i from 1 to k:
            Replace mi with the mean of all of the samples for cluster i
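The minimum-distance rule used in the classification step is a one-liner in
practice. A minimal sketch (the helper name nearest_mean is an assumption made
here for illustration):

    import numpy as np

    def nearest_mean(x, means):
        # Index i for which || x - m_i || is the minimum of all the k distances.
        return int(np.argmin(np.linalg.norm(means - x, axis=1)))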
[Figure: the means m1 and m2 move into the centers of two clusters over
successive iterations.]
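The same behaviour is easy to trace numerically. A minimal sketch on made-up
data (two well-separated Gaussian blobs, an assumption for illustration) that
prints m1 and m2 after each update:

    import numpy as np

    rng = np.random.default_rng(0)
    # Two well-separated 2-D clusters.
    X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])

    means = X[rng.choice(len(X), size=2, replace=False)]  # initial guesses
    for step in range(10):
        # Classify each sample by its nearest mean.
        labels = np.argmin(np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2), axis=1)
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(2)])
        print(f"step {step}: m1 = {new_means[0].round(2)}, m2 = {new_means[1].round(2)}")
        if np.allclose(new_means, means):
            break  # the means have stopped moving
        means = new_means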
Remarks
This is a simple version of the k-means procedure. It can be viewed as a
greedy algorithm for partitioning the n samples into k clusters so as to
minimize the sum of the squared distances to the cluster centers. It does have
some weaknesses:
- The way to initialize the means was not specified. One popular way to
  start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means, and it
  frequently happens that suboptimal partitions are found. The standard
  solution is to try a number of different starting points, as in the
  restart sketch above.
- It can happen that the set of samples closest to mi is empty, so that mi
  cannot be updated. This is an annoyance that must be handled in an
  implementation, but that we shall ignore.
- The results depend on the metric used to measure || x - mi ||. A popular
  solution is to normalize each variable by its standard deviation, though
  this is not always desirable; a sketch follows this list.
- The results depend on the value of k.
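Normalizing by the standard deviation amounts to rescaling each column of the
data before clustering. A minimal sketch in plain NumPy, assuming X is an
n-by-d array with no constant columns:

    import numpy as np

    def standardize(X):
        # Rescale each variable (column) by its standard deviation so that no
        # single variable dominates the distance || x - mi ||; centring by the
        # column mean first is common but not required by the remark above.
        return X / X.std(axis=0)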
Bibliography
J. B. MacQueen (1967): "Some Methods for Classification and Analysis of
Multivariate Observations", Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability, University of California Press,
Vol. 1, pp. 281-297.

Website:
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html