K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarity. It partitions the data space into Voronoi cells around cluster centers (centroids), so that each data point is assigned to the cluster whose centroid is nearest. The number of clusters k must be chosen beforehand. The elbow method, applied to the within-cluster sum of squares (WCSS), can help identify a suitable number of clusters: the elbow point marks the "right" number, beyond which adding more clusters does not significantly improve the model. Random initialization of centroids can affect the clustering results, so the algorithm is typically run multiple times.
K-Means Clustering Using Python
K-means Clustering
Meghana Tribhuwan
K-means
• K-means clears the confusion as to how many groups
• Till you reach a point where no reassignment is needed
2nd example
Random Initialization
• We take more appropriate clusters
• End result
What will happen if we have a bad random (centroid) initialization?
• Assume that we perform all of the steps again.
• Final result after performing the said steps
• Before / After
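The loop that runs "till no reassignment is needed" can be sketched in plain Python. The slides do not spell out the individual steps, so this follows the standard formulation (assign each point to its nearest centroid, move each centroid to the mean of its points, repeat). The function names, the `seed` and `max_iter` parameters, and the toy data are all assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, labels), iterating until no point changes cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no reassignment needed, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels

# Made-up toy data: two small groups of points.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(data, k=2)
print(labels)
print(centroids)
```

Changing `seed` changes the starting centroids, and with less clearly separated data that can change the final clusters, which is the effect the random-initialization slides illustrate.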
• The selection of the centroids has a huge impact on the clusters, and the assignment of centroids is random… then
Algorithm to identify (decide) the right number of clusters
• Suppose we decide on 3 clusters and apply the K-means algorithm.
• How do we know which will perform better: 3, 4, or 10 clusters?
• Formula to choose the right number of clusters: Within-Cluster-Sum-of-Squares (WCSS)
Within-Cluster-Sum-of-Squares (WCSS)
• Calculate each point's distance from its centroid, square it, and add up the squares.
• What if we take only one big cluster?
• The distance between the points and the centroid will be larger, and so will the WCSS value.
• WCSS decreases when we make 2 clusters.
• With more clusters, WCSS decreases further.
• The question is: how many clusters can we have?
• The maximum number of clusters is the number of data points you have, e.g. 50 points, 50 clusters.
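As a small worked example of the WCSS calculation, the sketch below computes it by hand for one big cluster and then for two clusters. The `wcss` helper and the four points are made up for illustration; the clusters are assigned by hand here rather than found by K-means.

```python
def wcss(clusters):
    """Sum of squared point-to-centroid distances over all clusters.

    `clusters` is a list of clusters, each a non-empty list of (x, y) points.
    """
    total = 0.0
    for members in clusters:
        n = len(members)
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = tuple(sum(c) / n for c in zip(*members))
        for p in members:
            total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total

# Made-up points: two tight groups that are far apart.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(10.0, 0.0), (10.0, 2.0)]

one_cluster = wcss([left + right])  # everything in one big cluster
two_clusters = wcss([left, right])  # one cluster per group

print(one_cluster, two_clusters)  # 104.0 4.0
```

With one cluster the centroid sits at (5, 1), so each point contributes 25 + 1 = 26. With two clusters each point is only 1 unit from its own centroid, which is why the total drops from 104 to 4.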
• Pause the video, think, and tell me: what will be the value of WCSS?
• The answer is that it will be zero.
• Every point will be its own centroid, so the distance between each point and its centroid will be 0. Squared, the values are still 0, and after adding them up the total is also 0.
• The lesser the WCSS, the better our goodness of fit will be.
• But how do we find the optimum goodness of fit?
Elbow method
• But it is an arbitrary method, not very precise.
• The elbow method is a hint; ultimately you have to try 2, then 3, then 4 clusters and decide for yourself, as you are the one who is analysing the data.
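One way to produce the numbers behind an elbow plot is to run K-means for several values of k and record the WCSS of each. The sketch below uses only the standard library. The helper names, the `restarts` parameter (several random initializations per k, keeping the lowest WCSS) and the three made-up blobs of points are assumptions for illustration, not part of the original slides.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, rng, max_iter=100):
    """Run K-means once from a random start and return the resulting WCSS."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no reassignment needed
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def elbow_curve(points, max_k, restarts=10, seed=0):
    """Best WCSS over several random starts, for each k from 1 to max_k."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_wcss(points, k, rng) for _ in range(restarts))
        for k in range(1, max_k + 1)
    }

# Made-up data: 20 noisy points around each of three centers.
data_rng = random.Random(42)
centers = [(0, 0), (10, 0), (5, 9)]
data = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in centers
    for _ in range(20)
]

curve = elbow_curve(data, max_k=6)
for k, value in curve.items():
    print(k, round(value, 2))

# With as many clusters as points, every point is its own centroid.
wcss_all_singletons = kmeans_wcss(data, len(data), random.Random(0))
print(wcss_all_singletons)
```

The exact values depend on the seeds, so none are quoted here. The thing to look for is the k after which the drop in WCSS flattens out, and as the slides say, that is only a hint. In practice a library is normally used instead of hand-written code; in scikit-learn, for example, the `inertia_` attribute of a fitted `KMeans` reports the same sum of squared distances.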