K-means Clustering
K-means Algorithm
The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster; the M-step computes the centroid of each cluster. Below is a breakdown of how we can solve it mathematically.
It’s a minimization problem in two parts. We first minimize J with respect to w_ik while treating μ_k as fixed; then we minimize J with respect to μ_k while treating w_ik as fixed. The objective function is

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2,

where w_ik = 1 if data point x_i belongs to cluster k, and 0 otherwise. Technically speaking, we differentiate J with respect to w_ik first and update the cluster assignments (E-step); then we differentiate J with respect to μ_k and recompute the centroids after the cluster assignments from the previous step (M-step). Therefore, the E-step is

w_{ik} = \begin{cases} 1 & \text{if } k = \operatorname{arg\,min}_j \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise.} \end{cases}

In other words, assign the data point x_i to the closest cluster as judged by its squared distance from the cluster’s centroid. The M-step is

\mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x_i}{\sum_{i=1}^{m} w_{ik}},

which translates to recomputing the centroid of each cluster to reflect the new assignments.
Implementation
We’ll use a simple implementation of k-means here just to illustrate some concepts. Then we will use the sklearn implementation, which is more efficient and takes care of many things for us.
Python code:
import numpy as np
from numpy.linalg import norm

class Kmeans:
    '''Implementing the K-means algorithm.'''

    def __init__(self, n_clusters, max_iter=100, random_state=123):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def initialize_centroids(self, X):
        # Pick n_clusters random data points as the initial centroids.
        rng = np.random.RandomState(self.random_state)
        random_idx = rng.permutation(X.shape[0])
        centroids = X[random_idx[:self.n_clusters]]
        return centroids

    def compute_centroids(self, X, labels):
        # M-step: each centroid is the mean of the points assigned to it.
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            centroids[k, :] = np.mean(X[labels == k, :], axis=0)
        return centroids

    def compute_distance(self, X, centroids):
        # Squared Euclidean distance from every point to every centroid.
        distance = np.zeros((X.shape[0], self.n_clusters))
        for k in range(self.n_clusters):
            row_norm = norm(X - centroids[k, :], axis=1)
            distance[:, k] = np.square(row_norm)
        return distance

    def find_closest_cluster(self, distance):
        # E-step: assign each point to its nearest centroid.
        return np.argmin(distance, axis=1)

    def compute_sse(self, X, labels, centroids):
        # Sum of squared errors between points and their assigned centroids.
        distance = np.zeros(X.shape[0])
        for k in range(self.n_clusters):
            distance[labels == k] = norm(X[labels == k] - centroids[k], axis=1)
        return np.sum(np.square(distance))

    def fit(self, X):
        self.centroids = self.initialize_centroids(X)
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroids(X, self.labels)
            # Stop early once the assignments no longer move the centroids.
            if np.all(old_centroids == self.centroids):
                break
        self.error = self.compute_sse(X, self.labels, self.centroids)

    def predict(self, X):
        distance = self.compute_distance(X, self.centroids)
        return self.find_closest_cluster(distance)
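To see the class in action, here is a minimal usage sketch on synthetic data; the make_blobs helper and all parameter values below are illustrative choices, not part of the implementation above.

Python code:
from sklearn.datasets import make_blobs  # toy data for illustration only

# Three well-separated Gaussian blobs (parameters are arbitrary).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = Kmeans(n_clusters=3, max_iter=100, random_state=123)
km.fit(X)
print(km.labels[:10])  # cluster assignment of the first ten points
print(km.error)        # final sum of squared errors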
Applications
The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goal of a cluster analysis is usually one of the following:
1. Get a meaningful intuition of the structure of the data we’re dealing with.
2. Cluster-then-predict, where different models are built for different subgroups if we believe there is wide variation in the behaviors of the subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of having a heart attack (see the sketch after this list).
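Here is a minimal sketch of the cluster-then-predict idea. The make_classification toy data, the logistic-regression model, and all parameters are illustrative assumptions, not a prescription from the text above.

Python code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification  # toy stand-in for patient data

# Toy data: 500 samples, 8 features, binary outcome.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1. Cluster the samples into subgroups.
km = KMeans(n_clusters=3, random_state=0, n_init=10)
groups = km.fit_predict(X)

# 2. Fit one predictive model per subgroup.
models = {}
for k in np.unique(groups):
    mask = groups == k
    models[k] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

# 3. At prediction time, route a new sample to its cluster's model.
x_new = X[:1]
k_new = km.predict(x_new)[0]
print(models[k_new].predict_proba(x_new))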
Step-by-Step Implementation
1. Import Required Libraries: Ensure you have the necessary libraries installed. We will use numpy, matplotlib, and sklearn.
2. Load and Preprocess the Data: Load your 2D dataset (image) and preprocess it as needed (e.g., normalization, resizing).
3. Apply K-means Clustering: Reshape the data for clustering and apply the K-means algorithm.
4. Visualize the Segmentation Result: Display the original and segmented images for comparison.
Example Code
Here's a detailed example using a sample image from the skimage library. Replace the
sample image with your geyser dataset.
import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from sklearn.cluster import KMeans

# Load a sample grayscale image (replace with your geyser dataset).
image = data.camera()

plt.figure(figsize=(6, 6))
plt.imshow(image, cmap='gray')
plt.title('Original Image')
plt.axis('off')
plt.show()

# Each pixel intensity becomes one data point for clustering.
X = image.reshape(-1, 1)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)
labels = kmeans.labels_

# Map the cluster labels back onto the image grid.
segmented_image = labels.reshape(image.shape)
plt.figure(figsize=(6, 6))
plt.imshow(segmented_image, cmap='gray')
plt.title('Segmented Image')
plt.axis('off')
plt.show()
Explanation
1. Loading the Image: The data.camera() function is used to load a sample grayscale
image. Replace this with your actual geyser data.
2. Reshaping for Clustering: The image is reshaped from a 2D array to a 1D array
where each pixel intensity is a data point for clustering.
3. K-means Clustering: The K-means algorithm is applied to the reshaped data. The
number of clusters (n_clusters) is set to 2, assuming we want to separate the geyser
eruptions from the background. Adjust the number of clusters based on your specific
dataset.
4. Reshaping and Displaying the Segmented Image: The clustered labels are reshaped
back to the original image dimensions and displayed.
Considerations
Number of Clusters: The choice of n_clusters is crucial. If your dataset has more than two regions (e.g., background, geyser, and another feature), increase the number of clusters; one common way to probe this is sketched after these considerations.
Initialization: The random state in KMeans ensures reproducibility. You can adjust or remove
it for different runs.
Preprocessing: Depending on your data, consider additional preprocessing steps like
smoothing or noise reduction for better clustering results.
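One common heuristic for choosing the number of clusters is the elbow method: run k-means for a range of k values, plot the sum of squared errors (sklearn exposes it as inertia_), and look for the point where improvements level off. The range of k and the subsampling below are illustrative choices, not part of the walkthrough above.

Python code:
import matplotlib.pyplot as plt
from skimage import data
from sklearn.cluster import KMeans

# Subsample the pixel intensities from the example above to keep the loop fast.
X = data.camera().reshape(-1, 1)[::10]

ks = range(1, 8)
sse = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Sum of squared errors')
plt.title('Elbow Plot')
plt.show()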
Image Compression
Using K-means clustering for image compression is an interesting approach in which the algorithm reduces the number of colors in an image, effectively compressing it. This is achieved by clustering the pixel colors and then replacing each pixel’s color with the centroid of its cluster. Here is a detailed step-by-step guide to image compression using K-means clustering:
Step-by-Step Implementation
1. Import Required Libraries: Ensure you have the necessary libraries installed. We will use numpy, matplotlib, sklearn, and skimage.
2. Load and Preprocess the Image: Load your image and preprocess it by normalizing the pixel values.
3. Apply K-means Clustering: Reshape the image data for clustering, apply K-means, and replace each pixel with its corresponding cluster centroid.
4. Reconstruct and Display the Compressed Image: Reconstruct the compressed image and display it alongside the original for comparison.
Example Code
import matplotlib.pyplot as plt
from skimage import io
from sklearn.cluster import KMeans

url = ('https://upload.wikimedia.org/wikipedia/commons/4/47/'
       'PNG_transparency_demonstration_1.png')
image = io.imread(url)[:, :, :3]  # keep only the R, G, B channels

plt.figure(figsize=(6, 6))
plt.imshow(image)
plt.title('Original Image')
plt.axis('off')
plt.show()

# Normalize to [0, 1]; each pixel's (R, G, B) triple is one data point.
pixels = image.reshape(-1, 3) / 255.0

n_colors = 16  # illustrative palette size; adjust for quality vs. compression
kmeans = KMeans(n_clusters=n_colors, random_state=0, n_init=10)
kmeans.fit(pixels)
labels = kmeans.predict(pixels)

# Replace every pixel with the color of its cluster centroid.
compressed_image = kmeans.cluster_centers_[labels]
compressed_image = compressed_image.reshape(image.shape)

plt.figure(figsize=(6, 6))
plt.imshow(compressed_image)
plt.title('Compressed Image')
plt.axis('off')
plt.show()
Explanation
1. Loading the Image: The io.imread() function from skimage is used to load an
image from a URL. Replace the URL with your local image path if needed.
2. Normalizing the Image: The image pixel values are normalized to the range [0, 1] by
dividing by 255. This is important for consistent clustering.
3. Reshaping for Clustering: The image is reshaped from a 3D array (height, width,
channels) to a 2D array where each row represents a pixel and each column represents
the color channels (R, G, B).
4. K-means Clustering: The K-means algorithm is applied to the reshaped data to find
n_colors clusters. Each pixel is then assigned to the nearest cluster centroid.
5. Reconstructing the Compressed Image: The pixels are replaced with their
corresponding cluster centroids, and the image is reshaped back to its original
dimensions.
6. Displaying the Images: Both the original and compressed images are displayed for
comparison.
Considerations
Number of Colors: The choice of n_colors (number of clusters) affects the level of compression. Fewer colors result in higher compression but lower image quality; a rough size estimate is sketched at the end of these considerations.
Initialization: The random state in KMeans ensures reproducibility. You can adjust or remove it for
different runs.
Preprocessing: Depending on the image, consider additional preprocessing steps like resizing or
filtering for better clustering results.
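The rough size estimate mentioned above can be made concrete with a little arithmetic, assuming a simple palette encoding (the stored centroid colors plus one palette index per pixel) rather than an actual file format such as PNG:

Python code:
import numpy as np

def palette_compression_ratio(height, width, n_colors):
    """Estimate raw-RGB size vs. palette-indexed size (ignores file headers)."""
    original_bits = height * width * 24                       # 8 bits per channel
    palette_bits = n_colors * 24                              # stored centroid colors
    index_bits = height * width * np.ceil(np.log2(n_colors))  # one index per pixel
    return original_bits / (palette_bits + index_bits)

# e.g., an 800x600 image reduced to 16 colors
print(round(palette_compression_ratio(800, 600, 16), 1))  # roughly 6x smaller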
Conclusion
K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. The goal of k-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters are roughly spherical. However, it suffers as the geometric shapes of the clusters deviate from spherical. Moreover, it does not learn the number of clusters from the data and requires K to be pre-defined.