

N. LOKESH KUMAR, 22E41A3206, CSBS-II

K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks
Clustering
Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups
in the data such that data points in the same subgroup (cluster) are very similar while data
points in different clusters are very different. In other words, we try to find homogeneous
subgroups within the data such that data points in each cluster are as similar as possible
according to a similarity measure such as Euclidean-based distance or correlation-based
distance. The decision of which similarity measure to use is application-specific.
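
To make the difference concrete, here is a minimal NumPy sketch (the two vectors are made up for illustration): Euclidean distance reflects how far apart the raw values are, while correlation-based distance reflects how dissimilar the patterns are.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

# Euclidean distance: magnitude of the element-wise difference
euclidean = np.linalg.norm(a - b)                # ~5.48

# Correlation-based distance: 1 - Pearson correlation; b = 2a is
# perfectly correlated with a, so this distance is 0 despite the
# large Euclidean gap
correlation_dist = 1 - np.corrcoef(a, b)[0, 1]   # 0.0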

K-means Algorithm
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-
defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only
one group. It tries to make the intra-cluster data points as similar as possible while also
keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that
the sum of the squared distances between the data points and the cluster's centroid (the arithmetic
mean of all the data points that belong to that cluster) is at a minimum. The less variation
we have within clusters, the more homogeneous (similar) the data points are within the same
cluster.

The way the K-means algorithm works is as follows:

1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data
points for the centroids without replacement.
3. Keep iterating over the following steps until there is no change to the centroids, i.e., the
assignment of data points to clusters isn't changing.
4. Compute the sum of the squared distances between data points and all centroids.
5. Assign each data point to the closest cluster (centroid).
6. Compute the centroids for the clusters by taking the average of all the data points that
belong to each cluster.

The approach K-means follows to solve the problem is called Expectation-Maximization. The E-step is
assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster.
Below is a breakdown of how we can solve it mathematically.

The objective function is:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \lVert x_i - \mu_k \rVert^2

where w_{ik} = 1 for data point x_i if it belongs to cluster k; otherwise, w_{ik} = 0. Also, \mu_k is the
centroid of x_i's cluster.

It's a minimization problem in two parts. We first minimize J w.r.t. w_{ik} with \mu_k held fixed.
Then we minimize J w.r.t. \mu_k with w_{ik} held fixed. Technically speaking, we differentiate J
w.r.t. w_{ik} first and update the cluster assignments (E-step). Then we differentiate J w.r.t. \mu_k and
recompute the centroids after the cluster assignments from the previous step (M-step). Therefore,
the E-step is:

w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}

In other words, assign the data point x_i to the closest cluster as measured by its squared
distance from the cluster's centroid.

And the M-step is:

\mu_k = \frac{\sum_{i=1}^{m} w_{ik} x_i}{\sum_{i=1}^{m} w_{ik}}

which translates to recomputing the centroid of each cluster to reflect the new assignments.

Implementation
We'll use a simple implementation of K-means here just to illustrate some concepts. Then we
will use the sklearn implementation, which is more efficient and takes care of many things for us.

Python code:

import numpy as np
from numpy.linalg import norm

class Kmeans:
    '''Implementing the Kmeans algorithm.'''

    def __init__(self, n_clusters, max_iter=100, random_state=123):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def initialize_centroids(self, X):
        # Shuffle the row indices and pick the first K points as centroids.
        rng = np.random.RandomState(self.random_state)
        random_idx = rng.permutation(X.shape[0])
        centroids = X[random_idx[:self.n_clusters]]
        return centroids

    def compute_centroids(self, X, labels):
        # M-step: each centroid is the mean of the points assigned to it.
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            centroids[k, :] = np.mean(X[labels == k, :], axis=0)
        return centroids

    def compute_distance(self, X, centroids):
        # Squared Euclidean distance from every point to every centroid.
        distance = np.zeros((X.shape[0], self.n_clusters))
        for k in range(self.n_clusters):
            row_norm = norm(X - centroids[k, :], axis=1)
            distance[:, k] = np.square(row_norm)
        return distance

    def find_closest_cluster(self, distance):
        # E-step: assign each point to its nearest centroid.
        return np.argmin(distance, axis=1)

    def compute_sse(self, X, labels, centroids):
        # Sum of squared errors: the objective J being minimized.
        distance = np.zeros(X.shape[0])
        for k in range(self.n_clusters):
            distance[labels == k] = norm(X[labels == k] - centroids[k], axis=1)
        return np.sum(np.square(distance))

    def fit(self, X):
        self.centroids = self.initialize_centroids(X)
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroids(X, self.labels)
            # Stop early once the assignments (and hence centroids) are stable.
            if np.all(old_centroids == self.centroids):
                break
        self.error = self.compute_sse(X, self.labels, self.centroids)

    def predict(self, X):
        distance = self.compute_distance(X, self.centroids)
        return self.find_closest_cluster(distance)
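
A quick usage sketch of the class above (the three synthetic Gaussian blobs are made up for illustration):

import numpy as np

np.random.seed(0)
# Three well-separated 2D blobs
X = np.vstack([
    np.random.randn(100, 2),
    np.random.randn(100, 2) + [5, 5],
    np.random.randn(100, 2) + [0, 5],
])

km = Kmeans(n_clusters=3)
km.fit(X)
print(km.centroids)        # roughly (0, 0), (5, 5), and (0, 5)
print(km.error)            # final sum of squared errors
labels = km.predict(X)     # cluster index for each point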
Applications
The K-means algorithm is very popular and is used in a variety of applications such as market
segmentation, document clustering, image segmentation, and image compression. When we
undertake a cluster analysis, the goal is usually one of the following:

1. Get a meaningful intuition of the structure of the data we're dealing with.
2. Cluster-then-predict, where different models are built for different subgroups if we
believe there is wide variation in the behaviors of the subgroups. An example
of that is clustering patients into different subgroups and building a model for each
subgroup to predict the probability of the risk of having a heart attack, as sketched below.
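
As a rough sketch of the cluster-then-predict idea (the synthetic data, feature count, and choice of logistic regression are illustrative assumptions, not a prescription):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
X = rng.randn(300, 4)                     # e.g., patient features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # e.g., risk label (synthetic)

# Step 1: cluster the population into subgroups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Step 2: fit one model per subgroup (each cluster should contain
# both classes for this to work)
models = {k: LogisticRegression().fit(X[kmeans.labels_ == k],
                                      y[kmeans.labels_ == k])
          for k in range(3)}

# At prediction time, route each new point to its cluster's model
X_new = rng.randn(5, 4)
clusters = kmeans.predict(X_new)
preds = [models[k].predict(x.reshape(1, -1))[0]
         for k, x in zip(clusters, X_new)]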

In this post, we'll apply clustering to two cases:

• Geyser eruptions segmentation (2D dataset).
• Image compression.

Geyser eruptions segmentation (2D dataset)


To segment geyser eruptions in a 2D dataset using K-means clustering, follow these steps. We will
use the K-means algorithm to cluster the pixel intensities into different segments, assuming the
dataset is an image where eruptions have distinct intensity values compared to the background.

Step-by-Step Implementation

1. Import Required Libraries: Ensure you have the necessary libraries installed. We
will use numpy, matplotlib, and sklearn.
2. Load and Preprocess the Data: Load your 2D dataset (image) and preprocess it as
needed (e.g., normalization, resizing).
3. Apply K-means Clustering: Reshape the data for clustering and apply the K-means
algorithm.
4. Visualize the Segmentation Result: Display the original and segmented images for
comparison.

Example Code

Here's a detailed example using a sample image from the skimage library. Replace the
sample image with your geyser dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import data

# Load sample data (replace this with actual geyser data)
# For demonstration, we'll use a sample grayscale image from skimage
image = data.camera()  # Example image; replace with actual geyser data

# Display the original image
plt.figure(figsize=(6, 6))
plt.imshow(image, cmap='gray')
plt.title('Original Image')
plt.axis('off')
plt.show()

# Reshape the data for clustering: one row per pixel, one intensity column
X = image.reshape(-1, 1)

# Apply K-means clustering
n_clusters = 2  # Number of clusters (adjust based on your data)
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Reshape the labels back to the original image shape
segmented_image = labels.reshape(image.shape)

# Plot the segmented image
plt.figure(figsize=(6, 6))
plt.imshow(segmented_image, cmap='gray')
plt.title('K-means Clustering Segmentation')
plt.axis('off')
plt.show()
Explanation

1. Loading the Image: The data.camera() function is used to load a sample grayscale
image (assigned to image here so it doesn't shadow the skimage data module). Replace
this with your actual geyser data.
2. Reshaping for Clustering: The image is reshaped from a 2D array to a 1D array
where each pixel intensity is a data point for clustering.
3. K-means Clustering: The K-means algorithm is applied to the reshaped data. The
number of clusters (n_clusters) is set to 2, assuming we want to separate the geyser
eruptions from the background. Adjust the number of clusters based on your specific
dataset.
4. Reshaping and Displaying the Segmented Image: The clustered labels are reshaped
back to the original image dimensions and displayed.

Considerations

• Number of Clusters: The choice of n_clusters is crucial. If your dataset has more than two
regions (e.g., background, geyser, and another feature), increase the number of clusters; a
minimal elbow-method sketch follows this list.
• Initialization: The random state in KMeans ensures reproducibility. You can adjust or remove
it for different runs.
• Preprocessing: Depending on your data, consider additional preprocessing steps like
smoothing or noise reduction for better clustering results.
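
One common way to choose n_clusters is the elbow method: fit K-means for a range of K values and look for the point where the inertia (the within-cluster sum of squared errors, exposed by sklearn as inertia_) stops dropping sharply. A minimal sketch, assuming X is the reshaped pixel array from the example above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# For large images you may want to subsample the pixels first for speed
inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster SSE)')
plt.title('Elbow method')
plt.show()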

Image compression
Using K-means clustering for image compression is an interesting approach where the algorithm is
used to reduce the number of colors in an image, which effectively compresses the image.

This is achieved by clustering the pixel colors and then replacing each pixel color with the centroid of
its cluster. Here’s a detailed step-by-step guide to perform image compression using K-means
clustering:

Step-by-Step Implementation

1. Import Required Libraries: Ensure you have the necessary libraries installed. We
will use numpy, matplotlib, sklearn, and skimage.
2. Load and Preprocess the Image: Load your image and preprocess it by normalizing
the pixel values.
3. Apply K-means Clustering: Reshape the image data for clustering, apply K-means,
and replace each pixel with its corresponding cluster centroid.
4. Reconstruct and Display the Compressed Image: Reconstruct the compressed
image and display it alongside the original for comparison.

Example Code

Here's a detailed example using an image from the skimage library:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import io

# Load an image (replace this with your image path)
image = io.imread('https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png')

# This PNG has an alpha channel; keep only the R, G, B channels
if image.shape[-1] == 4:
    image = image[:, :, :3]

image = image / 255.0  # Normalize pixel values to [0, 1]

# Display the original image
plt.figure(figsize=(6, 6))
plt.imshow(image)
plt.title('Original Image')
plt.axis('off')
plt.show()

# Reshape the image data to a 2D array of pixels
pixels = image.reshape(-1, 3)

# Apply K-means clustering
n_colors = 16  # Number of colors/clusters
kmeans = KMeans(n_clusters=n_colors, random_state=42)
kmeans.fit(pixels)
labels = kmeans.predict(pixels)

# Replace each pixel with its corresponding cluster centroid
compressed_image = kmeans.cluster_centers_[labels]
compressed_image = compressed_image.reshape(image.shape)

# Display the compressed image
plt.figure(figsize=(6, 6))
plt.imshow(compressed_image)
plt.title(f'Compressed Image with {n_colors} colors')
plt.axis('off')
plt.show()

Explanation

1. Loading the Image: The io.imread() function from skimage is used to load an
image from a URL. Replace the URL with your local image path if needed. Because this
PNG has an alpha channel, the code keeps only the R, G, B channels before clustering.
2. Normalizing the Image: The image pixel values are normalized to the range [0, 1] by
dividing by 255. This is important for consistent clustering.
3. Reshaping for Clustering: The image is reshaped from a 3D array (height, width,
channels) to a 2D array where each row represents a pixel and each column represents
a color channel (R, G, B).
4. K-means Clustering: The K-means algorithm is applied to the reshaped data to find
n_colors clusters. Each pixel is then assigned to the nearest cluster centroid.
5. Reconstructing the Compressed Image: The pixels are replaced with their
corresponding cluster centroids, and the image is reshaped back to its original
dimensions.
6. Displaying the Images: Both the original and compressed images are displayed for
comparison.

Considerations
• Number of Colors: The choice of n_colors (number of clusters) affects the level of compression.
Fewer colors result in higher compression but lower image quality; a back-of-the-envelope
estimate follows this list.
• Initialization: The random state in KMeans ensures reproducibility. You can adjust or remove it
for different runs.
• Preprocessing: Depending on the image, consider additional preprocessing steps like resizing or
filtering for better clustering results.
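
To get a rough sense of the savings, note that an uncompressed RGB pixel costs 24 bits, while a pixel indexed into a palette of n_colors costs about log2(n_colors) bits plus a small fixed cost for storing the palette itself. A back-of-the-envelope sketch, assuming image is the RGB array from the example above:

import numpy as np

n_colors = 16
h, w = image.shape[:2]
original_bits = h * w * 24                  # 8 bits per R, G, B channel
palette_bits = n_colors * 24                # storing the 16 centroid colors
index_bits = h * w * np.log2(n_colors)      # 4 bits per pixel for 16 colors
ratio = original_bits / (index_bits + palette_bits)
print(ratio)                                # roughly 6x for 16 colors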

Conclusion
K-means clustering is one of the most popular clustering algorithms and is usually the first thing
practitioners apply when solving clustering tasks to get an idea of the structure of the dataset.
The goal of K-means is to group data points into distinct, non-overlapping subgroups. It does a
very good job when the clusters have roughly spherical shapes. However, it suffers as the
geometric shapes of the clusters deviate from spherical, as the sketch below illustrates. Moreover,
it doesn't learn the number of clusters from the data and requires it to be pre-defined.
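
To see the non-spherical drawback concretely, here is a small sketch (not part of the original write-up) using sklearn's make_moons, which generates two crescent-shaped clusters that K-means cannot separate cleanly:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# The straight nearest-centroid boundary cuts across both crescents
# instead of following their shapes
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm', s=10)
plt.title('K-means on two crescent-shaped clusters')
plt.show()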

Source: ChatGPT and GitHub
