
Unit IV


AIML – UNIT IV

Clustering in Machine Learning


Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be
defined as "a way of grouping the data points into different clusters, each consisting of similar data
points. Objects that are similar to one another stay in the same group, which has little or no
similarity with any other group."

It works by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior,
and divides the data points according to the presence or absence of those patterns.

It is an unsupervised learning method: no supervision is provided to the algorithm, and it works on an
unlabelled dataset.

After clustering is applied, each cluster or group is given a cluster ID. An ML system can
use this ID to simplify the processing of large and complex datasets.

Example: Let's understand clustering with a real-world example, a shopping mall. When we visit any
shopping mall, we can observe that things with similar uses are grouped together: t-shirts are in one
section and trousers in another, and in the produce section apples, bananas, mangoes, etc. are kept in
separate groups, so that we can easily find things. Clustering works in the same way. Another example
of clustering is grouping documents by topic.

Clustering is used in a wide range of tasks. Some of the most common uses of this technique
are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general uses, Amazon uses it in its recommendation system to suggest products based on
past product searches. Netflix also uses this technique to recommend movies and web series to its users
based on their watch history.

The diagram below illustrates how a clustering algorithm works. We can see that the different fruits are
divided into several groups with similar properties.

By Dr. Megha Mishra


Types of Clustering Methods


Clustering methods are broadly divided into hard clustering (each data point belongs to exactly one group)
and soft clustering (a data point can belong to more than one group). Several other approaches to
clustering also exist. Below are the main clustering methods used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
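
The difference between hard and soft clustering mentioned above can be illustrated with a small sketch. The function names, the sample centers, and the inverse-squared-distance weighting are assumptions chosen for illustration only; they are not taken from these notes.

```python
import math

def hard_assignment(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def soft_memberships(point, centers):
    """Soft clustering: return one weight per center, summing to 1.

    Assumption: weights are proportional to 1 / distance^2.
    """
    distances = [math.dist(point, c) for c in centers]
    if 0.0 in distances:
        # The point sits exactly on a center: full membership there.
        hit = distances.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(centers))]
    inverse = [1.0 / d ** 2 for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

centers = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical cluster centers
point = (1.0, 0.0)

label = hard_assignment(point, centers)
weights = soft_memberships(point, centers)
print(label)                            # 0
print([round(w, 2) for w in weights])   # [0.9, 0.1]
```

With hard clustering the point belongs only to cluster 0, while with soft clustering it belongs mostly to cluster 0 and slightly to cluster 1.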

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means Clustering
algorithm.

In this type, the dataset is divided into k groups, where k is the number of groups fixed in advance.
The cluster centers are chosen so that each data point is closer to the centroid of its own cluster
than to the centroid of any other cluster.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in
machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is
and how it works, along with a Python implementation of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an unsupervised learning algorithm that groups an unlabelled dataset into
different clusters. Here K is the number of clusters to be created, fixed in advance: if K=2 there will
be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that
divides the unlabelled dataset into K different clusters in such a way that each data point belongs to
only one group, whose members have similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories
of groups in an unlabelled dataset on its own, without the need for any labelled training data.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of squared distances between the data points and the centroids of their
corresponding clusters.
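
The quantity being minimized can be made concrete. K-means is usually stated as minimizing the within-cluster sum of squared Euclidean distances between each point and its cluster's centroid; the sketch below computes that value for a tiny, made-up dataset. The function name `wcss` and all data values are illustrative assumptions.

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squared Euclidean distances.

    points    : list of coordinate tuples
    labels    : labels[i] is the cluster index of points[i]
    centroids : centroids[j] is the center of cluster j
    """
    total = 0.0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

# Hypothetical data: two clusters of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (5.0, 2.0)]

objective = wcss(points, labels, centroids)
print(objective)  # 4.0, since every point is exactly 1 unit from its centroid
```

A lower value means the points sit more tightly around their centroids.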

The algorithm takes the unlabelled dataset as input, divides it into k clusters, and repeats the process
until the cluster assignments stop changing. The value of k must be chosen before the algorithm
runs.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the K center points, or centroids, by an iterative process.
o Assigns each data point to its closest centroid. The data points that are nearest to a particular
centroid form a cluster.

Hence each cluster contains data points with some commonalities, and it is separated from the other clusters.

The diagram below illustrates the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Compute the mean of the data points in each cluster and place the new centroid of that cluster there.

Step-5: Repeat step 3, that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to step-4; otherwise go to FINISH.

Step-7: The model is ready.
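
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the seed, the iteration cap, and the sample data are all assumptions, and the initial centroids are drawn from the dataset itself (one of the options Step-2 allows).

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                  # Step-2
    labels = [nearest(p, centroids) for p in points]   # Step-3
    for _ in range(max_iter):
        for j in range(k):                             # Step-4
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
        new_labels = [nearest(p, centroids) for p in points]  # Step-5
        if new_labels == labels:                       # Step-6: nothing moved
            break
        labels = new_labels
    return centroids, labels                           # Step-7

# Hypothetical data: two well-separated groups of three points.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initial centroids, but on this data the first three points end up in one cluster and the last three in the other.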

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y scatter plot of these two variables is given below:

o Let's take the number of clusters k as K=2, which means we will try to group the data points into
two different clusters.
o We need to choose k random points or centroids to form the clusters. These points can be either
points from the dataset or any other points. Here we select two points that are not part of our
dataset as the k points. Consider the image below:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We
compute this with the standard formula for the distance between two points. To see the result, we
draw a median line between the two centroids, that is, the perpendicular bisector of the segment
joining them. Consider the image below:

From the above image, it is clear that the points to the left of the line are nearer to K1, the blue
centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue
and yellow for clear visualization.
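
The assignment step illustrated above amounts to comparing Euclidean distances. In the sketch below, the two centroids and the four data points are hypothetical values (the coordinates in the figure are not given in these notes). With centroids at (2, 5) and (8, 5), the dividing line is the vertical line x = 5.

```python
import math

def assign_to_closest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

k1, k2 = (2.0, 5.0), (8.0, 5.0)   # hypothetical "blue" and "yellow" centroids
points = [(1.0, 4.0), (3.0, 6.0), (7.0, 5.0), (9.0, 3.0)]

labels = assign_to_closest(points, [k1, k2])
print(labels)  # [0, 0, 1, 1]: points with x < 5 go to k1, the others to k2
```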

o As we need to find the closest cluster, we will repeat the process with new centroids.
To choose the new centroids, we compute the center of gravity (the mean) of the data points in each
cluster and place the new centroids there, as below:
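
The center of gravity of a cluster is the coordinate-wise mean of its points. A minimal sketch, with a made-up cluster of three points:

```python
def centroid(points):
    """Coordinate-wise mean (center of gravity) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]   # hypothetical cluster members
new_centroid = centroid(cluster)
print(new_centroid)  # (3.0, 2.0)
```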

o Next, we will reassign each data point to the new centroids. For this, we repeat the same process
of finding a median line. The median will look like the image below:

From the above image, we can see that one yellow point is on the left side of the line and two blue
points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or
K-points.

o We will repeat the process by finding the center of gravity of the points in each cluster, so the
new centroids will be as shown in the image below:

o As we have new centroids, we will again draw the median line and reassign the data points. So, the
image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the image below:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as
shown in the image below:

K-Medoids clustering

The K-Medoids (also called Partitioning Around Medoids, or PAM) algorithm was proposed in 1987 by Kaufman
and Rousseeuw. A medoid can be defined as a point in the cluster whose total dissimilarity to all the
other points in the cluster is minimum. The dissimilarity between the medoid (Ci) and an object (Pi) is
calculated as E = |Pi – Ci|.
The cost in the K-Medoids algorithm is the sum of these dissimilarities over all clusters:
c = Σ (over clusters Ci) Σ (over Pi in Ci) |Pi – Ci|

Algorithm:

1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid, using any common distance metric.
3. While the cost decreases: for each medoid m, and for each data point o which is not a medoid:
   o Swap m and o, associate each data point with the closest medoid, and recompute the cost.
   o If the total cost is more than that in the previous step, undo the swap.

Let’s consider the following example: If a graph is drawn using the above data points, we obtain the
following:

Step 1: Let k = 2, and let the two randomly selected medoids be C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the medoids is calculated and
tabulated:

Here we have used the Manhattan distance formula to calculate the distances between medoid and
non-medoid points: Distance = |X1 – X2| + |Y1 – Y2|.
Each point is assigned to the cluster of the medoid to which its dissimilarity is smaller. Points 1, 2,
and 5 go to cluster C1, and points 0, 3, 6, 7, 8 go to cluster C2. The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.
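
The Manhattan distance and the cost arithmetic of Step 2 can be checked with a few lines of Python. The function name is an assumption; the medoid coordinates and the per-point dissimilarities are the values quoted in the text above.

```python
def manhattan(p, q):
    """Manhattan distance |X1 - X2| + |Y1 - Y2| between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

c1, c2 = (4, 5), (8, 5)              # the two medoids from Step 1
between_medoids = manhattan(c1, c2)
print(between_medoids)               # 4

# Dissimilarities quoted in the text for the points assigned to each medoid.
c1_dissimilarities = [3, 4, 4]
c2_dissimilarities = [3, 1, 1, 2, 2]
cost = sum(c1_dissimilarities) + sum(c2_dissimilarities)
print(cost)                          # 20
```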
Step 3: Randomly select one non-medoid point and recalculate the cost. Let the randomly selected point be
(8, 4). The dissimilarity of each non-medoid point with the medoids C1 (4, 5) and C2 (8, 4) is calculated
and tabulated.

Each point is assigned to the cluster whose medoid it is least dissimilar to. So, points 1, 2, and 5 go
to cluster C1, and points 0, 3, 6, 7, 8 go to cluster C2.

New Cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap Cost = New Cost – Previous Cost = 22 – 20 = 2, and 2 > 0.

As the swap cost is not less than zero, we undo the swap. Hence (4, 5) and (8, 5) are the final
medoids. The clustering would be as shown in the figure below.

The time complexity is O(k * (n – k)^2).
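
The swap decision in Step 3 can be reproduced directly from the numbers in the text. The helper name `keep_swap` is an assumption introduced for illustration.

```python
previous_cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)   # cost with medoids (4, 5) and (8, 5)
new_cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3)        # cost with medoids (4, 5) and (8, 4)
swap_cost = new_cost - previous_cost

def keep_swap(swap_cost):
    """A swap is kept only when it lowers the total cost."""
    return swap_cost < 0

medoids_before = [(4, 5), (8, 5)]
medoids_after = [(4, 5), (8, 4)]
final_medoids = medoids_after if keep_swap(swap_cost) else medoids_before

print(previous_cost, new_cost, swap_cost)  # 20 22 2
print(final_medoids)                       # [(4, 5), (8, 5)]
```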

Advantages:
1. It is simple to understand and easy to implement.
2. The K-Medoids algorithm is fast and converges in a finite number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:
1. The main disadvantage of K-Medoids algorithms is that they are not suitable for clustering non-spherical
(arbitrarily shaped) groups of objects. This is because they rely on minimizing the distances between the
non-medoid objects and the medoid (the cluster center); in short, they use compactness rather than
connectivity as the clustering criterion.
2. They may obtain different results for different runs on the same dataset because the first k medoids
are chosen randomly.
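
One common way to make runs repeatable despite the random initialization is to fix the seed of the random number generator. The sketch below shows this with Python's standard `random` module; the function name, the seed value, and the data points are illustrative assumptions.

```python
import random

def initial_medoids(points, k, seed=None):
    """Pick k distinct starting medoids; a fixed seed gives a repeatable choice."""
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]   # hypothetical data

first = initial_medoids(points, 2, seed=42)
second = initial_medoids(points, 2, seed=42)
print(first == second)  # True: same seed, same starting medoids
```

Without a seed (`seed=None`) the starting medoids can differ from run to run, which is the source of the run-to-run variation described above.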
