

ML UNIT 5

1. Explain dendrogram in hierarchical cluster analysis. Write a program.

Ans- Hierarchical Clustering in Machine Learning
Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

The results of K-means clustering and hierarchical clustering may sometimes look similar, but the two methods work differently: unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance.

The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach in which the algorithm starts by treating each data point as a single cluster and merges clusters until only one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm; it is a top-down approach that starts with one cluster containing all the points and splits it recursively.

Why hierarchical clustering?

Since we already have other clustering algorithms such as K-Means, why do we need hierarchical clustering? As we saw with K-means, that algorithm has some challenges: the number of clusters must be fixed in advance, and it tends to create clusters of roughly the same size. Hierarchical clustering addresses these challenges because it does not require prior knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is the most popular example of HCA. To group the data into clusters, it follows a bottom-up approach: the algorithm treats each data point as a single cluster at the beginning and then repeatedly merges the closest pair of clusters. It continues until all the clusters are merged into a single cluster that contains the entire dataset.

This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the steps below:

o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N.
o Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, use the dendrogram to divide the clusters as required by the problem.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.

Measure for the distance between two clusters

As we have seen, the distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate this distance, and the choice decides the rule for merging clusters. These measures are called Linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: the distance between the closest points of the two clusters (the shortest distance).

2. Complete Linkage: the distance between the farthest points of the two clusters. It is a popular linkage method because it forms tighter clusters than single linkage.

3. Average Linkage: the distances between every pair of points, one from each cluster, are added up and divided by the number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: the distance between the centroids of the two clusters.

Any of the above approaches can be applied according to the type of problem or business requirement.
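As a small illustration of these linkage criteria (a minimal sketch assuming SciPy and a random toy array X, not taken from the original text), the different methods can be compared directly on the same data:

```python
# Comparing linkage methods with SciPy (illustrative sketch on random data)
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(0)
X = np.random.rand(20, 2)              # 20 hypothetical points in 2D

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)      # Z records every merge and its distance
    print(method, "-> height of final merge:", Z[-1, 2])
```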

Working of Dendrogram in Hierarchical clustering

The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering algorithm. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the diagram below: the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 first combine into a cluster, and correspondingly a dendrogram link is created that connects P2 and P3 with a rectangular shape. The height of the link is decided by the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.
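In code, cutting the tree at a chosen height is usually done with SciPy's fcluster; the following is a minimal sketch on an assumed random toy dataset:

```python
# Cutting a dendrogram at a chosen height to obtain flat clusters (sketch)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(0)
X = np.random.rand(20, 2)              # hypothetical data points

Z = linkage(X, method="ward")
labels = fcluster(Z, t=1.0, criterion="distance")  # cut the tree at height 1.0
print(labels)                          # flat cluster id for every data point
```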

Python Implementation of Agglomerative Hierarchical Clustering

Now we will see the practical implementation of the agglomerative hierarchical clustering algorithm using Python. To implement this, we will use the same dataset problem that we used in the previous topic of K-means clustering, so that we can compare both concepts easily.

The dataset contains information about customers who have visited a mall for shopping. The mall owner wants to find patterns or particular behavior of his customers using this information.

Steps for implementation of AHC using Python:

The steps for implementation will be the same as for k-means clustering, except for some changes such as the method used to find the number of clusters. Below are the steps:

1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters

Data Pre-processing Steps:


In this step, we will import the libraries and datasets for our model.

o Importing the libraries

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

The above lines of code import the libraries needed to perform specific tasks, such as numpy for mathematical operations, matplotlib for drawing graphs or scatter plots, and pandas for importing the dataset.

o Importing the dataset

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

As discussed above, we have imported the same Mall_Customers_data.csv dataset that we used in k-means clustering.
o Extracting the matrix of features

Here we will extract only the matrix of features as we don't have any further
information about the dependent variable. Code is given below:

x = dataset.iloc[:, [3, 4]].values



Here we have extracted only columns 3 and 4 because we will use a 2D plot to see the clusters. So, we are considering Annual Income and Spending Score as the matrix of features.

Step-2: Finding the optimal number of clusters using the Dendrogram

Now we will find the optimal number of clusters using the dendrogram for our model. For this, we are going to use the scipy library, as it provides a function that will directly return the dendrogram for our code. Consider the lines of code below:

#Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()

In the above lines of code, we have imported the hierarchy module of the scipy library. This module provides the method shc.dendrogram(), which takes the output of linkage() as a parameter. The linkage function defines the distance between two clusters; here we have passed x (the matrix of features) and the method "ward", a popular linkage method in hierarchical clustering.

The remaining lines of code are to describe the labels for the dendrogram plot.

Output:

By executing the above lines of code, we will get the below output:


Using this dendrogram, we will now determine the optimal number of clusters for our model. For this, we will find the longest vertical distance that is not cut by any horizontal bar. Consider the diagram below:

In the above diagram, we have shown the vertical distances that are not cut by horizontal bars. As we can see, the 4th distance looks the largest, so according to this the number of clusters will be 5 (the number of vertical lines in this range). We could also take the 2nd largest distance, as it is approximately equal to the 4th, but we will consider 5 clusters because that is what we calculated in the K-means algorithm.

So, the optimal number of clusters will be 5, and we will train the model in the
next step, using the same.

Step-3: Training the hierarchical clustering model


As we know the required optimal number of clusters, we can now train our model.
The code is given below:

#training the hierarchical model on dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)
In the above code, we have imported the AgglomerativeClustering class of the cluster module of the scikit-learn library.

Then we have created the object of this class, named hc. The AgglomerativeClustering class takes the following parameters:

o n_clusters=5: It defines the number of clusters; we have taken 5 here because it is the optimal number of clusters.
o affinity='euclidean': The metric used to compute the linkage.
o linkage='ward': It defines the linkage criterion; here we have used "ward" linkage, the popular linkage method that we already used for creating the dendrogram. It minimizes the variance within each cluster.
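Note that in newer versions of scikit-learn (1.2 and later) the affinity parameter has been renamed to metric, so on a recent install the equivalent call would look like the sketch below (an assumed variant of the code above, not part of the original text):

```python
# Equivalent call for scikit-learn >= 1.2, where `affinity` was renamed to `metric`
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)   # x is the matrix of features extracted earlier
```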

In the last line, we have created the dependent variable y_pred to fit or train the model. It not only trains the model but also returns the clusters to which each data point belongs.

After executing the above lines of code, if we go through the variable explorer option in our Spyder IDE, we can check the y_pred variable. We can compare the original dataset with the y_pred variable. Consider the image below:

As we can see in the above image, y_pred shows the cluster values, which means that customer id 1 belongs to the 5th cluster (as indexing starts from 0, so 4 means the 5th cluster), customer id 2 belongs to the 4th cluster, and so on.
Step-4: Visualizing the clusters
As we have trained our model successfully, now we can visualize the clusters
corresponding to the dataset.

Here we will use the same lines of code as we did in k-means clustering, except one
change. Here we will not plot the centroid that we did in k-means, because here we
have used dendrogram to determine the optimal number of clusters. The code is
given below:

#visualizing the clusters
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Output: By executing the above lines of code, we will get the below output:
In hierarchical cluster analysis, a dendrogram is a diagram that shows the
arrangement of the clusters produced by the clustering algorithm. It's a tree-
like structure where the leaves represent individual data points, and the
branches represent the merging of clusters at different similarity levels. The
height at which two clusters merge in the dendrogram indicates their similarity.

Here's a simple explanation of how to interpret a dendrogram:

1. Each data point starts as its own cluster at the bottom of the dendrogram.
2. As the algorithm progresses, similar data points or clusters are merged
together, represented by the merging of branches in the dendrogram.
3. The height at which two branches merge indicates the level of dissimilarity
between the clusters being merged. Lower merges imply higher similarity.

Now, let's write a Python program to generate a dendrogram using hierarchical clustering. We'll use the `scipy` library for clustering and `matplotlib` for visualization:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate random data for demonstration


np.random.seed(0)
X = np.random.rand(10, 2) # 10 data points in 2 dimensions

# Perform hierarchical clustering


linked = linkage(X, 'single')  # You can choose different linkage methods like 'complete', 'average', etc.

# Plot the dendrogram


plt.figure(figsize=(10, 5))
dendrogram(linked,
orientation='top',
labels=range(1, 11),
distance_sort='descending',
show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
```

This program generates random data points in 2 dimensions and performs hierarchical clustering using the `single` linkage method (you can change this to other linkage methods as needed). Then it plots the dendrogram using `matplotlib`. You can replace the `X` array with your own dataset if you have one.

2. Explain K-means clustering and write a program.

Ans- K-means Clustering – Introduction



K-Means Clustering is an unsupervised machine learning algorithm which groups an unlabeled dataset into different clusters. This article aims to explore the fundamentals and working of k-means clustering along with its implementation.
Table of Content
- What is K-means Clustering?
- What is the objective of k-means clustering?
- How k-means clustering works?
- Implementation of K-Means Clustering in Python
What is K-means Clustering?
Unsupervised machine learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous training on labeled data, the machine's job in this case is to organize unsorted data according to parallels, patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the centers of those clusters. It starts by randomly placing the cluster centroids in the space; then each data point is assigned to the cluster whose centroid is nearest; after all points are assigned, new cluster centroids are computed. This process runs iteratively until good clusters are found. In this analysis we assume that the number of clusters is given in advance and we have to put the points into one of the groups.
In some cases K is not clearly defined, and we have to think about the optimal value of K. K-means performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is faster than many other clustering techniques and provides strong coupling between the data points, but it does not give clear information about the quality of the clusters. Different initial placements of the cluster centroids may lead to different clusters, the algorithm is sensitive to noise, and it may get stuck in local minima.
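As a minimal sketch of how this sensitivity to initialization is commonly reduced in practice (assuming scikit-learn and a toy dataset, not part of the original article), KMeans can be run with k-means++ seeding and several random restarts:

```python
# Reducing sensitivity to initialization: k-means++ seeding plus several restarts
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)   # the best of 10 runs (lowest inertia) is kept
print(km.inertia_)           # within-cluster sum of squared distances
```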
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are
more comparable to one another and different from the data points within the
other groups. It is essentially a grouping of things based on how similar and
different they are to one another.
How k-means clustering works?
We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the K-means algorithm, an unsupervised learning
algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The
algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity, we will use the Euclidean distance as a measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster
centroids.
2. We categorize each item to its closest mean, and we update the mean’s
coordinates, which are the averages of the items categorized in that
cluster so far.
3. We repeat the process for a given number of iterations and at the end,
we have our clusters.
The “points” mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have a lot
of options. An intuitive method is to initialize the means at random items in the
data set. Another method is to initialize the means at random values between the
boundaries of the data set (if for a feature x, the items have values in [0,3], we
will initialize the means with values for x at [0,3]).
The above algorithm in pseudocode is as follows:
Initialize k means with random values

--> For a given number of iterations:

    --> Iterate through items:

        --> Find the mean closest to the item by calculating
            the Euclidean distance of the item with each of the means

        --> Assign item to mean

        --> Update mean by shifting it to the average of
            the items in that cluster
Implementation of K-Means Clustering in Python
Example 1
Import the necessary libraries
We are importing NumPy for statistical computations, Matplotlib to plot the graph, and make_blobs from sklearn.datasets.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Create the custom dataset with make_blobs and plot it

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Output:
Clustering dataset

Initialize the random centroids


The code initializes three clusters for K-means clustering. It sets a random seed, generates random cluster centers within a specified range, and creates an empty list of points for each cluster.

k = 3

clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    points = []
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

clusters
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
 1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
 2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
Plot the randomly initialized centers with the data points

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Output:

Data points with random center

The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It
also marks the initial cluster centers (red stars) generated for K-means
clustering.
Define Euclidean distance

def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Create the function to assign points and update the cluster centers
The E-step assigns data points to the nearest cluster center, and the M-step updates cluster centers based on the mean of assigned points in K-means clustering.

#Implementing the E-step
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

#Implementing the M-step
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
Create the function to predict the cluster for the data points

def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Assign, Update, and predict the cluster center

clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
Plot the data points with their predicted cluster center

plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:

K-means Clustering

The plot shows data points colored by their predicted clusters. The red markers
represent the updated cluster centers after the E-M steps in the K-means
clustering algorithm.
Example 2
Import the necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
Load the Dataset

X, y = load_iris(return_X_y=True)
Elbow Method
Finding the ideal number of groups to divide the data into is a basic stage in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow approach.

#Find the optimum number of clusters
sse = []  # sum of squared errors
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)
Plot the Elbow graph to find the optimum number of clusters

sns.set_style("whitegrid")
g = sns.lineplot(x=range(1, 11), y=sse)
g.set(xlabel="Number of clusters (k)",
      ylabel="Sum of Squared Errors",
      title='Elbow Method')
plt.show()
Output:
Elbow Method

From the above graph, we can observe an elbow-like bend at k=2 and k=3, so we are considering K=3.
Build the KMeans clustering model

kmeans = KMeans(n_clusters=3, random_state=2)
kmeans.fit(X)
Output:
KMeans
KMeans(n_clusters=3, random_state=2)
Find the cluster center

kmeans.cluster_centers_
Output:
array([[5.006 , 3.428 , 1.462 , 0.246 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])
Predict the cluster group:

pred = kmeans.fit_predict(X)
pred
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2,
2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1,
1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2,
1], dtype=int32)
Plot the cluster centers with the data points

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")

plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
Output:

K-means clustering

The subplot on the left displays sepal length vs. sepal width with the data points colored by cluster, and the red markers indicate the K-means cluster centers. The subplot on the right shows petal length vs. petal width similarly.
Conclusion
In conclusion, K-means clustering is a powerful unsupervised machine learning
algorithm for grouping unlabeled datasets. Its objective is to divide data into
clusters, making similar data points part of the same group. The algorithm
initializes cluster centroids and iteratively assigns data points to the nearest
centroid, updating centroids based on the mean of points in each cluster.

3. Explain hierarchical cluster analysis.

Ans- Hierarchical Clustering in Machine Learning
Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

The results of K-means clustering and hierarchical clustering may sometimes look similar, but the two methods work differently: unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance.

The hierarchical clustering technique has two approaches:

1. Agglomerative: a bottom-up approach in which the algorithm starts by treating each data point as a single cluster and merges clusters until only one cluster is left.
2. Divisive: the reverse of the agglomerative algorithm; it is a top-down approach.

Why hierarchical clustering?

Since we already have other clustering algorithms such as K-Means, why do we need hierarchical clustering? As we saw with K-means, that algorithm has some challenges: the number of clusters must be fixed in advance, and it tends to create clusters of roughly the same size. Hierarchical clustering addresses these challenges because it does not require prior knowledge of the number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is the most popular example of HCA. To group the data into clusters, it follows a bottom-up approach: the algorithm treats each data point as a single cluster at the beginning and then repeatedly merges the closest pair of clusters. It continues until all the clusters are merged into a single cluster that contains the entire dataset.
This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the below
steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N.
o Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, use the dendrogram to divide the clusters as required by the problem.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.

Measure for the distance between two clusters

As we have seen, the distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate this distance, and the choice decides the rule for merging clusters. These measures are called Linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: the distance between the closest points of the two clusters (the shortest distance).

2. Complete Linkage: the distance between the farthest points of the two clusters. It is a popular linkage method because it forms tighter clusters than single linkage.

3. Average Linkage: the distances between every pair of points, one from each cluster, are added up and divided by the number of pairs to give the average distance between the two clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: the distance between the centroids of the two clusters.

Any of the above approaches can be applied according to the type of problem or business requirement.

Working of Dendrogram in Hierarchical clustering

The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering algorithm. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the diagram below: the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.

o As discussed above, the data points P2 and P3 first combine into a cluster, and correspondingly a dendrogram link is created that connects P2 and P3 with a rectangular shape. The height of the link is decided by the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our
requirement.

Python Implementation of Agglomerative Hierarchical Clustering

Now we will see the practical implementation of the agglomerative hierarchical clustering algorithm using Python. To implement this, we will use the same dataset problem that we used in the previous topic of K-means clustering, so that we can compare both concepts easily.

The dataset contains information about customers who have visited a mall for shopping. The mall owner wants to find patterns or particular behavior of his customers using this information.

Steps for implementation of AHC using Python:

The steps for implementation will be the same as for k-means clustering, except for some changes such as the method used to find the number of clusters. Below are the steps:

1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters

Data Pre-processing Steps:

In this step, we will import the libraries and datasets for our model.

o Importing the libraries

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

The above lines of code import the libraries needed to perform specific tasks, such as numpy for mathematical operations, matplotlib for drawing graphs or scatter plots, and pandas for importing the dataset.

o Importing the dataset

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

As discussed above, we have imported the same Mall_Customers_data.csv dataset that we used in k-means clustering.

o Extracting the matrix of features

Here we will extract only the matrix of features as we don't have any
further information about the dependent variable. Code is given below:

x = dataset.iloc[:, [3, 4]].values



Here we have extracted only columns 3 and 4 because we will use a 2D plot to see the clusters. So, we are considering Annual Income and Spending Score as the matrix of features.
Step-2: Finding the optimal number of clusters using the Dendrogram

Now we will find the optimal number of clusters using the dendrogram for our model. For this, we are going to use the scipy library, as it provides a function that will directly return the dendrogram for our code. Consider the lines of code below:

#Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()

In the above lines of code, we have imported the hierarchy module of the scipy library. This module provides the method shc.dendrogram(), which takes the output of linkage() as a parameter. The linkage function defines the distance between two clusters; here we have passed x (the matrix of features) and the method "ward", a popular linkage method in hierarchical clustering.

The remaining lines of code are to describe the labels for the
dendrogram plot.

Output:

By executing the above lines of code, we will get the below output:
Using this dendrogram, we will now determine the optimal number of clusters for our model. For this, we will find the longest vertical distance that is not cut by any horizontal bar. Consider the diagram below:

In the above diagram, we have shown the vertical distances that are not cut by horizontal bars. As we can see, the 4th distance looks the largest, so according to this the number of clusters will be 5 (the number of vertical lines in this range). We could also take the 2nd largest distance, as it is approximately equal to the 4th, but we will consider 5 clusters because that is what we calculated in the K-means algorithm.

So, the optimal number of clusters will be 5, and we will train the
model in the next step, using the same.

Step-3: Training the hierarchical clustering model

As we know the required optimal number of clusters, we can now train


our model. The code is given below:

#training the hierarchical model on dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)

In the above code, we have imported the AgglomerativeClustering class of the cluster module of the scikit-learn library.

Then we have created the object of this class, named hc. The AgglomerativeClustering class takes the following parameters:

o n_clusters=5: It defines the number of clusters; we have taken 5 here because it is the optimal number of clusters.
o affinity='euclidean': The metric used to compute the linkage.
o linkage='ward': It defines the linkage criterion; here we have used "ward" linkage, the popular linkage method that we already used for creating the dendrogram. It minimizes the variance within each cluster.

In the last line, we have created the dependent variable y_pred to fit or train the model. It not only trains the model but also returns the clusters to which each data point belongs.

After executing the above lines of code, if we go through the variable explorer option in our Spyder IDE, we can check the y_pred variable. We can compare the original dataset with the y_pred variable. Consider the image below:

As we can see in the above image, y_pred shows the cluster values, which means that customer id 1 belongs to the 5th cluster (as indexing starts from 0, so 4 means the 5th cluster), customer id 2 belongs to the 4th cluster, and so on.

Step-4: Visualizing the clusters

As we have trained our model successfully, now we can visualize the


clusters corresponding to the dataset.

Here we will use the same lines of code as we did in k-means clustering,
except one change. Here we will not plot the centroid that we did in k-
means, because here we have used dendrogram to determine the
optimal number of clusters. The code is given below:

#visualizing the clusters
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Output: By executing the above lines of code, we will get the below
output:

4. Explain cluster analysis.

Ans- Clustering in Machine Learning


In the real world, not all the data we work with has a target variable. This kind of data cannot be analyzed using supervised learning algorithms; we need the help of unsupervised algorithms. One of the most popular types of analysis under unsupervised learning is cluster analysis. When the goal is to group similar data points in a dataset, we use cluster analysis. In practical situations, we can use cluster analysis for customer segmentation for targeted advertisements, or in medical imaging to find unknown or newly infected areas, and many more use cases that we will discuss further in this article.
Table of Content
- What is Clustering?
- Types of Clustering
- Uses of Clustering
- Types of Clustering Algorithms
- Applications of Clustering in different fields
- Frequently Asked Questions (FAQs) on Clustering
What is Clustering ?
The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis. This method is defined under the branch
of Unsupervised Learning, which aims at gaining insights from unlabelled data
points, that is, unlike supervised learning we don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric like Euclidean distance, cosine similarity, or Manhattan distance, and then groups the points with the highest similarity scores together.
For Example, In the graph given below, we can clearly see that there are 3
circular clusters forming on the basis of distance.
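As a small illustration of such similarity metrics (a sketch using SciPy's distance functions, not part of the original article):

```python
# Common distance metrics between two points (illustrative sketch)
from scipy.spatial.distance import euclidean, cityblock, cosine

p, q = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

print(euclidean(p, q))   # Euclidean (straight-line) distance
print(cityblock(p, q))   # Manhattan (L1) distance
print(cosine(p, q))      # cosine distance = 1 - cosine similarity (~0 here: same direction)
```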

Now it is not necessary that the clusters formed are circular in shape; the shape of the clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters. For example, in the graph given below we can see that the clusters formed are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
- Hard Clustering: each data point either belongs to a cluster completely or not. For example, say there are 4 data points and we have to cluster them into 2 clusters; each data point will belong to either cluster 1 or cluster 2.

  Data Point   Cluster
  A            C1
  B            C2
  C            C2
  D            C1

- Soft Clustering: instead of assigning each data point to exactly one cluster, a probability or likelihood of the point belonging to each cluster is evaluated. For example, say there are 4 data points and we have to cluster them into 2 clusters; we evaluate, for each data point, a probability of it belonging to both clusters.

  Data Point   Probability of C1   Probability of C2
  A            0.91                0.09
  B            0.3                 0.7
  C            0.17                0.83
  D            1                   0
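As a minimal sketch of the difference between the two (assuming scikit-learn and a toy dataset, not part of the original article), KMeans gives hard assignments while a Gaussian mixture model gives soft membership probabilities:

```python
# Hard vs. soft cluster assignments (illustrative sketch)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard[:4])                    # one cluster id (0 or 1) per point

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:4])    # per-point probability of belonging to each cluster
```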
Uses of Clustering
Now, before we discuss the types of clustering algorithms, we will go through the use cases of clustering. Clustering algorithms are majorly used for:
- Market Segmentation – Businesses use clustering to group their customers and use targeted advertisements to attract more audience.
- Market Basket Analysis – Shop owners analyze their sales to figure out which items are mostly bought together by customers. For example, in the USA, according to a study, diapers and beer were usually bought together by fathers.
- Social Network Analysis – Social media sites use your data to understand your browsing behaviour and provide you with targeted friend or content recommendations.
- Medical Imaging – Doctors use clustering to find diseased areas in diagnostic images like X-rays.
- Anomaly Detection – To find outliers in a stream of real-time data or to detect fraudulent transactions, we can use clustering to identify them.
- Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. You may then reduce an entire feature set to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID, and on the same principle, clustering can make complex datasets simpler.
These are some of the major and most common use cases of clustering; there are many more. Moving forward, we will discuss clustering algorithms that will help you perform the above tasks.
Types of Clustering Algorithms
At the surface level, clustering helps in the analysis of unstructured data.
Graphing, the shortest distance, and the density of the data points are a few of
the elements that influence cluster formation. Clustering is the process of
determining how related the objects are based on a metric called the similarity
measure. Similarity metrics are easier to locate in smaller sets of features. It gets
harder to create similarity measures as the number of features increases.
Depending on the type of clustering algorithm being utilized in data mining,
several techniques are employed to group the data from the datasets. In this part,
the clustering techniques are described. Various types of clustering algorithms
are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
We will be going through each of these types in brief.
1. Centroid-based Clustering (Partitioning methods)
Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measure chosen for these algorithms is Euclidean distance, Manhattan distance, or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values; each input data point is assigned to the cluster whose reference vector it is closest to. The primary drawback of these algorithms is the requirement to establish the number of clusters, "k", either intuitively or scientifically (using the Elbow Method) before the clustering system starts allocating the data points. Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are examples of this type of clustering.
2. Density-based Clustering (Model-based methods)
Density-based clustering, a model-based method, finds groups based on the
density of data points. Contrary to centroid-based clustering, which requires that
the number of clusters be predefined and is sensitive to initialization, density-
based clustering determines the number of clusters automatically and is less
susceptible to beginning positions. They are great at handling clusters of
different sizes and forms, making them ideally suited for datasets with
irregularly shaped or overlapping clusters. These methods manage both dense
and sparse data regions by focusing on local density and can distinguish clusters
with a variety of morphologies.
In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary
shaped clusters. Due to its preset number of cluster requirements and extreme
sensitivity to the initial positioning of centroids, the outcomes can vary.
Furthermore, the tendency of centroid-based approaches to produce spherical or
convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the
drawbacks of centroid-based techniques by autonomously choosing cluster
sizes, being resilient to initialization, and successfully capturing clusters of
various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.
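As a minimal sketch (assuming scikit-learn, not part of the original article), DBSCAN infers the number of clusters from the data density and marks sparse points as noise:

```python
# DBSCAN: density-based clustering that infers the number of clusters itself
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1, if present, marks noise points
```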
3. Connectivity-based Clustering (Hierarchical clustering)
A method for assembling related data points into hierarchical clusters is called
hierarchical clustering. Each data point is initially taken into account as a
separate cluster, which is subsequently combined with the clusters that are the
most similar to form one large cluster that contains all of the data points.
Think about how you may arrange a collection of items based on how similar
they are. Each object begins as its own cluster at the base of the tree when using
hierarchical clustering, which creates a dendrogram, a tree-like structure. The
closest pairings of clusters are then combined into larger clusters after the
algorithm examines how similar the objects are to one another. When every
object is in one cluster at the top of the tree, the merging process has finished.
Exploring various granularity levels is one of the fun things about hierarchical
clustering. To obtain a given number of clusters, you can select to cut
the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to
their family trees, where the nearest relatives are clustered together and the
wider branches signify more general connections. There are 2 approaches for hierarchical clustering:
- Divisive Clustering: follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
- Agglomerative Clustering: follows a bottom-up approach; here we consider all data points to be individual clusters, and then these clusters are merged together to make one big cluster containing all the data points.
4. Distribution-based Clustering
Using distribution-based clustering, data points are generated and organized
according to their propensity to fall into the same probability distribution (such
as a Gaussian, binomial, or other) within the data. The data elements are
grouped using a probability-based distribution that is based on statistical
distributions. Included are data objects that have a higher likelihood of being in
the cluster. A data point is less likely to be included in a cluster the further it is
from the cluster’s central point, which exists in every cluster.
A notable drawback of density and boundary-based approaches is the need to
specify the clusters a priori for some algorithms, and primarily the definition of
the cluster form for the bulk of algorithms. There must be at least one tuning or
hyper-parameter selected, and while doing so should be simple, getting it wrong
could have unanticipated repercussions. Distribution-based clustering has a
definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in
order to avoid overfitting, many clustering methods only work with simulated or
manufactured data, or when the bulk of the data points certainly belong to a
preset distribution. The most popular distribution-based clustering algorithm
is Gaussian Mixture Model.
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
2. Biology: It can be used for classification among different species of
plants and animals.
3. Libraries: It is used in clustering different books on the basis of
topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we
can determine the dangerous zones.
7. Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in
image data.
8. Genetics: Clustering is used to group genes that have similar
expression patterns and identify gene networks that work together in
biological processes.
9. Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
10.Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop
targeted solutions.
11.Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
12.Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
13.Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud
or other financial crimes.
14.Traffic analysis: Clustering is used to group similar patterns of traffic
data, such as peak hours, routes, and speeds, which can help in
improving transportation planning and infrastructure.
15.Social network analysis: Clustering is used to identify communities
or groups within social networks, which can help in understanding
social behavior, influence, and trends.
16.Cybersecurity: Clustering is used to group similar patterns of
network traffic or system behavior, which can help in detecting and
preventing cyberattacks.
17.Climate analysis: Clustering is used to group similar patterns of
climate data, such as temperature, precipitation, and wind, which can
help in understanding climate change and its impact on the
environment.
18.Sports analysis: Clustering is used to group similar patterns of player
or team performance data, which can help in analyzing player or team
strengths and weaknesses and making strategic decisions.
19.Crime analysis: Clustering is used to group similar patterns of crime
data, such as location, time, and type, which can help in identifying
crime hotspots, predicting future crime trends, and improving crime
prevention strategies.

5. Explain cluster validation and factor analysis.


Ans- Cluster validation and factor analysis are both important techniques in
machine learning for assessing and understanding data patterns. Let's delve
into each:

1. **Cluster Validation**:
Cluster validation is the process of evaluating the quality and validity of
clusters produced by a clustering algorithm. Since clustering is an unsupervised
learning technique, meaning there are no predefined labels for the data, it's
crucial to assess the clustering results objectively. There are several methods
for cluster validation, including:

- **Internal Validation**: Internal validation methods evaluate the clustering results using only the input data and the clustering algorithm itself. Common internal validation metrics include:
- **Silhouette Score**: Measures the cohesion and separation of clusters. A
higher silhouette score indicates better-defined clusters.
- **Davies-Bouldin Index**: Measures the average similarity between each
cluster and its most similar cluster, with lower values indicating better
clustering.
- **Calinski-Harabasz Index**: Ratio of the between-cluster dispersion and
within-cluster dispersion, where higher values suggest better-defined clusters.

- **External Validation**: External validation methods compare the clustering results to some ground truth or external criteria when available. This is common in scenarios where true labels or categories exist for the data. Common external validation metrics include:
- **Adjusted Rand Index (ARI)**: Measures the similarity between the true
labels and the clustering results, adjusted for chance.
- **Normalized Mutual Information (NMI)**: Measures the mutual
information between the true labels and the clustering results, normalized to
have a maximum value of 1.

By using these validation techniques, practitioners can assess the quality of clusters and choose the most appropriate clustering algorithm and parameters for their dataset.

2. **Factor Analysis**:
Factor analysis is a statistical method used to identify underlying factors or
latent variables that explain the correlations among observed variables. It's
commonly used in dimensionality reduction and data exploration. The goal of
factor analysis is to identify a smaller number of unobserved variables (factors)
that capture the essential information present in the original variables.

Factor analysis assumes that observed variables are influenced by one or more underlying factors, along with measurement error. It aims to uncover the structure of these underlying factors and their relationships with observed variables. There are different types of factor analysis, including:
- **Exploratory Factor Analysis (EFA)**: In EFA, the goal is to explore the
structure of the data and identify underlying factors without preconceived
hypotheses about the relationships among variables.
- **Confirmatory Factor Analysis (CFA)**: CFA tests a specific hypothesis
about the structure of the data based on prior theoretical knowledge or
assumptions. It seeks to confirm or reject a predefined factor structure.

Factor analysis can help in various tasks such as data reduction, identifying
hidden patterns or structures in data, and understanding the relationships
among variables. It's widely used in fields such as psychology, sociology, and
marketing research to analyze complex datasets with many interrelated
variables.
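As a minimal sketch of exploratory factor analysis in Python (assuming scikit-learn's FactorAnalysis and the iris data purely as an example, not part of the original answer):

```python
# Exploratory factor analysis: summarize observed variables with a few latent factors
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)      # 4 observed variables per sample

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)           # each sample's score on the 2 latent factors

print(fa.components_.shape)            # (2, 4): loadings of each factor on each variable
print(scores[:3])                      # factor scores for the first three samples
```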
