ML Unit 5
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work. In hierarchical clustering there is no
requirement to predetermine the number of clusters, as there is in the K-means algorithm.
o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
The closest distance between two clusters is crucial for hierarchical clustering. There
are various ways to calculate the distance between two clusters, and these ways decide
the rule for clustering. These measures are called linkage methods. Some of the popular
linkage methods are given below:
1. Single Linkage: It is the shortest distance between the closest points of the two
clusters.
2. Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single linkage.
From the above approaches, we can apply any of them according to the type
of problem or business requirement.
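The linkage method is typically passed as a parameter to the clustering routine. A minimal sketch using scipy (the small dataset and the choice of methods here are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Small illustrative dataset (assumed for this sketch)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])

# Different linkage methods change how cluster-to-cluster distance is measured
Z_single = linkage(X, method="single")      # closest points of two clusters
Z_complete = linkage(X, method="complete")  # farthest points of two clusters
Z_ward = linkage(X, method="ward")          # minimizes within-cluster variance

print(Z_complete)  # each row: [cluster_i, cluster_j, distance, sample_count]
```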
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.
We can cut the dendrogram tree structure at any level as per our requirement.
The dataset contains information about customers who have visited a mall for
shopping. The mall owner wants to find some patterns or particular behavior of his
customers using this dataset. The steps for implementation will be the same as in
k-means clustering, except for some changes such as the method used to find the
number of clusters. Below are the steps:
1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters
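In the first step, we import the libraries and the dataset for our model. The exact code is not reproduced in this unit; a minimal sketch, assuming the customer data sits in a CSV file named Mall_Customers.csv, could be:

```python
import numpy as np               # mathematical operations
import matplotlib.pyplot as plt  # graphs and scatter plots
import pandas as pd              # importing the dataset

# File name is an assumption for illustration
dataset = pd.read_csv('Mall_Customers.csv')
```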
The above lines of code are used to import the libraries to perform specific tasks,
such as numpy for the Mathematical operations, matplotlib for drawing the graphs
or scatter plot, and pandas for importing the dataset.
Here we will extract only the matrix of features as we don't have any further
information about the dependent variable. Code is given below:
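The extraction code itself is not shown; consistent with the description that follows, a minimal sketch would be:

```python
# Columns 3 and 4 hold Annual Income and Spending Score in this dataset
x = dataset.iloc[:, [3, 4]].values
```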
Here we have extracted only the 3rd and 4th columns, as we will use a 2D plot to see the
clusters. So, we are considering Annual Income and Spending Score as the matrix
of features.
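Now we will find the optimal number of clusters using the dendrogram. The scipy library provides a function that directly returns the dendrogram. The original lines are not reproduced here; a minimal sketch of this step, reusing plt and x from the sketches above, might be:

```python
import scipy.cluster.hierarchy as shc

# Build and draw the dendrogram from the matrix of features using Ward linkage
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
plt.title("Dendrogram Plot")
plt.ylabel("Euclidean Distances")
plt.xlabel("Customers")
plt.show()
```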
In the above lines of code, we have imported the hierarchy module of the scipy library.
This module provides the method shc.dendrogram(), which takes linkage() as a
parameter. The linkage function is used to define the distance between two clusters,
so here we have passed x (the matrix of features) and the method "ward", a popular
linkage method in hierarchical clustering.
The remaining lines of code are to describe the labels for the dendrogram plot.
Output:
By executing the above lines of code, we will get the below output:
Using this dendrogram, we will now determine the optimal number of clusters for
our model. For this, we will find the maximum vertical distance that does not cut
any horizontal bar. Consider the below diagram:
In the above diagram, we have shown the vertical distances that are not cutting their
horizontal bars. As we can visualize, the 4th distance looks the largest, so
according to this, the number of clusters will be 5 (the number of vertical lines in this
range). We could also take the 2nd distance, as it is approximately equal to the 4th, but
we will consider 5 clusters because that is the same number we calculated in the K-means
algorithm.
So, the optimal number of clusters will be 5, and we will train the model in the
next step, using the same.
Then we have created an object of the AgglomerativeClustering class (from
sklearn.cluster) named hc. The AgglomerativeClustering class takes parameters such as
n_clusters=5 (the number of clusters found from the dendrogram), the distance metric
(Euclidean), and linkage='ward'.
In the last line, we have created the dependent variable y_pred by fitting the model.
This not only trains the model but also returns the cluster to which each
data point belongs.
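The code being described is not shown in this unit; a minimal sketch consistent with the description, using scikit-learn's AgglomerativeClustering on the feature matrix x from the earlier sketch, could be:

```python
from sklearn.cluster import AgglomerativeClustering

# 5 clusters as determined from the dendrogram; Ward linkage (Euclidean distance)
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')

# fit_predict both trains the model and returns the cluster index of every point
y_pred = hc.fit_predict(x)
```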
After executing the above lines of code, if we go through the variable explorer option
in our Spyder IDE, we can check the y_pred variable. We can compare the original
dataset with the y_pred variable. Consider the below image:
As we can see in the above image, y_pred shows the cluster values, which means
customer id 1 belongs to the 5th cluster (as indexing starts from 0, so 4 means the
5th cluster), customer id 2 belongs to the 4th cluster, and so on.
Step-4: Visualizing the clusters
As we have trained our model successfully, now we can visualize the clusters
corresponding to the dataset.
Here we will use the same lines of code as we did in k-means clustering, except for one
change: we will not plot the centroids as we did in k-means, because here we have
used the dendrogram to determine the optimal number of clusters. The code is
given below:
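The plotting code is not reproduced here; a minimal sketch, assuming the feature matrix x and the labels y_pred from the sketches above, could be:

```python
# Plot each of the 5 clusters with its own colour (no centroids are plotted)
plt.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s=100, c='blue', label='Cluster 1')
plt.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s=100, c='green', label='Cluster 2')
plt.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s=100, c='red', label='Cluster 3')
plt.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```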
Output: By executing the above lines of code, we will get the below output:
In hierarchical cluster analysis, a dendrogram is a diagram that shows the
arrangement of the clusters produced by the clustering algorithm. It's a tree-
like structure where the leaves represent individual data points, and the
branches represent the merging of clusters at different similarity levels. The
height at which two clusters merge in the dendrogram indicates their similarity.
1. Each data point starts as its own cluster at the bottom of the dendrogram.
2. As the algorithm progresses, similar data points or clusters are merged
together, represented by the merging of branches in the dendrogram.
3. The height at which two branches merge indicates the level of dissimilarity
between the clusters being merged. Lower merges imply higher similarity.
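The merge heights can also be inspected directly in the linkage matrix that scipy uses to draw the dendrogram; a small illustrative sketch (the sample points are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five illustrative 2-D points
pts = np.array([[0.0, 0.0], [0.1, 0.1], [4.0, 4.0], [4.1, 4.2], [9.0, 9.0]])

Z = linkage(pts, method='ward')
# Each row of Z is one merge: [cluster_i, cluster_j, merge_height, points_in_new_cluster]
# Lower merge heights correspond to more similar clusters
print(Z)

dendrogram(Z)
plt.show()
```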
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
```
Create the custom dataset with make_blobs and plot it
Python3
# make_blobs parameters here are assumed for illustration
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Output:
Clustering dataset
k = 3
clusters = {}
np.random.seed(23)

# Randomly initialize each cluster center (initialization formula assumed: uniform in [-2, 2])
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster

clusters
Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
 1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
 2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}
Plot the randomly initialized centers with the data points
Python3
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Output:
The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It
also marks the initial cluster centers (red stars) generated for K-means
clustering.
Define Euclidean distance
Python3
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Create the function to Assign and Update the cluster center
The E-step assigns data points to the nearest cluster center, and the M-step
updates cluster centers based on the mean of assigned points in K-means
clustering.
Python3
# Implementing E step: assign each data point to its nearest cluster center
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

# Implementing M step: move each center to the mean of its assigned points
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
        clusters[i]['points'] = []
    return clusters
Step 7: Create the function to predict the cluster for the data points
Python3
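# The pred_cluster helper used below is not shown in the original code; this is a
# minimal sketch (an assumption), reusing the distance() helper and the clusters
# dictionary defined above.
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = [distance(X[i], clusters[j]['center']) for j in range(k)]
        pred.append(np.argmin(dist))
    return pred
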
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
Plot the data points with their predicted cluster center
Python3
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:
K-means Clustering
The plot shows data points colored by their predicted clusters. The red markers
represent the updated cluster centers after the E-M steps in the K-means
clustering algorithm.
Example 2
Import the necessary libraries
Python3
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
Load the Dataset
Python3
X, y = load_iris(return_X_y=True)
Elbow Method
Finding the ideal number of groups to divide the data into is a basic stage in any
unsupervised algorithm. One of the most common techniques for figuring out
this ideal value of k is the elbow approach.
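The sse values plotted below are not computed in the code shown; a minimal sketch, assuming scikit-learn's KMeans and using its inertia_ attribute as the sum of squared errors, might be:

```python
# Sum of squared errors (inertia) for k = 1..10; hyperparameters are illustrative
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)
```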
Python3
sns.set_style("whitegrid")
g=sns.lineplot(x=range(1,11), y=sse)
plt.show()
Output:
Elbow Method
From the above graph, we can observe an elbow-like bend at k=2 and k=3.
So, we consider K=3.
Build the KMeans clustering model
Python3
# n_clusters=3 follows the elbow analysis above; the other settings are illustrative
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2)
kmeans.fit(X)
kmeans.cluster_centers_
Output:
array([[5.006 , 3.428 , 1.462 , 0.246 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])
Predict the cluster group:
Python3
pred = kmeans.fit_predict(X)
pred
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2,
2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1,
1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2,
1], dtype=int32)
Plot the cluster center with data points
Python3
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
Output:
K-means clustering
The subplot on the left displays sepal length vs. sepal width with data points
colored by cluster, and red markers indicate the K-means cluster centers. The
subplot on the right shows petal length vs. petal width similarly.
Conclusion
In conclusion, K-means clustering is a powerful unsupervised machine learning
algorithm for grouping unlabeled datasets. Its objective is to divide data into
clusters, making similar data points part of the same group. The algorithm
initializes cluster centroids and iteratively assigns data points to the nearest
centroid, updating centroids based on the mean of points in each cluster.
Now, it is not necessary that the clusters formed are circular in shape. The
shape of clusters can be arbitrary, and there are many algorithms that work well
at detecting arbitrarily shaped clusters.
For example, in the below-given graph we can see that the clusters formed are
not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering: In this type of clustering, each data point either belongs to
a cluster completely or not. For example, let's say there are 4 data
points and we have to cluster them into 2 clusters. So each data point
will either belong to cluster 1 or cluster 2.

| Data Points | Clusters |
|-------------|----------|
| A           | C1       |
| B           | C2       |
| C           | C2       |
| D           | C1       |
Soft Clustering: In this type of clustering, instead of assigning each
data point to a single cluster, a probability or likelihood of that
point belonging to each cluster is evaluated. For example, let's say there are 4
data points and we have to cluster them into 2 clusters. So we will
evaluate the probability of each data point belonging to both clusters.
This probability is calculated for all data points.

| Data Points | Probability of C1 | Probability of C2 |
|-------------|-------------------|-------------------|
| A           | 0.91              | 0.09              |
| B           | 0.3               | 0.7               |
| C           | 0.17              | 0.83              |
| D           | 1                 | 0                 |
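In code terms, the difference can be sketched with scikit-learn: KMeans gives hard assignments, while a Gaussian mixture gives soft (probabilistic) assignments. The toy data below is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Four illustrative 2-D points (A, B, C, D)
X = np.array([[1.0, 1.0], [8.0, 8.0], [7.5, 8.5], [1.2, 0.8]])

# Hard clustering: each point gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)           # one cluster id per point

# Soft clustering: each point gets a probability of belonging to every cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))  # one row of membership probabilities per point
```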
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through
the use cases of Clustering algorithms. Clustering algorithms are majorly used
for:
Market Segmentation – Businesses use clustering to group their
customers and use targeted advertisements to attract more audience.
Market Basket Analysis – Shop owners analyze their sales and figure
out which items are majorly bought together by the customers. For
example, In USA, according to a study diapers and beers were usually
bought together by fathers.
Social Network Analysis – Social media sites use your data to
understand your browsing behaviour and provide you with targeted
friend recommendations or content recommendations.
Medical Imaging – Doctors use Clustering to find out diseased areas
in diagnostic images like X-rays.
Anomaly Detection – Clustering can be used to find outliers in a stream of
real-time data or to flag fraudulent transactions.
Simplify working with large datasets – Each cluster is given a cluster
ID after clustering is complete. Now, you may reduce an entire feature
set to its cluster ID. Clustering is effective when it can
represent a complicated case with a straightforward cluster ID. Using
the same principle, clustering can make complex datasets simpler.
There are many more use cases for clustering, but these are some of the major
and common ones. Moving forward, we will discuss clustering
algorithms that will help you perform the above tasks.
Types of Clustering Algorithms
At the surface level, clustering helps in the analysis of unstructured data.
Graphing, the shortest distance, and the density of the data points are a few of
the elements that influence cluster formation. Clustering is the process of
determining how related the objects are based on a metric called the similarity
measure. Similarity metrics are easier to locate in smaller sets of features. It gets
harder to create similarity measures as the number of features increases.
Depending on the type of clustering algorithm being utilized in data mining,
several techniques are employed to group the data from the datasets. In this part,
the clustering techniques are described. Various types of clustering algorithms
are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
We will be going through each of these types in brief.
1. Centroid-based Clustering (Partitioning methods)
Partitioning methods are the simplest clustering algorithms. They group data
points on the basis of their closeness. Generally, the similarity measure chosen
for these algorithms is Euclidean distance, Manhattan distance, or Minkowski
distance. The dataset is separated into a predetermined number of clusters,
and each cluster is referenced by a vector of values (its centroid); each input
data point is assigned to the cluster whose centroid it is closest to.
The primary drawback of these algorithms is the requirement to establish
the number of clusters, "k," either intuitively or scientifically (using the Elbow
Method) before the clustering algorithm starts allocating the
data points. Despite this, it is still the most popular type of clustering. K-
means and K-medoids clustering are examples of this type of clustering.
2. Density-based Clustering (Model-based methods)
Density-based clustering, a model-based method, finds groups based on the
density of data points. Contrary to centroid-based clustering, which requires that
the number of clusters be predefined and is sensitive to initialization, density-
based clustering determines the number of clusters automatically and is less
susceptible to beginning positions. They are great at handling clusters of
different sizes and forms, making them ideally suited for datasets with
irregularly shaped or overlapping clusters. These methods manage both dense
and sparse data regions by focusing on local density and can distinguish clusters
with a variety of morphologies.
In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary
shaped clusters. Due to its preset number of cluster requirements and extreme
sensitivity to the initial positioning of centroids, the outcomes can vary.
Furthermore, the tendency of centroid-based approaches to produce spherical or
convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the
drawbacks of centroid-based techniques by autonomously choosing cluster
sizes, being resilient to initialization, and successfully capturing clusters of
various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.
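A minimal DBSCAN sketch with scikit-learn, using a synthetic non-convex dataset from make_moons; the eps and min_samples values are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(set(labels))  # cluster ids found automatically; -1 marks noise points
```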
3. Connectivity-based Clustering (Hierarchical clustering)
A method for assembling related data points into hierarchical clusters is called
hierarchical clustering. Each data point is initially taken into account as a
separate cluster, which is subsequently combined with the clusters that are the
most similar to form one large cluster that contains all of the data points.
Think about how you may arrange a collection of items based on how similar
they are. Each object begins as its own cluster at the base of the tree when using
hierarchical clustering, which creates a dendrogram, a tree-like structure. The
closest pairings of clusters are then combined into larger clusters after the
algorithm examines how similar the objects are to one another. When every
object is in one cluster at the top of the tree, the merging process has finished.
Exploring various granularity levels is one of the fun things about hierarchical
clustering. To obtain a given number of clusters, you can choose to cut
the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to
their family trees, where the nearest relatives are clustered together and the
wider branches signify more general connections. There are 2 approaches for
Hierarchical clustering:
Divisive Clustering: It follows a top-down approach; here we
consider all data points to be part of one big cluster, and then this cluster
is divided into smaller groups.
Agglomerative Clustering: It follows a bottom-up approach; here we
consider each data point to be an individual cluster, and then these
clusters are merged together to make one big cluster with all data
points.
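To cut the dendrogram at a particular height, as mentioned above, scipy's fcluster function can be used; a minimal sketch with assumed sample data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data points
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0]])

Z = linkage(X, method='ward')

# Cut the tree at height 2.0: every merge above this height is ignored
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)  # cluster id per point
```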
4. Distribution-based Clustering
In distribution-based clustering, data points are grouped according to their
propensity to fall under the same probability distribution (such as a Gaussian,
binomial, or other distribution) within the data. The data elements are grouped
using a probability-based distribution that is based on statistical distributions:
data objects with a higher likelihood of belonging to a distribution are included
in the corresponding cluster. Every cluster has a central point, and the further
a data point is from that central point, the less likely it is to be included in
the cluster.
A notable drawback of density- and boundary-based approaches is that some
algorithms require the number of clusters to be specified a priori, and most
require the cluster shape to be defined. At least one tuning or
hyper-parameter must be selected, and while doing so should be simple, getting it wrong
could have unanticipated repercussions. Distribution-based clustering has a
definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in
order to avoid overfitting, many clustering methods only work with simulated or
manufactured data, or when the bulk of the data points certainly belong to a
preset distribution. The most popular distribution-based clustering algorithm
is Gaussian Mixture Model.
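A minimal Gaussian Mixture Model sketch with scikit-learn; the synthetic data and the number of components are assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from three Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Fit a mixture of three Gaussians and read off hard and soft assignments
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
print(gmm.means_)                # estimated centres of the distributions
print(gmm.predict(X)[:10])       # most likely component per point
print(gmm.predict_proba(X)[:3])  # membership probabilities per point
```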
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
2. Biology: It can be used for classification among different species of
plants and animals.
3. Libraries: It is used in clustering different books on the basis of
topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we
can determine the dangerous zones.
7. Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in
image data.
8. Genetics: Clustering is used to group genes that have similar
expression patterns and identify gene networks that work together in
biological processes.
9. Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
10.Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop
targeted solutions.
11.Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
12.Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
13.Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud
or other financial crimes.
14.Traffic analysis: Clustering is used to group similar patterns of traffic
data, such as peak hours, routes, and speeds, which can help in
improving transportation planning and infrastructure.
15.Social network analysis: Clustering is used to identify communities
or groups within social networks, which can help in understanding
social behavior, influence, and trends.
16.Cybersecurity: Clustering is used to group similar patterns of
network traffic or system behavior, which can help in detecting and
preventing cyberattacks.
17.Climate analysis: Clustering is used to group similar patterns of
climate data, such as temperature, precipitation, and wind, which can
help in understanding climate change and its impact on the
environment.
18.Sports analysis: Clustering is used to group similar patterns of player
or team performance data, which can help in analyzing player or team
strengths and weaknesses and making strategic decisions.
19.Crime analysis: Clustering is used to group similar patterns of crime
data, such as location, time, and type, which can help in identifying
crime hotspots, predicting future crime trends, and improving crime
prevention strategies.
1. **Cluster Validation**:
Cluster validation is the process of evaluating the quality and validity of
clusters produced by a clustering algorithm. Since clustering is an unsupervised
learning technique, meaning there are no predefined labels for the data, it's
crucial to assess the clustering results objectively. There are several methods
for cluster validation, including internal measures such as the silhouette
coefficient and the Davies-Bouldin index, and external measures such as the
adjusted Rand index when ground-truth labels are available.
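For example, the silhouette coefficient and the Davies-Bouldin index can be computed with scikit-learn; the data and the choice of k below are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Higher silhouette (close to 1) and lower Davies-Bouldin indicate better-separated clusters
print(silhouette_score(X, labels))
print(davies_bouldin_score(X, labels))
```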
2. **Factor Analysis**:
Factor analysis is a statistical method used to identify underlying factors or
latent variables that explain the correlations among observed variables. It's
commonly used in dimensionality reduction and data exploration. The goal of
factor analysis is to identify a smaller number of unobserved variables (factors)
that capture the essential information present in the original variables.
Factor analysis can help in various tasks such as data reduction, identifying
hidden patterns or structures in data, and understanding the relationships
among variables. It's widely used in fields such as psychology, sociology, and
marketing research to analyze complex datasets with many interrelated
variables.
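A minimal factor analysis sketch using scikit-learn's FactorAnalysis on the iris measurements; the choice of two latent factors is an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

# Reduce the four correlated measurements to two latent factors
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)

print(fa.components_.shape)  # (2, 4): loadings of each factor on the original variables
print(scores[:5])            # factor scores for the first five samples
```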