Clustering: Agglomerative, Divisive, DBSCAN
• Data mining is the process of searching large data sets for patterns and trends that cannot be found using simple analysis techniques
• Data mining is the process of extracting information from a given data set to identify trends, patterns, and useful data. The objective is to make data-supported decisions from enormous data sets.
• It recognizes patterns in data sets for a set of problems that belong to a specific domain
• It is a technique for investigating patterns in data from particular perspectives. This helps us categorize that data into useful information. This useful information is then accumulated and assembled, to be either stored in database servers such as data warehouses or used in data mining algorithms and analysis to support decision making.
• It can be used for revenue generation and cost-cutting, among other purposes
Data Mining …
• Data mining is used by businesses to draw out specific information from large volumes of data to find solutions to their business problems.
• It can transform raw data into information that helps businesses grow by making better decisions.
• Data mining has several types, including pictorial data mining, text mining, social media mining, web mining, and audio and video mining, among others.
Data Mining Techniques
1. Clustering
2. Association Rule
3. Classification
4. Prediction
5. Sequential patterns
Clustering
• Clustering is a way to group a set of data points such that similar data points are grouped together
• All clustering methods follow the same basic approach: first we calculate similarities, and then we use them to group the data points into clusters or batches.
Clustering and Association Rule in Data Mining
• Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
• Association rule mining (ARM) is a method for discovering interesting relations between variables in large databases.
• Both clustering and association rule mining fall within the field of unsupervised machine learning.
• Clustering is about the data points themselves; ARM (Association Rule Mining) is about finding relationships between the attributes of those data points
Clustering and Association Rule
• The problem of finding frequent item sets (ARM) differs from similarity search (clustering).
• Clustering is used, for example, to determine plant and animal taxonomies and to categorize genes with the same functionality
Requirements for Clustering
1. Scalability
2. Interpretability
3. High dimensionality
Clustering in Data Mining
• The first step is to partition the data set into groups based on data similarity; then the groups are assigned their respective labels.
• Hierarchical Method
• Constraint-Based Method
• Grid-Based Method
• Partitioning Method
• Density-Based Method
Hierarchical Clustering in Data Mining
• A dendrogram is an inverted tree that describes the order in which clusters are merged (bottom-up view) or broken up (top-down view)
1. Agglomerative method
is a bottom-up method
Algorithm:
1. Consider every data point as an individual cluster
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix)
3. Merge the clusters that are most similar, i.e. closest to each other
4. Recalculate the proximity matrix for the merged clusters
5. Repeat steps 3 and 4 until only a single cluster remains
The merge order of this algorithm can be represented graphically as a dendrogram.
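A minimal sketch of these steps using scipy's agglomerative implementation (the data values here are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# Hypothetical 2-D data points
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# linkage() runs the loop above: every point starts as its own cluster,
# and the closest pair of clusters is merged repeatedly
Z = shc.linkage(X, method='ward')

# Each row of Z records one merge: (cluster i, cluster j, distance, new size)
print(Z)

# The merge order can be drawn as a dendrogram
shc.dendrogram(Z)
plt.show()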
Example: Agglomerative
• six data points A, B, C, D, E, F
2. Divisive method
is a top-down method: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.
The clusters are split (divided) at each step, hence the name divisive hierarchical clustering.
ML - Types of Linkages in Clustering
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid Linkage
1. In Single Linkage, the distance between two clusters is the minimum distance between any point in one cluster and any point in the other
2. In Complete Linkage, the distance between two clusters is the maximum distance between any point in one cluster and any point in the other
3. In Average Linkage, the distance between two clusters is the average of all pairwise distances between points in the two clusters
4. In Centroid Linkage, the distance between two clusters is the distance between their centroids
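In scipy, the linkage criterion is chosen with the method argument; a sketch (X is any (n_samples, n_features) array, as in the earlier sketch):

import scipy.cluster.hierarchy as shc

# Same hierarchy-building loop, different cluster-distance definitions
Z_single   = shc.linkage(X, method='single')    # minimum pairwise distance
Z_complete = shc.linkage(X, method='complete')  # maximum pairwise distance
Z_average  = shc.linkage(X, method='average')   # mean of all pairwise distances
Z_centroid = shc.linkage(X, method='centroid')  # distance between centroids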
Clustering Using Single Linkage
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform, pdist

# 'data' is assumed here: a DataFrame with columns 'a' and 'b' and index
# labels P1..P5 (the original slide does not show the actual values)
data = pd.DataFrame({'a': [0.07, 0.85, 0.66, 0.49, 0.80],
                     'b': [0.83, 0.14, 0.09, 0.19, 0.46]},
                    index=['P1', 'P2', 'P3', 'P4', 'P5'])

plt.figure(figsize=(8, 5))
plt.scatter(data['a'], data['b'], c='r', marker='*')
plt.xlabel('Column a')
plt.ylabel('Column b')
plt.title('Scatter Plot of a and b')
for j in data.itertuples():        # label each point on the plot
    plt.annotate(j.Index, (j.a, j.b), fontsize=15)
[Figure: scatter plot of columns a and b]
Step 2: Calculate the distance matrix using the Euclidean metric with pdist
Distance Matrix
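A sketch of this step, continuing from the code above (the data values there are assumed):

# pdist computes the condensed pairwise Euclidean distances;
# squareform expands them into a full symmetric matrix
dist_matrix = pd.DataFrame(squareform(pdist(data[['a', 'b']], metric='euclidean')),
                           index=data.index, columns=data.index)
print(dist_matrix.round(5))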
Step 3: Look for the least distance and merge those points into a cluster
The points P3 and P4 have the least distance, 0.30232, so they are merged into a cluster first
Step 4: Recompute the distance matrix after forming a cluster
• After recomputing the distance matrix, look for the least distance to make the next cluster
The points P2 and P5 have the least distance, 0.32388. Group them into a cluster and recompute the distance matrix.
With single linkage, update the distance between the cluster (P2, P5) and P1:
dist((P2, P5), P1) = min(dist(P2, P1), dist(P5, P1))
                   = min(1.04139, 0.81841)
                   = 0.81841
After recomputing the distance matrix, again look for the least distance.
The cluster (P2, P5) has the least distance to the cluster (P3, P4), 0.35847, so cluster them together.
Update the distance between the cluster ((P3, P4), (P2, P5)) and P1:
dist(((P3, P4), (P2, P5)), P1) = min(dist((P3, P4), P1), dist((P2, P5), P1))
                               = min(0.46098, 0.81841)
                               = 0.46098
The length of the vertical lines in the dendrogram shows the distance; for example, the distance between the points P2 and P5 is 0.32388.
The step-by-step clustering above matches the merge order shown in the dendrogram.
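A sketch of drawing this dendrogram with single linkage, continuing from the assumed data above:

plt.figure(figsize=(8, 5))
plt.title('Dendrogram (single linkage)')
# method='single' reproduces the step-by-step single-linkage merges above
dend = shc.dendrogram(shc.linkage(data[['a', 'b']], method='single'),
                      labels=data.index.tolist())
plt.ylabel('Distance')
plt.show()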
Steps to Perform Hierarchical Clustering
• Calculate similarity: take the distance between the centroids of these clusters
• The points with the least distance are referred to as similar points, and we can merge them.
• Step 1: Create a proximity matrix, which shows the distance between each pair of points
Use the Euclidean distance formula to calculate the distances; for example, for two one-dimensional points with values 10 and 7:
√((10 − 7)²) = √9 = 3
For five points this gives a 5 × 5 proximity matrix
Steps to Perform Hierarchical Clustering...
• Step 2: Look for the smallest distance in the proximity matrix, merge the points with the smallest distance, and update the proximity matrix.
Here the smallest distance is 3, so we merge points 1 and 2.
Note: the number of clusters equals the number of vertical dendrogram lines intersected by a horizontal line drawn at the chosen threshold.
Hierarchical Clustering in Python
• Dataset: Wholesale customers data.csv (Kaggle) https://www.kaggle.com/binovi/wholesale-customers-data-set
• We will first import the required libraries:
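A sketch of this step (the CSV file name follows the Kaggle link above; the path is assumed):

import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering

# Load the Kaggle wholesale customers dataset (local path assumed)
data = pd.read_csv('Wholesale customers data.csv')
print(data.head())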
Normalization
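A sketch of the normalization step, using sklearn's normalize as one common choice:

# normalize() scales each sample (row) to unit norm, so distance comparisons
# are not dominated by customers with large total spend
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
print(data_scaled.head())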
Dendrogram to decide the number of Clusters
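A sketch of drawing the dendrogram on the scaled data; the longest vertical branches without a horizontal crossing suggest where to cut:

plt.figure(figsize=(10, 7))
plt.title('Dendrogram')
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
# A horizontal line cutting the two tallest branches suggests 2 clusters;
# the threshold value here is illustrative
plt.axhline(y=6, color='r', linestyle='--')
plt.show()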
The output contains 0s and 1s since we have 2 clusters: 0 represents the points that belong to the first cluster and 1 represents points in the second cluster.
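Labels like these would come from a fit such as the following sketch:

cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = cluster.fit_predict(data_scaled)
print(labels)  # one 0/1 label per customer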
Visualize two clusters
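A sketch of the visualization, plotting two spend columns (column names as in the Kaggle file) coloured by cluster label:

plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=labels)
plt.xlabel('Milk')
plt.ylabel('Grocery')
plt.show()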
Clustering
Clustering is a way to group a set of data points such that similar data points are grouped together.
There are different approaches and algorithms for performing clustering tasks, which can be divided into three sub-categories.
• DBSCAN is a well-known data clustering algorithm commonly used in data mining and machine learning
• It is the base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data containing noise and outliers.
• It groups together points that are close to each other based on a distance measure (usually Euclidean distance) and a minimum number of points.
• It defines clusters as continuous regions of high density and works well if all the clusters are dense enough and well separated by low-density regions.
Density-based clustering …
• Finds arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• eps: the distance that specifies the neighbourhoods; two points are considered neighbours if the distance between them is <= eps (a radius around each point)
• minPts: the minimum number of data points required around a point within that radius to define a cluster
Points in DBSCAN
Based on the two parameters eps and minPts, points are classified as
• Core point: A point is a core point if there are at least minPts number of points
(including the point itself) in its surrounding area with radius eps.
• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPts points within its surrounding area.
• Outlier: A point is an outlier if it is not a core point and not reachable from any
core points.
Points in DBSCAN …
DBSCAN Example
DBSCAN…
eps = 0.6 and minPts = 4
DBSCAN Algorithm …
1. Choose values for eps and MinPts
2. For a particular data point (x), calculate its distance from every other data point.
3. Find all the neighbourhood points of x that fall inside the circle of radius eps, i.e. whose distance from x is less than or equal to eps.
4. Mark x as visited. If the number of neighbourhood points around x is >= MinPts, then x is a core point, and if it is not assigned to any cluster, create a new cluster and assign x to it.
5. If the number of neighbourhood points around x is < MinPts and x has a core point in its neighbourhood, treat x as a border point.
6. Repeat the above steps for every unvisited point in the data set to find all core, border, and outlier points.
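A sketch of these steps using sklearn's DBSCAN, which exposes the core points directly (the data values here are hypothetical):

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two dense groups, one border point, one outlier
X = np.array([[1.0, 2.0], [1.2, 2.5], [1.0, 3.0], [1.5, 2.5], [2.4, 2.5],
              [8.0, 8.0], [8.2, 8.3], [8.1, 7.9], [25.0, 80.0]])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points (step 4)
noise_mask = db.labels_ == -1               # outliers are labelled -1
border_mask = ~core_mask & ~noise_mask      # in a cluster but not core (step 5)

print('labels: ', db.labels_)
print('core:   ', np.where(core_mask)[0])
print('border: ', np.where(border_mask)[0])
print('outlier:', np.where(noise_mask)[0])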
DBSCAN…
1. Direct density reachable: A point is called direct density reachable if it has a core point in its
neighbourhood. Consider the point (1, 2), it has a core point (1.2, 2.5) in its neighbourhood, hence, it
will be a direct density reachable point.
2. Density Reachable: A point is called density reachable from another point if they are connected
through a series of core points. For example, consider the points (1, 3) and (1.5, 2.5), since they are
connected through a core point (1.2, 2.5), they are called density reachable from each other.
3. Density Connected: Two points are called density connected if there is a core point which is
density reachable from both the points.
Implementation
Use datasets from:
• https://scikit-learn.org/stable/modules/clustering.html#dbscan
• https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
• https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
• Example:
• https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-download-auto-examples-cluster-plot-dbscan-py
Implementation…
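The linked scikit-learn example clusters synthetic data; a simplified sketch along the same lines (using make_moons so the arbitrary-shape property is visible):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: a shape that centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_

# Noise points get the label -1 and do not count as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated clusters:', n_clusters)
print('Noise points:', list(labels).count(-1))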
Visualize the clusters determined by DBSCAN
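A sketch of the plot, continuing from the code above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
# Colour each point by its cluster label; noise (-1) gets its own colour
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.title('Clusters determined by DBSCAN')
plt.show()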
[Figure: clusters determined by DBSCAN]