
Clustering: Agglomerative, Divisive, DBSCAN


What is Data Mining?

• Data mining is the process of searching large sets of data for patterns and trends that cannot be found using simple analysis techniques

• Involves exploring and analysing large blocks of information to extract meaningful patterns and trends

• Data mining is also known as Knowledge Discovery in Data (KDD)


Introduction to Data Mining

• Automatic summarization of data,

• Extraction of the "essence" of information stored

• The discovery of patterns in raw data


Data Mining

• Data mining is the process that helps in extracting information from a given data set to identify trends,
patterns, and useful data. The objective is to make data-supported decisions from enormous data sets.

• Recognizes patterns in datasets for a set of problems that belong to a specific domain

• A technique of investigating patterns in data from particular perspectives. This helps us categorize that data into useful information, which is then accumulated and assembled to either be stored in database servers, such as data warehouses, or used in data mining algorithms and analysis to support decision making.

• It can be used for revenue generation and cost-cutting amongst other purposes
Data Mining …

• Data mining is used by businesses to draw out specific information from large
volumes of data to find solutions to their business problems.

• It has the capability of transforming raw data into information that can help businesses grow by making better decisions.

• Data mining has several types, including pictorial data mining, text mining, social
media mining, web mining, and audio and video mining amongst others.
Data Mining Techniques

1. Clustering

2. Association Rule

3. Classification

4. Prediction

5. Sequential patterns
Clustering
• Clustering is a way to group a set of data points so that similar data points are grouped together

• Clustering algorithms look for similarities or dissimilarities among data points.

• Clustering is an unsupervised learning method, so there is no label associated with the data points. The algorithm tries to find the underlying structure of the data.
Clustering

There are different approaches and algorithms to perform clustering tasks, which can be divided into three sub-categories (a short code sketch follows this list):

1. Partition-based clustering: E.g. k-means, k-median

2. Hierarchical clustering: E.g. Agglomerative, Divisive

3. Density-based clustering: E.g. DBSCAN
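
As a quick, hedged illustration of these three families, the sketch below (assuming scikit-learn is installed; the toy dataset and parameter values are made up for illustration) runs one representative algorithm from each sub-category on the same data:

# One algorithm from each clustering family on the same toy dataset
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)  # partition-based
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)              # hierarchical
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)                    # density-based

print(kmeans_labels[:10], agglo_labels[:10], dbscan_labels[:10])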


Clustering analysis

• Clustering analysis is an unsupervised learning method that separates the data points into several specific groups or clusters, such that data points in the same group have similar properties and data points in different groups have different properties in some sense.

• All clustering methods use the same approach, i.e. first we calculate similarities and then we use them to cluster the data points into groups or batches.
Clustering and Association Rule in Data Mining

• Cluster analysis or clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

• Association rule mining (ARM) is a method for discovering interesting relations between variables in large databases.

• Both clustering and association rule mining are in the field of unsupervised machine learning.

• Clustering is about the data points; ARM is about finding relationships between the attributes of those data points
Clustering and Association Rule
• The problem of finding frequent item sets (ARM) differs from the similarity
search (clustering).

• Both concern big data sets to some extent


Clustering and Association Rule

• Clustering: Given many items (could be text documents, images, people, etc.), find cohesive subsets of items.

• Association rule mining: Given baskets of items (could be text documents, actual supermarket baskets, or other semi-structured objects), find which items inside a basket predict another item in the basket.
Clustering in Data Mining

• Clustering is an unsupervised Machine Learning Algorithm

• A good clustering algorithm aims to obtain clusters in which:

• The intra-cluster similarity is high.

• The inter-cluster similarity is low.

• A cluster can be viewed as a connected region of a multidimensional space with a comparatively high density of objects.


Applications of cluster analysis in data mining

• Data analysis, market research, pattern recognition, and image processing

• Marketers can characterize their customer groups

• Used in tracking applications such as the detection of credit card fraud

• Used to determine plant and animal taxonomies and to categorize genes with the same functionality
Requirements for Clustering

1. Scalability

2. Interpretability

3. Discovery of clusters with attribute shape

4. Ability to deal with different types of attributes

5. Ability to deal with noisy data.

6. High dimensionality
Clustering in Data Mining

Each group is treated as a cluster of data objects

• The first step is to partition the set of data into groups with the help of data similarity, and then the groups are assigned their respective labels.

• The biggest advantage of clustering over classification is that it can adapt to changes in the data
Clustering Methods
• Model-Based Method

• Hierarchical Method

• Constraint-Based Method

• Grid-Based Method

• Partitioning Method

• Density-Based Method
Hierarchical Clustering in Data Mining

• Produces a hierarchical series of nested clusters

• A dendrogram is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view)

• The basic methods to generate a hierarchical clustering are:

1. Agglomerative
2. Divisive
Dendrogram
• A dendrogram is a tree showing hierarchical clustering
• shows the hierarchical relationship between objects.
• commonly created as an output from hierarchical clustering.
• can be used to decide how to allocate objects to clusters
Hierarchical Clustering …

1. Agglomerative method

 is a bottom-up method

Algorithm:

1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar, i.e. closest to each other.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
6. Represent the merge order graphically using a dendrogram (see the sketch below).
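
A minimal sketch of this bottom-up procedure using SciPy (assuming scipy and matplotlib are installed; the six 2-D points and their values are made up for illustration, echoing the A-F example on the next slide):

import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# Six illustrative data points A..F in two dimensions
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [9.2, 1.3]])
labels = ['A', 'B', 'C', 'D', 'E', 'F']

# linkage() repeatedly merges the two closest clusters (Steps 1-5 above)
Z = shc.linkage(X, method='single')

# Step 6: represent the merge order as a dendrogram
shc.dendrogram(Z, labels=labels)
plt.title('Agglomerative clustering (single linkage)')
plt.show()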
Example: Agglomerative
• six data points A, B, C, D, E, F

Agglomerative Hierarchical clustering


Agglomerative clustering (Additive Hierarchical Clustering)
Divisive Clustering

The divisive clustering algorithm is a top-down clustering approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.

Steps of Divisive Clustering (a code sketch follows these steps):

1. Initially, all points in the dataset belong to one single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively to form new clusters until the desired number of clusters is obtained.
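
Common libraries such as SciPy expose agglomerative rather than divisive hierarchical clustering, so the hedged sketch below illustrates the top-down idea with recursive bisection using k-means (splitting the largest cluster is one simple choice; the "least similar" criterion above would instead pick the cluster with the lowest cohesion). The dataset and parameters are illustrative only:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labels = np.zeros(len(X), dtype=int)        # Step 1: everything in one cluster
n_clusters_wanted = 3

while labels.max() + 1 < n_clusters_wanted:
    # Step 2: pick the largest current cluster and split it in two
    largest = np.bincount(labels).argmax()
    mask = labels == largest
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
    # Step 3: keep one half in the old cluster id, give the other half a new id
    new_id = labels.max() + 1
    labels[np.where(mask)[0][split == 1]] = new_id

print(np.bincount(labels))                  # sizes of the resulting clusters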
2. Divisive method
• opposite of the Agglomerative Hierarchical clustering

Divisive Hierarchical clustering


Divisive method …

Splits (divides) the clusters at each step, hence the name divisive hierarchical clustering
ML - Types of Linkages in Clustering

1. Single Linkage

2. Complete Linkage

3. Average Linkage

4. Centroid Linkage
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid Linkage
• In Centroid Linkage, the distance between two clusters is the distance between their centroids (the sketch below compares all four criteria)
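
The hedged sketch below (cluster values made up for illustration) shows how the four linkage criteria measure the distance between two small clusters U and V using NumPy/SciPy:

import numpy as np
from scipy.spatial.distance import cdist

U = np.array([[1.0, 1.0], [1.5, 1.0]])              # cluster U
V = np.array([[4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])  # cluster V

d = cdist(U, V)                          # all pairwise distances between U and V

single   = d.min()                       # Single Linkage: closest pair
complete = d.max()                       # Complete Linkage: farthest pair
average  = d.mean()                      # Average Linkage: mean of all pairs
centroid = np.linalg.norm(U.mean(axis=0) - V.mean(axis=0))  # Centroid Linkage

print(single, complete, average, centroid)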
Clustering Using Single Linkage
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform, pdist

#create data using numpy.random.random_sample


a = np.random.random_sample(size = 5)
b = np.random.random_sample(size = 5)

#create a pandas data frame


point = ['P1','P2','P3','P4','P5']
data = pd.DataFrame({'Point':point, 'a':np.round(a,2), 'b':np.round(b,2)})
data = data.set_index('Point')
Data
Clustering Using Single Linkage: clustering steps

1. Visualize the data using a Scatter Plot

plt.figure(figsize=(8,5))
plt.scatter(data['a'], data['b'], c='r', marker='*')
plt.xlabel('Column a')
plt.ylabel('column b')
plt.title('Scatter Plot of a and b')
for j in data.itertuples():
    plt.annotate(j.Index, (j.a, j.b), fontsize=15)

Scatter Plot of a , b
Step 2: Calculate the distance matrix using the Euclidean method (pdist)

dist = pd.DataFrame(squareform(pdist(data[['a', 'b']], metric='euclidean')), columns=data.index.values, index=data.index.values)

Considering only the lower-triangular values of the matrix (it is symmetric)

Distance Matrix
Step 3: Look for the least distance and merge those points into a cluster

The points P3 and P4 have the least distance (0.30232), so merge these points into a cluster first
Step 4: Re-compute the distance matrix after forming a cluster

Update the distance between the cluster (P3,P4) and P1:

dist((P3,P4), P1) = Min(dist(P3,P1), dist(P4,P1)) = Min(0.59304, 0.46098) = 0.46098

Update the distance between the cluster (P3,P4) and P2:

dist((P3,P4), P2) = Min(dist(P3,P2), dist(P4,P2)) = Min(0.77369, 0.61612) = 0.61612

Update the distance between the cluster (P3,P4) and P5:

dist((P3,P4), P5) = Min(dist(P3,P5), dist(P4,P5)) = Min(0.45222, 0.35847) = 0.35847

(A small code sketch of this update rule follows.)
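
The update rule used above can be written as a small function: after merging clusters i and j, the single-linkage distance from the merged cluster to any other cluster k is Min(dist(i,k), dist(j,k)). This is a hedged sketch (the function name is made up); the example matrix reuses the P1, P3, P4 distances from the slides:

import numpy as np

def merge_single_linkage(D, i, j):
    # Return a reduced distance matrix after merging rows/columns i and j
    keep = [k for k in range(len(D)) if k not in (i, j)]
    merged_row = np.minimum(D[i, keep], D[j, keep])        # Min(dist(i,k), dist(j,k))
    new_D = D[np.ix_(keep, keep)]
    new_D = np.vstack([new_D, merged_row])                 # merged cluster as last row
    new_D = np.column_stack([new_D, np.append(merged_row, 0.0)])
    return new_D

# P1, P3, P4 with the distances from the worked example; merge P3 and P4 (indices 1 and 2)
D = np.array([[0.0,     0.59304, 0.46098],
              [0.59304, 0.0,     0.30232],
              [0.46098, 0.30232, 0.0]])
print(merge_single_linkage(D, 1, 2))   # distance from (P3,P4) to P1 becomes 0.46098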
Updated Distance Matrix
Repeat Steps 3 and 4 until only a single cluster remains

• After re-computing the distance matrix, look for the least distance to make a cluster

The points P2 and P5 have the least distance (0.32388). Group these data points into a cluster and recompute the distance matrix.

Update the distance between the cluster (P2,P5) and P1:

dist((P2,P5), P1) = Min(dist(P2,P1), dist(P5,P1)) = Min(1.04139, 0.81841) = 0.81841

Update the distance between the cluster (P2,P5) and (P3,P4):

dist((P2,P5), (P3,P4)) = Min(dist(P2,(P3,P4)), dist(P5,(P3,P4))) = Min(0.61612, 0.35847) = 0.35847

After recomputing the distance matrix, again look for the least distance.

The cluster (P2,P5) has the least distance to the cluster (P3,P4), 0.35847, so cluster them together.
Update the distance between the cluster (P3,P4,P2,P5) and P1:

dist(((P3,P4),(P2,P5)), P1) = Min(0.46098, 0.81841) = 0.46098

Thus obtaining a single cluster

The clustering steps:

1. P3 and P4 have the least distance and are merged
2. P2 and P5 have the least distance and are merged
3. The clusters (P3,P4) and (P2,P5) are merged
4. The cluster (P3,P4,P2,P5) is merged with the data point P1

Visualize the same using a dendrogram:
plt.figure(figsize=(12,5))
plt.title("Dendrogram with Single linkage")
dend = shc.dendrogram(shc.linkage(data[['a', 'b']], method='single'), labels=data.index)

The length of the vertical lines in the dendrogram shows the distance. For example, the distance between the points P2, P5 is
0.32388.
The step-by-step clustering matches the order shown in the dendrogram
Steps to Perform Hierarchical Clustering

• Calculate similarity – Take the distance between the centroids of these clusters

• The points having the least distance are referred to as similar points and we can
merge them.

• Also referred to as a distance-based algorithm, since we calculate the distances between the clusters

• A proximity matrix stores the distances between each pair of points


Steps to Perform Hierarchical Clustering …

Create a proximity matrix: it shows the distance between each pair of points

Use the Euclidean distance formula to calculate the distances between points

Euclidean distance between point 1 and point 2 (a single feature, with values 10 and 7):

√((10-7)^2) = √9 = 3

This gives a 5 x 5 proximity matrix
Steps to Perform Hierarchical Clustering...

• Step 1: First, assign all the points to an individual cluster:

• Step 2: Look at the smallest distance in the proximity matrix, merge the points with the smallest distance, and update the proximity matrix:

The smallest distance is 3, so we merge points 1 and 2:

The updated clusters

Accordingly, update the proximity matrix:

Again calculate the proximity matrix for these clusters

Step 3: Repeat step 2 until only a single cluster is left.


How to choose the Number of Clusters in Hierarchical Clustering?

• Dendrogram: To get the number of clusters for hierarchical clustering


Choosing the Number of Clusters in Hierarchical Clustering …
Choosing the Number of Clusters : Dendrogram

The number of clusters will be the number of vertical lines intersected by the horizontal line drawn at the chosen threshold (see the sketch below)
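
A minimal sketch (reusing the 'data' frame built earlier; the threshold value is illustrative only): draw the dendrogram together with a horizontal threshold line, and count how many vertical lines the threshold crosses.

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(12, 5))
plt.title('Dendrogram with a cluster threshold')
dend = shc.dendrogram(shc.linkage(data[['a', 'b']], method='single'), labels=data.index)
plt.axhline(y=0.4, color='r', linestyle='--')   # illustrative threshold value
plt.show()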
Hierarchical Clustering in Python
• Dataset: Wholesale customers data.csv (Kaggle) https://www.kaggle.com/binovi/wholesale-customers-data-set
• We will first import the required libraries:
Normalization
Dendrogram to decide the number of Clusters

plt.title("Dendrogram with Single inkage")


dend = shc.dendrogram(shc.linkage(dataset, method='single'),
labels=data.index)
Cut the Dendrogram based on the value of Threshold
Apply hierarchical clustering for 2 clusters:

The output contains 0s and 1s since we have 2 clusters: 0 represents the points that belong to the first cluster and 1 represents the points in the second cluster
Visualize two clusters
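
A hedged end-to-end sketch for this Wholesale customers example (assuming the CSV from the Kaggle link above is saved locally as 'Wholesale customers data.csv'; the column positions used in the final scatter plot, e.g. Milk vs Grocery, are assumptions about that dataset):

import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering

data = pd.read_csv('Wholesale customers data.csv')
dataset = normalize(data)                      # normalization step

# Dendrogram to decide the number of clusters
plt.figure(figsize=(10, 7))
plt.title('Dendrogram with Single linkage')
dend = shc.dendrogram(shc.linkage(dataset, method='single'))

# Cut the tree into 2 clusters
cluster = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = cluster.fit_predict(dataset)          # array of 0s and 1s

# Visualize the two clusters on two of the columns
plt.figure(figsize=(10, 7))
plt.scatter(dataset[:, 3], dataset[:, 4], c=labels)   # assumed: Milk vs Grocery
plt.show()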
Clustering

Clustering is a way to group a set of data points so that similar data points are grouped together.

There are different approaches and algorithms to perform clustering tasks, which can be divided into three sub-categories:

1. Partition-based clustering: E.g. K-Means, K-median

2. Hierarchical clustering: E.g. Agglomerative, Divisive

3. Density-based clustering: E.g. DBSCAN


Density-based clustering
• Used for arbitrarily shaped clusters or for detecting outliers
Density-based clustering …
Density-Based Spatial Clustering of Applications with Noise
(DBSCAN) clustering
DBSCAN - Density-Based Spatial Clustering of Applications with
Noise (distance between nearest points)

• DBSCAN is a well-known data clustering algorithm that is commonly used in data mining and machine
learning

• Is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.

• It groups together points that are close to each other based on a distance measurement (usually Euclidean
distance) and a minimum number of points.

• It also marks as outliers the points that are in low-density regions.

• Is a clustering algorithm that defines clusters as continuous regions of high density and works well if all the
clusters are dense enough and well separated by low-density regions.
Density-based clustering …

• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

• Finds arbitrary shaped clusters and clusters with noise (i.e. outliers).

There are two key parameters of DBSCAN

• eps: The distance that specifies the neighbourhoods. Two points are considered to be neighbours if the distance between them is <= eps (the radius around each point).

• minPts: The minimum number of data points required to define a cluster, i.e. the minimum number of points that should lie around a point within that radius.
Points in DBSCAN

Based on the two parameters eps and minPts, points are classified as follows (a code sketch for identifying each type follows this list):

• Core point: A point is a core point if there are at least minPts number of points
(including the point itself) in its surrounding area with radius eps.

• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPts points within its surrounding area.

• Outlier: A point is an outlier if it is not a core point and not reachable from any
core points.
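
A minimal sketch (assuming scikit-learn; the dataset and parameter values are illustrative) showing how the three point types can be read off a fitted DBSCAN model:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.6, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points
outlier_mask = db.labels_ == -1               # outliers (noise)
border_mask = ~core_mask & ~outlier_mask      # border points

print(core_mask.sum(), border_mask.sum(), outlier_mask.sum())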
Points in DBSCAN …
DBSCAN Example
DBSCAN…

eps=0.6 and
minPts=4
DBSCAN Algorithm …
1. Choose a value for eps and MinPts

2. For a particular data point (x) calculate its distance from every other data point.

3. Find all the neighbourhood points of x which fall inside the circle of radius (eps) or simply whose distance from x is smaller than
or equal to eps.

4. Mark x as visited. If the number of neighbourhood points around x is >= MinPts, then x is a core point; if it is not yet assigned to any cluster, create a new cluster and assign x to it.

5. If the number of neighbourhood points around x is < MinPts and x has a core point in its neighbourhood, treat it as a border point.

6. Include all the density connected points as a single cluster.

7. Repeat the above steps for every unvisited point in the data set and find out all core, border and outlier points.
DBSCAN…
1. Direct density reachable: A point is called direct density reachable if it has a core point in its
neighbourhood. Consider the point (1, 2), it has a core point (1.2, 2.5) in its neighbourhood, hence, it
will be a direct density reachable point.

2. Density Reachable: A point is called density reachable from another point if they are connected
through a series of core points. For example, consider the points (1, 3) and (1.5, 2.5), since they are
connected through a core point (1.2, 2.5), they are called density reachable from each other.

3. Density Connected: Two points are called density connected if there is a core point which is
density reachable from both the points.
Implementation
Use datasets from:

• https://scikit-learn.org/stable/modules/clustering.html#dbscan

• https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

• https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

• Example:
• https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-download-auto-examples-cluster-plot-dbscan-py
Implementation…
Visualize the clusters determined by DBSCAN
Clusters determined by DBSCAN
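
A hedged sketch of applying DBSCAN and visualizing the resulting clusters, loosely in the spirit of the scikit-learn example linked above (the two-moons dataset and the eps/min_samples values are illustrative, not tuned):

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Points labelled -1 are outliers; every other label is a cluster id
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.title('Clusters determined by DBSCAN')
plt.show()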
