
Density-Based Spatial Clustering of Applications with Noise

An interactive visualization of DBSCAN clustering: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a popular clustering algorithm used in machine learning and data mining to group points in a dataset that are closely packed together, based on their distance to other points.

DBSCAN works by partitioning the data into dense regions of points that are separated by less dense areas. It defines clusters as areas of the dataset where many points lie close to each other, while points that are far from any cluster are considered outliers or noise.
Consider the following dataset:

Our goal is to cluster these points into groups that are densely packed together. First, count the number of points close to each point. For example, if we start with the yellow point, we draw a circle around it. The radius ε (epsilon) of the circle is the first parameter that we have to determine when using DBSCAN.

After drawing the circle, count the points that fall inside it. For example, for our yellow point, there are 5 close points. Likewise, we count the number of close points for all remaining points.
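To make this counting step concrete, here is a minimal sketch in Python with NumPy; the small points array and the ε value are made-up illustrations, not data from the figures above:

import numpy as np

def count_close_points(points, eps):
    """For each point, count how many other points lie within radius eps."""
    # Pairwise Euclidean distances between all points (n x n matrix).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Points inside the circle of radius eps, excluding the point itself.
    return (dists <= eps).sum(axis=1) - 1

points = np.array([[1.0, 1.2], [1.1, 1.0], [0.9, 1.1], [5.0, 5.0]])
print(count_close_points(points, eps=0.5))  # -> [2 2 2 0]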

Further, determine another parameter, the minimum number of points m. A point is considered a Core Point if it is close to at least m other points. For example, if we take m as 3, then the purple points are considered Core Points, but the yellow one is not, because it does not have enough close points around it.

Then, randomly select a Core Point and assign it as the first point in our first cluster. Other points that are close to this point (i.e., within the circle of the selected point) are also assigned to the same cluster. Further, extend the cluster to the points that are close to those.

Stop the process when no more core points can be assigned to the first cluster. Some points may remain unassigned even though they are close to the first cluster. Draw the circles of these points and see whether they overlap the first cluster. If there is an overlap, put them in the first cluster as well. However, non-core points can only join a cluster; they cannot extend it, and non-core points that are not close to any cluster are considered outliers.

Now, move on to the core points that are not yet assigned. Randomly select another point from them and start over. DBSCAN works sequentially, so it is important to note that non-core points will be assigned to the first cluster that meets the closeness requirement.
Technical specification:
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous
region of high point density, separated from other such clusters by contiguous regions of low
point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in a large amount of data that contains noise and outliers.
The DBSCAN algorithm uses two parameters:
● minPts: The minimum number of points (a threshold) clustered together for a region
to be considered dense.
● eps (ε): A distance measure that will be used to locate the points in the neighborhood
of any point.
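
As a hedged usage sketch, the two parameters map directly onto the eps and min_samples arguments of scikit-learn's DBSCAN; the toy data below is an illustrative assumption, and note that scikit-learn counts the point itself toward min_samples:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # a dense group
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # another dense group
              [4.5, 4.5]])                           # an isolated point

# eps plays the role of ε, min_samples the role of minPts.
model = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(model.labels_)  # -> [0 0 0 1 1 1 -1]; -1 marks noise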

These two parameters can be understood through the notions of reachability and connectivity:


Reachability, in terms of density, establishes that a point is reachable from another if it lies within a particular distance (eps) of it.
Connectivity, on the other hand, involves a transitive chaining approach to determine whether points are located in a particular cluster. For example, points p and q could be connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.
There are three types of points after the DBSCAN clustering is complete (a classification sketch follows this list):
● Core — a point that has at least minPts points within distance eps from itself.
● Border — a point that has at least one Core point within distance eps.
● Noise — a point that is neither a Core nor a Border; it has fewer than minPts points within distance eps from itself.
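A minimal sketch of this three-way classification, assuming a 2-D NumPy array X and writing minPts as min_pts:

import numpy as np

def classify(X, eps, min_pts):
    """Tag each point as 'core', 'border', or 'noise' per the definitions above."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                # boolean neighborhood matrix
    counts = neighbors.sum(axis=1) - 1      # close points, excluding self
    core = counts >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():      # at least one core point within eps
            labels.append("border")
        else:
            labels.append("noise")
    return labels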
Algorithmic steps for DBSCAN clustering (a compact implementation sketch follows these steps):
● The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).
● If there are at least 'minPts' points within a radius of 'ε' of the point, then all of these points are considered part of the same cluster.
● The clusters are then expanded by recursively repeating the neighborhood calculation for each neighboring point.
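
Putting the steps together, a compact from-scratch sketch might look as follows; this is an illustrative implementation, not a production one, and it follows the common convention that the neighborhood count includes the point itself:

import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch: returns one label per point (-1 = noise)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.full(len(X), -1)            # -1 = noise / not yet assigned
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for p in range(len(X)):
        if visited[p]:
            continue
        visited[p] = True
        seeds = np.where(dists[p] <= eps)[0]
        if len(seeds) < min_pts:            # p is not a core point;
            continue                        # it stays noise unless claimed later
        labels[p] = cluster                 # start a new cluster at p
        queue = deque(seeds)
        while queue:                        # expand through core points only
            q = queue.popleft()
            if labels[q] == -1:
                labels[q] = cluster         # border/noise points join the cluster
            if visited[q]:
                continue
            visited[q] = True
            q_neighbors = np.where(dists[q] <= eps)[0]
            if len(q_neighbors) >= min_pts: # q is core: keep expanding from it
                queue.extend(q_neighbors)
        cluster += 1
    return labels

Real implementations avoid the O(n²) distance matrix by querying a spatial index such as a k-d tree, which is where the O(n log n) behavior mentioned later comes from.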

Parameter Estimation:
● minPts: As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make sense, as then every point on its own would already be a cluster. With minPts ≤ 2, the result will be the same as that of hierarchical clustering with the single-link metric, with the dendrogram cut at height ϵ. Therefore, minPts must be chosen to be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very large data sets, for noisy data, or for data that contains many duplicates.
● ϵ: The value for ϵ can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor, ordered from the largest to the smallest value (a sketch of this plot follows the list). Good values of ϵ are where this plot shows an "elbow": if ϵ is chosen much too small, a large part of the data will not be clustered, whereas for a too-high value of ϵ, clusters will merge and the majority of objects will end up in the same cluster. In general, small values of ϵ are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.
● Distance function: The choice of distance function is tightly linked to the choice of ε,
and has a major impact on the outcomes. In general, it will be necessary to first
identify a reasonable measure of similarity for the data set, before the parameter ε can
be chosen. There is no estimation for this parameter, but the distance functions need
to be chosen appropriately for the data set.
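
A sketch of the k-distance heuristic, assuming illustrative Gaussian data; with scikit-learn's NearestNeighbors queried on its own training set, the first neighbor of each point is the point itself, so n_neighbors = minPts yields the (minPts-1)-th nearest neighbor in the last column:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).normal(size=(300, 2))  # illustrative data
min_pts = 2 * X.shape[1]                            # rule of thumb: minPts = 2·dim

nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)              # column 0 is the point itself
k_dists = np.sort(dists[:, -1])[::-1]    # (minPts-1)-th NN distance, descending

plt.plot(k_dists)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbor")
plt.show()  # choose eps near the 'elbow' of this curve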
Comparison of k-means and DBSCAN clustering techniques (advantages of DBSCAN):
1. Versatility: DBSCAN can identify clusters of arbitrary shape, providing versatility beyond traditional clustering algorithms like k-means or hierarchical clustering (see the sketch after this list).
2. Robustness to Noise: It effectively filters out noise points that do not belong to any cluster, aiding data cleaning and outlier detection.
3. No Preset Cluster Count: DBSCAN does not require users to specify the number of clusters beforehand, which is convenient when determining the optimal cluster count is challenging.
4. Efficiency on Large Datasets: With a time complexity of O(n log n) when a suitable spatial index is available, it handles large datasets efficiently, scaling even to millions of points.
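
To illustrate the versatility and noise-handling points above, a quick side-by-side on the classic two-moons dataset (the parameter values are illustrative):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters k-means cannot separate.
X, y_true = make_moons(n_samples=500, noise=0.06, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN typically recovers both moons; k-means splits them with a straight cut.
print("k-means ARI:", adjusted_rand_score(y_true, kmeans_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, dbscan_labels))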

Despite its strengths, DBSCAN also has limitations:


1. Sensitivity to Hyperparameters: Its performance may vary based on hyperparameters
like ϵ and min_samples, requiring trial and error to find optimal values.
2. Challenges with Different Cluster Densities: DBSCAN may struggle with datasets
containing clusters of significantly different densities, as determining a single optimal
eps value becomes challenging.
3. Limitations in High-Dimensional Data: The "curse of dimensionality" affects
DBSCAN's effectiveness on high-dimensional data, potentially necessitating
dimensionality reduction techniques for improved performance.

