DBSCAN.docx
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
Our goal is to cluster these points into groups that are densely packed together. First, count the number of points close to each point. For example, if we start with the yellow point, we draw a circle around it. The radius ε (epsilon) of the circle is the first parameter we have to determine when using DBSCAN.
After drawing the circle, count the points that fall inside it. For example, our yellow point has 5 close points. Likewise, we count the number of close points for every remaining point.
Next, determine another parameter, the minimum number of points m. A point is considered a Core Point if it is close to at least m other points. For example, if we take m as 3, the purple points are Core Points, but the yellow one is not, because it is close to fewer than 3 points.
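The neighbor counting and core-point test just described can be sketched in Python. The toy points and the values ε = 1.0 and m = 3 below are illustrative assumptions, not the actual points from the figure:

```python
import numpy as np

# Illustrative toy points (not the actual points from the figure)
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                   [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])
eps = 1.0   # radius of the circle drawn around each point
m = 3       # minimum number of close points for a Core Point

# Pairwise Euclidean distances between all points
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# For each point, count neighbors within eps, excluding the point itself
neighbor_counts = (dists <= eps).sum(axis=1) - 1
is_core = neighbor_counts >= m
print(neighbor_counts)   # how many close points each point has
print(is_core)           # which points qualify as Core Points
```

Here the first four points each have 3 close points and are Core Points, while the remaining points are not.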
Then, randomly select a Core Point and assign it as the first point of our first cluster. Points close to it (i.e., within the circle of the selected point) are assigned to the same cluster. Further, extend the cluster to the points that are close to those, and so on.
Stop the process when no more core points can be assigned to the first cluster. Some core points may remain unassigned even though they are close to the first cluster. Draw the circles around these points and check whether they overlap the first cluster; if there is an overlap, put them in the first cluster. Non-core points, however, cannot extend a cluster this way; those that remain unassigned are considered outliers.
Now, move on to the core points that are not yet assigned. Randomly select one of them and start the process over for the next cluster. DBSCAN works sequentially, so it is important to note that a non-core point is assigned to the first cluster that meets its closeness requirement.
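The whole expansion procedure above can be sketched from scratch. The toy points, ε, and m are illustrative assumptions, and a deterministic scan order stands in for the random selection of core points:

```python
import numpy as np

def dbscan(points, eps, m):
    """Minimal DBSCAN sketch following the steps above: find core points,
    then grow each cluster outward from an unassigned core point."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # neighbors within eps, excluding the point itself
    neighbors = [np.flatnonzero((dists[i] <= eps) & (np.arange(n) != i))
                 for i in range(n)]
    is_core = np.array([len(nb) >= m for nb in neighbors])
    labels = np.full(n, -1)            # -1 = unassigned / outlier
    cluster = 0
    for start in range(n):             # deterministic order instead of random
        if not is_core[start] or labels[start] != -1:
            continue
        labels[start] = cluster        # seed a new cluster from this core point
        stack = [start]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:     # only core points keep expanding
                        stack.append(q)
        cluster += 1
    return labels

points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                   [5.0, 5.0], [5.1, 5.2], [5.0, 5.3], [4.9, 5.1],
                   [9.0, 9.0]])
labels = dbscan(points, eps=1.0, m=3)
print(labels)
```

On these points the first dense group becomes cluster 0, the second cluster 1, and the isolated last point keeps the label -1, i.e. it is an outlier.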
Technical specification:
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous
region of high point density, separated from other such clusters by contiguous regions of low
point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data that contain noise and outliers.
The DBSCAN algorithm uses two parameters:
● minPts: The minimum number of points (a threshold) clustered together for a region
to be considered dense.
● eps (ε): A distance measure that will be used to locate the points in the neighborhood
of any point.
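These two parameters map directly onto scikit-learn's DBSCAN arguments; a minimal sketch (the data set is an assumed toy example, and note that scikit-learn's min_samples counts the query point itself, whereas m above counts only the neighbors):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# eps: neighborhood radius; min_samples: threshold for a dense region
db = DBSCAN(eps=1.0, min_samples=4).fit(X)
print(db.labels_)    # -1 marks noise points
```

With these values the first four points form one cluster and the remaining three are labeled as noise.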
Parameter Estimation:
● minPts: As a rule of thumb, a minimum value for minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not
make sense, as then every point on its own would already be a cluster. With minPts ≤ 2,
the result will be the same as that of hierarchical clustering with the single-link metric,
with the dendrogram cut at height ε. Therefore, minPts must be at least 3.
However, larger values are usually better for data sets with noise and will yield more
significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be
necessary to choose larger values for very large data, for noisy data, or for data that
contains many duplicates.
● ϵ: The value for ϵ can then be chosen by using a k-distance graph, plotting the
distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest
value. Good values of ϵ are where this plot shows an "elbow": if ϵ is chosen much too
small, a large part of the data will not be clustered; whereas for a too high value of ϵ,
clusters will merge and the majority of objects will be in the same cluster. In general,
small values of ϵ are preferable, and as a rule of thumb, only a small fraction of points
should be within this distance of each other.
● Distance function: The choice of distance function is tightly linked to the choice of ε,
and has a major impact on the outcomes. In general, it will be necessary to first
identify a reasonable measure of similarity for the data set, before the parameter ε can
be chosen. There is no estimation for this parameter, but the distance functions need
to be chosen appropriately for the data set.
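The k-distance heuristic described above can be sketched with scikit-learn's NearestNeighbors; the data set and the minPts value are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two dense blobs plus uniform background noise (illustrative data)
data = np.vstack([rng.normal(0, 0.3, (100, 2)),
                  rng.normal(5, 0.3, (100, 2)),
                  rng.uniform(-2, 7, (20, 2))])

min_pts = 4                          # rule of thumb: 2 * dim for 2-D data
k = min_pts - 1                      # distance to the (minPts - 1)-th neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(data)  # +1: query point itself
dists, _ = nn.kneighbors(data)
k_dist = np.sort(dists[:, -1])[::-1]  # k-distances, largest to smallest

# The "elbow" of this curve is a good candidate for eps; printing a few
# summary values gives a rough picture without plotting
print(k_dist[0], np.median(k_dist), k_dist[-1])
```

In practice one would plot k_dist and read off the ε value at the elbow of the curve.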
Advantages of DBSCAN over K-means clustering:
1. Versatility: DBSCAN can identify clusters of any shape, providing versatility beyond
traditional clustering algorithms like k-means or hierarchical clustering.
2. Robustness to Noise: It effectively filters out noise points that do not belong to any
cluster, aiding in data cleaning and outlier detection.
3. No Preset Cluster Count: DBSCAN does not require users to specify the number of
clusters beforehand, which is convenient in scenarios where determining the optimal
cluster count is challenging (it still requires eps and minPts to be set).
4. Efficiency on Large Datasets: With an average time complexity of O(n log n) when a
suitable spatial index is available (worst case O(n²)), it handles large datasets
efficiently, remaining practical even for datasets with millions of points.
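The shape-versatility point (1) can be illustrated on scikit-learn's two-moons data, where k-means splits the moons incorrectly while DBSCAN recovers them; the parameter values below are assumptions tuned to this toy data set:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moon clusters, a classic non-convex example
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Adjusted Rand index vs. the true moon labels (1.0 = perfect recovery)
km_ari = adjusted_rand_score(y, km.labels_)
db_ari = adjusted_rand_score(y, db.labels_)
print("k-means:", km_ari)
print("DBSCAN: ", db_ari)
```

K-means cuts the moons with a straight boundary, while DBSCAN follows the dense arcs, so its agreement with the true labels is much higher on this data.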