Automatic Clustering Algorithms
Automatic clustering algorithms are clustering methods that can be applied without prior knowledge of the
data set. In contrast with other cluster analysis techniques, automatic clustering algorithms can determine
the optimal number of clusters even in the presence of noise and outlier points.[1]
Centroid-based
Given a set of n objects, centroid-based algorithms create k partitions based on a dissimilarity function, such
that k ≤ n. A major problem in applying this type of algorithm is determining the appropriate number of
clusters for unlabeled data, so much of the research in cluster analysis has focused on automating
this choice.
Automated selection of k in k-means clustering, one of the most widely used centroid-based clustering
algorithms, is still a major problem in machine learning. The most widely accepted solution to this problem is the
elbow method. It consists of running k-means clustering on the data set for a range of values of k, calculating
the sum of squared errors for each, and plotting them in a line chart. If the chart looks like an arm, the best
value of k lies at the "elbow".[2]
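The following is a minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available; KMeans exposes the within-cluster sum of squared errors as inertia_, and the synthetic data set is an illustrative assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data whose true number of clusters is treated as unknown.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Run k-means for a range of k and record the sum of squared errors (inertia).
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot SSE against k; the bend ("elbow") of the curve suggests the best k.
plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared errors")
plt.show()
```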
Another method that modifies the k-means algorithm to choose the optimal number of clusters
automatically is the G-means algorithm. It was developed from the hypothesis that the data assigned to each
center follows a Gaussian distribution: k is increased until each center's data passes a test for Gaussianity. The
algorithm only requires a standard statistical significance level as a parameter and does not place limits on the
covariance of the data.[3]
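A rough sketch of the G-means idea follows, assuming scikit-learn and SciPy; the function name gmeans, the minimum cluster size of 8, and the cap on k are illustrative choices, and the Anderson–Darling test on the projection along the split axis stands in for the Gaussianity check described in the paper.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def gmeans(X, crit_index=2, max_k=20):
    """Rough G-means sketch: grow k until the points of each center look
    Gaussian along a candidate split axis (Anderson-Darling test).
    crit_index selects the significance level from SciPy's critical-value
    table (index 2 corresponds to roughly 5%)."""
    centers = np.atleast_2d(X.mean(axis=0))
    while True:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        new_centers = []
        for i, c in enumerate(km.cluster_centers_):
            pts = X[km.labels_ == i]
            if len(pts) < 8:                      # too few points to test
                new_centers.append(c)
                continue
            # Try a 2-way split and project the points onto the axis
            # joining the two child centers.
            child = KMeans(n_clusters=2, n_init=5).fit(pts)
            v = child.cluster_centers_[0] - child.cluster_centers_[1]
            proj = pts @ v / np.linalg.norm(v)
            res = anderson(proj)
            if res.statistic <= res.critical_values[crit_index]:
                new_centers.append(c)             # Gaussian enough: keep center
            else:
                new_centers.extend(child.cluster_centers_)  # accept the split
        if len(new_centers) == len(centers) or len(new_centers) > max_k:
            return km                             # no split happened (or cap hit)
        centers = np.asarray(new_centers)
```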
Connectivity-based (hierarchical clustering)
Hierarchical models can either be divisive, where partitions are built starting from the entire data set, or
agglomerative, where each partition begins with a single object and further objects are merged into the
set.[4] Although hierarchical clustering has the advantage of allowing any valid metric to be used as the
defined distance, it is sensitive to noise and fluctuations in the data set and is more difficult to automate.
Methods have been developed to improve and automate existing hierarchical clustering algorithms,[5] such
as an automated version of single-linkage hierarchical cluster analysis (HCA). This computerized method
relies on a self-consistent outlier-reduction approach, followed by the construction of a descriptive
function that permits defining natural clusters. Discarded objects can then be assigned to these clusters.
Essentially, one need not resort to external parameters to identify natural clusters. The information gathered
from this automated and reliable HCA can be summarized in a dendrogram showing the number of natural clusters
and the corresponding separation, an option not found in classical HCA. The method comprises two
steps: outlier removal (as applied in many filtering applications) and an optional
classification that expands the clusters to the whole set of objects.[6]
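For orientation, here is a minimal example of classical single-linkage HCA with SciPy, not the automated outlier-reduction method described above; the distance cutoff of 1.0 and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Classical single-linkage agglomerative clustering.
Z = linkage(X, method="single")

# A common manual cut of the dendrogram: merges above a distance
# threshold start new clusters (the threshold is chosen by the user).
labels = fcluster(Z, t=1.0, criterion="distance")
print("clusters found:", len(np.unique(labels)))
```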
BIRCH (balanced iterative reducing and clustering using hierarchies) is an algorithm used to perform
connectivity-based clustering on large data sets.[7] It is regarded as one of the fastest clustering algorithms,
but it is limited because it requires the number of clusters as an input. Therefore, new algorithms based on
BIRCH have been developed in which the cluster count does not need to be provided up front, while
preserving the quality and speed of the original. The main modification, referred to as tree-BIRCH, removes the
final step of BIRCH, where the user had to input the cluster count, and improves the rest of the algorithm
by optimizing a threshold parameter derived from the data. In the resulting algorithm, the threshold
parameter is calculated from the maximum cluster radius and the minimum distance between clusters,
which are often known. This method proved efficient for data sets of up to tens of thousands of clusters;
beyond that scale, a supercluster-splitting problem appears. To address it, other algorithms have
been developed, such as MDB-BIRCH, which reduces supercluster splitting while remaining relatively fast.[8]
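As a point of reference, scikit-learn's Birch implementation can already skip the final global-clustering step by setting n_clusters=None; the sketch below is not tree-BIRCH (the threshold is hand-picked rather than derived from the data), and the synthetic data set is an assumption.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=8, random_state=2)

# n_clusters=None skips BIRCH's final global-clustering step, so no
# cluster count is supplied; the granularity of the result is controlled
# by `threshold` (tree-BIRCH would optimize this value from the data).
birch = Birch(threshold=1.5, n_clusters=None).fit(X)
print("subclusters found:", len(birch.subcluster_centers_))
```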
Density-based
Unlike partitioning and hierarchical methods, density-based clustering algorithms can find clusters of
arbitrary shape, not only spherical ones.
Density-based clustering uses autonomous machine learning that identifies patterns from the
geographical location of points and their distance to a specified number of neighbors. It is considered autonomous
because a priori knowledge of what constitutes a cluster is not required.[9] This type of algorithm provides different
methods for finding clusters in the data. The fastest method is DBSCAN, which uses a defined distance to
differentiate between dense groups of points and sparser noise. HDBSCAN can self-adjust
by using a range of distances instead of a single specified one. Lastly, OPTICS creates a reachability
plot based on the distances between neighboring features to separate noise from clusters of varying density.
These methods still require the user to supply parameters such as the neighborhood distance or the minimum
cluster size, so they cannot be considered fully automatic. The Automatic Local Density Clustering Algorithm
(ALDC) is an example of newer research focused on developing automatic density-based clustering. ALDC
computes the local density and distance deviation of every point, which widens the gap between potential
cluster centers and the other points. This makes it possible to identify cluster centers automatically; each
remaining point is then assigned to the cluster of its nearest neighbor of higher density.[10]
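A small comparison of DBSCAN and OPTICS with scikit-learn illustrates the difference between a single fixed distance and a range of distances; the two-moons data set, the eps value, and min_samples are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus noise: a case where density-based
# methods succeed and centroid-based methods struggle.
X, _ = make_moons(n_samples=500, noise=0.08, random_state=3)

# DBSCAN needs a fixed neighborhood radius (eps); points in no dense
# region are labeled -1 (noise).
db = DBSCAN(eps=0.15, min_samples=5).fit(X)

# OPTICS explores a range of radii and extracts clusters from the
# resulting reachability ordering, so no single eps has to be chosen.
op = OPTICS(min_samples=5).fit(X)

for name, labels in [("DBSCAN", db.labels_), ("OPTICS", op.labels_)]:
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"{name}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```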
In the automation of density-based clustering, research has also focused on generating the algorithms
themselves. For instance, estimation of distribution algorithms (EDAs) guarantee the generation
of valid algorithms through a directed acyclic graph (DAG), in which nodes represent procedures (building
blocks) and edges represent possible execution sequences between two nodes. The building blocks determine
the EDA's alphabet or, in other words, any generated algorithm. In experimental results, the automatically
generated clustering algorithms are compared with DBSCAN, a manually designed algorithm.[11]
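To make the DAG representation concrete, the toy sketch below samples one valid execution sequence from a hand-written graph of hypothetical building blocks; a real EDA such as AutoClustering would additionally learn a probability distribution over these choices from the fitness of the candidate algorithms it evaluates.

```python
import random

# Toy illustration of the DAG idea: nodes are building blocks
# (procedures), edges are allowed successors, and any path from
# "start" to "end" is a valid candidate algorithm. The building-block
# names below are made up for illustration.
dag = {
    "start": ["normalize", "select_seeds"],
    "normalize": ["select_seeds"],
    "select_seeds": ["assign_points"],
    "assign_points": ["refine", "end"],
    "refine": ["end"],
}

def sample_algorithm(rng=random):
    """Sample one valid execution sequence (candidate algorithm) from the DAG."""
    node, path = "start", []
    while node != "end":
        node = rng.choice(dag[node])
        path.append(node)
    return path

print(sample_algorithm())
```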
References
1. Outlier.
2. "Using the elbow method to determine the optimal number of clusters for k-means clustering" (https://bl.ocks.org/rpgove/0060ff3b656618e9136b). bl.ocks.org. Retrieved 2018-11-12.
3. Hamerly, Greg; Elkan, Charles (9 December 2003). Sebastian Thrun; Lawrence K. Saul; Bernhard H. Schölkopf (eds.). Learning the k in k-means (https://web.archive.org/web/20221016235553/https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf) (PDF). Proceedings of the 16th International Conference on Neural Information Processing Systems (https://dl.acm.org/doi/proceedings/10.5555/2981345). Whistler, British Columbia, Canada: MIT Press. pp. 281–288. Archived from the original (https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf) (PDF) on 16 October 2022. Retrieved 3 November 2022.
4. "Introducing Clustering II: Clustering Algorithms - GameAnalytics" (https://gameanalytics.com/blog/introducing-clustering-ii-clustering-algorithms.html). GameAnalytics. 2014-05-20. Retrieved 2018-11-06.
5. Almeida, J.A.S.; Barbosa, L.M.S.; Pais, A.A.C.C.; Formosinho, S.J. (June 2007). "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering" (https://core.ac.uk/download/pdf/19123336.pdf) (PDF). Chemometrics and Intelligent Laboratory Systems. Elsevier. 87 (2): 208–217. doi:10.1016/j.chemolab.2007.01.005 (https://doi.org/10.1016%2Fj.chemolab.2007.01.005). hdl:10316/5042 (https://hdl.handle.net/10316%2F5042). Retrieved 3 November 2022.
6. Almeida, J.A.S.; Barbosa, L.M.S.; Pais, A.A.C.C.; Formosinho, S.J. (2007-06-15). "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering" (https://estudogeral.sib.uc.pt//bitstream/10316/5042/1/filec983b44ba0b8489db5983985ef05dfd7.pdf) (PDF). Chemometrics and Intelligent Laboratory Systems. 87 (2): 208–217. doi:10.1016/j.chemolab.2007.01.005 (https://doi.org/10.1016%2Fj.chemolab.2007.01.005). hdl:10316/5042 (https://hdl.handle.net/10316%2F5042). ISSN 0169-7439 (https://www.worldcat.org/issn/0169-7439).
7. Zhang, Tian; Ramakrishnan, Raghu; Livny, Miron (1996-06-01). "BIRCH: an efficient data clustering method for very large databases" (https://doi.org/10.1145%2F235968.233324). ACM SIGMOD Record. 25 (2): 103–114. doi:10.1145/235968.233324 (https://doi.org/10.1145%2F235968.233324). ISSN 0163-5808 (https://www.worldcat.org/issn/0163-5808).
8. Lorbeer, Boris; Kosareva, Ana; Deva, Bersant; Softić, Dženan; Ruppel, Peter; Küpper, Axel (2018-03-01). "Variations on the Clustering Algorithm BIRCH" (https://doi.org/10.1016%2Fj.bdr.2017.09.002). Big Data Research. 11: 44–53. doi:10.1016/j.bdr.2017.09.002 (https://doi.org/10.1016%2Fj.bdr.2017.09.002). ISSN 2214-5796 (https://www.worldcat.org/issn/2214-5796).
9. "How Density-based Clustering works—ArcGIS Pro | ArcGIS Desktop" (http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/how-density-based-clustering-works.htm). pro.arcgis.com. Retrieved 2018-11-05.
10. "An algorithm for automatic recognition of cluster centers based on local density clustering - IEEE Conference Publication". doi:10.1109/CCDC.2017.7978726 (https://doi.org/10.1109%2FCCDC.2017.7978726). S2CID 23267464 (https://api.semanticscholar.org/CorpusID:23267464).
11. "AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication". doi:10.1109/CEC.2012.6252874 (https://doi.org/10.1109%2FCEC.2012.6252874).