An Unsupervised Machine Learning Algorithms: Comprehensive Review
Received 25 May. 2022, Revised 19 Dec. 2022, Accepted 6 Feb. 2023, Published 16 Apr. 2023
Abstract: Machine learning (ML) is a data-driven strategy in which computers learn from data without human intervention. Outstanding ML applications are found in a variety of areas. In ML, there are three types of learning problems: supervised, unsupervised, and semi-supervised learning. Examples of unsupervised learning techniques and algorithms include the Apriori algorithm, the ECLAT algorithm, the frequent pattern growth algorithm, clustering using k-means, and principal component analysis (PCA). Objects are grouped based on their shared properties. Clustering algorithms are divided into two categories: hierarchical clustering and partition clustering. Many unsupervised learning techniques and algorithms have been created during the last decade, and some of them are well-known and commonly used unsupervised learning algorithms. Unsupervised learning approaches have seen a lot of success in disciplines including machine vision, speech recognition, the creation of self-driving cars, and natural language processing. Unsupervised learning eliminates the requirement for labeled data and manual feature engineering, making standard machine learning approaches more flexible and automated. Unsupervised learning is the topic of this survey report.
http:// journals.uob.edu.bh
E-mail addresses: samreencsit@gmail.com, aqibcsit@gmail.com, sania.anam7@gmail.com, munawar69@iub.edu.pk
912 Samreen Naeem, et al.: An Unsupervised Machine Learning Algorithms: Comprehensive Review.
• Data tagging is a time-consuming operation that necessitates human intervention.

• However, ML may be used to drive the same process, making coding easier for everyone involved.

• It can be used to investigate unknown or unprocessed data.

• It comes in handy when dealing with massive data sets and pattern detection.

A. Literature Review
Many researchers have surveyed unsupervised learning (UL) techniques. The reference [8] surveyed the unsupervised learning literature; their study included 49 studies. UL models are equivalent to supervised learning (SL) models, and Fuzzy C-means and Fuzzy SOMs perform best among UL methods. Their work focused on UL models for software fault prediction. The reference [4] analyzed supervised and UL studies using a literature scan. They prioritized research works published between 2015 and 2018 that address or use supervised and unsupervised ML approaches. This survey only included k-means, hierarchical clustering, and PCA. The reference [9] surveyed UL multiway models, algorithms, and their applications in chemometrics, neurology, social network analysis, text mining, and computer vision. Their survey exclusively analyzed unsupervised multiway data. The reference [10] surveyed the literature; their study surveys time-series clustering techniques. The uniqueness and limitations of earlier studies are also explored, along with prospective research areas. Time-series clustering applications are also listed. Their literature review focuses on time-series clustering approaches. The reference [11] surveys unsupervised and semi-supervised clustering, describing clustering techniques and methodologies. The authors gave external and internal clustering validity measures. Their work helps researchers, although their literature review is limited to algorithms and clustering.

B. Motivation and Contribution
Unsupervised algorithms are extensively employed to complete data mining jobs; they are discussed alone or in groups based on learning needs. Literature studies on supervised algorithms tend to focus little on unsupervised ones. The authors analyzed 35 papers published between 2018 and 2022 and found that the majority focused on unsupervised learning techniques. This review focuses on unsupervised machine learning techniques developed between 2018 and 2022.

2. UNSUPERVISED LEARNING
In supervised learning, a data scientist offers labeled data to the system, such as photographs of cats tagged as cats, so that it may learn by example. In unsupervised learning, a data scientist merely gives photos, and it is up to the system to examine the data and determine whether or not they are cat images. Large amounts of data are required for unsupervised machine learning [12]. In most
Int. J. Com. Dig. Sys. 13, No.1, 911-921 (Apr-23) 913
circumstances, supervised learning works similarly, with the model becoming more accurate as more examples are added. When data scientists use datasets to train algorithms, the unsupervised learning process begins. These datasets include no labeled or classed data points. The purpose of the learning algorithm is to find patterns in the dataset and rate the data points according to those patterns. Clustering, association, anomaly detection, and autoencoder issues are the four types of unsupervised learning challenges, as shown in Figure 2.

time [14], [15]. The workflow of the clustering algorithm is shown in Figure 3.
Figure 3. Workflow of clustering in unsupervised ML
B. ECLAT Algorithm
ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) is a data mining technique used for itemset mining, that is, locating frequent items. Because the Apriori technique utilizes a horizontal data structure, it must scan the database numerous times to find frequently occurring items. ECLAT, on the other hand, takes a vertical approach and is generally faster, since it only has to scan the database once [35].
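To make the vertical, single-scan idea concrete, here is a minimal sketch in Python; the toy transaction database and the min_support threshold are invented for illustration and do not come from the surveyed papers:

```python
from itertools import combinations

# Toy transaction database (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
min_support = 2  # an itemset is "frequent" if it appears in >= 2 transactions

# A single pass over the database builds the vertical layout:
# item -> set of transaction IDs (tidset) that contain it.
tidsets = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Frequent 1-itemsets come straight from the tidset sizes.
frequent = {frozenset([i]): t for i, t in tidsets.items() if len(t) >= min_support}

# The support of a larger itemset is the size of the intersection of its
# members' tidsets, so no further database scans are needed.
for a, b in combinations(sorted(tidsets), 2):
    common = tidsets[a] & tidsets[b]
    if len(common) >= min_support:
        frequent[frozenset([a, b])] = common

for itemset, tids in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), "support =", len(tids))
```

This is why ECLAT trades memory for speed: the intermediate tidsets live in RAM, but support counting becomes set intersection instead of repeated database scans.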
1) Advantages
• Eclat's approach does not require repetitive input scanning to compute individual support values.

• Unlike Apriori, which scans the original dataset, the Eclat algorithm searches the recently created datasets.

2) Disadvantages
• The Eclat algorithm uses more RAM to build intermediate transaction ID sets.

3) Applications
• In the medical profession, for example, patient database analysis.

• In forestry, data from forest fires is used to analyze the frequency and intensity of forest fires [36].

C. Frequent Pattern Growth Algorithm
The Frequent Pattern (FP) Growth algorithm is an improvement on the Apriori algorithm. This algorithm represents the database in the form of a pattern tree, or frequent tree, structure. The most common patterns are extracted using this tree. The Apriori technique must search the database n + 1 times (where n is the length of the longest pattern), but the FP growth approach needs only two scans [37]. The stages of the Frequent Pattern (FP) growth algorithm are as follows:

• Step 1: Run a database scan to count the occurrences of the item sets. This is the same as the first step in the Apriori method. The support count, or frequency, of a set of items is the number of transactions in the database that contain it.

• Step 2: The FP tree is built. Begin by constructing the tree's root, which is labeled null.

• Step 3: Re-scan the database and go over all the transactions. Examine the first transaction to see what items it contains. The highest-count items are taken first, then the lower-count items, and so on; that is, each tree branch is built from a transaction's items in decreasing order of count.

• Step 4: The next transaction in the database is examined. Its item sets are likewise ordered in descending order by count. If a group of items from this transaction already exists in a branch from the root, this transaction's branch shares that common prefix, and the shared item set is linked to the new nodes of the remaining items in this transaction.

• Step 5: Item set counts increase as transactions are processed. As new nodes are created and joined based on transactions, the counts of both the shared and new nodes are incremented by one.

• Step 6: The constructed FP tree must now be mined. The lowest node, as well as the links to it, is evaluated first, since the lowest node represents a frequent pattern of length 1. Then traverse the paths through the FP tree. This path or set of paths is called the conditional pattern base: a secondary database containing the prefix paths in the FP tree that end at the lowest node (the suffix).

• Step 7: Count the sets of items in these paths to create a conditional FP tree. The conditional FP tree considers the collections of items that pass the support criterion.

• Step 8: Create a conditional FP tree by counting the sets of items in each route. The conditional FP tree retains the groups of items that pass the support threshold.

• Step 9: The conditional FP tree generates the frequent patterns.

1) Advantages
• This approach only needs to scan the database twice, compared to Apriori, which examines the transactions for each iteration [38].

• This method avoids candidate item matching, which speeds up the process.

• Extraction of long and short frequent patterns is efficient and scalable, since the database is compressed in memory.

2) Disadvantages
• The FP tree is bulkier and more complicated to build than Apriori's structures, and it might be rather expensive.

• The approach may not fit in shared memory if the database is extensive.

3) Applications
• Clustering, classification, software issue identification, recommendations, and other problems may all be solved with the Frequent Pattern (FP) growth algorithm [38].

D. Clustering using K-Means
In data science, several rounds of the k-means method are commonly utilized. The k-means clustering algorithm divides components into groups based on their similarity. A graphical representation of the k-means clustering workflow is shown in Figure 9.

The letter k denotes the number of groups; as a result, if k is 3, there will be three groupings [39], [40], [41]. This clustering algorithm divides the unlabeled dataset into unique clusters with comparable qualities for each data point.
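The assign-then-update loop that k-means runs can be sketched from scratch with NumPy; the synthetic two-dimensional data, the random seed, and the choice of k = 3 below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three blobs of 30 points each around different centers.
X = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 30, axis=0)
k = 3

# Initialize centroids by picking k distinct points from the dataset.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Assignment step: each point joins the cluster of its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    # (an empty cluster keeps its old centroid).
    new_centroids = np.array(
        [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
         for j in range(k)]
    )
    if np.allclose(new_centroids, centroids):  # converged
        break
    centroids = new_centroids

print("cluster sizes:", np.bincount(labels, minlength=k))
```

The two alternating steps (assign points to the nearest centroid, then recompute centroids as cluster means) are exactly the workflow the figure depicts; production code would typically also repeat the whole procedure from several random initializations and keep the best result.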
Table 1. Comparison of commonly used unsupervised learning algorithms

| Criterion | Apriori | ECLAT | FP Growth | K-Means | PCA |
| Accuracy in general | Satisfactory | Good | Good | Superb | Superb |
| Speed of learning | Good | Excellent | Good | Superb | Superb |
| Speed of classification | Superb | Superb | Good | Excellent | Superb |
| Tolerance to missing values | Good | Excellent | Superb | Superb | Superb |
| Tolerance to irrelevant attributes | Good | Excellent | Satisfactory | Superb | Good |
| Tolerance to redundant attributes | Excellent | Good | Satisfactory | Excellent | Good |
| Tolerance to highly interdependent attributes | Satisfactory | Good | Excellent | Excellent | Satisfactory |
| Tolerance to noise | Excellent | Good | Satisfactory | Superb | Superb |
| Dealing with danger of overfitting | Good | Good | Superb | Excellent | Excellent |
| Attempts for incremental learning | Superb | Satisfactory | Good | Good | Superb |
| Transparency of knowledge/classification | Satisfactory | Good | Excellent | Excellent | Good |
| Support for multi-classification | Good | Excellent | Excellent | Superb | Superb |
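PCA, one of the algorithms compared above, can be sketched from first principles in NumPy. This follows the standard standardize / covariance / eigendecomposition / projection recipe rather than any specific implementation from the surveyed papers, and the toy dataset and the choice of two retained components are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy dataset: 100 samples, 3 features, with feature 2 nearly redundant.
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)

# Eigenvalues and eigenvectors; eigh is appropriate for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by explained variance, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top 2 principal components and project the data onto them.
W = eigvecs[:, :2]
X_reduced = Z @ W

print("explained variance ratio:", np.round(eigvals / eigvals.sum(), 3))
print("reduced shape:", X_reduced.shape)
```

Dropping the trailing eigenvectors is where the dimensionality reduction happens, and also where the information loss discussed below comes from: variance along the discarded directions is simply thrown away.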
The steps below will demonstrate how the PCA approach works:

• Step 1: Obtain the data set.

• Step 2: Represent the data in a structure.

• Step 3: Standardize the data.

• Step 4: The covariance of Z is calculated.

• Step 5: Eigenvalues and eigenvectors are calculated.

• Step 6: The eigenvectors are sorted.

• Step 7: New characteristics, or principal components, are calculated.

• Step 8: Remove characteristics from the new dataset that are less significant or irrelevant.

1) Advantages
• PCA helps us to better generalize machine learning models by lowering the dimensionality of the input. This aids us in overcoming the "curse of dimensionality" [44].

• The calculation is simple. PCA is based on linear algebra, which computers can solve quickly.

• Other machine learning algorithms are sped up: machine learning algorithms trained on the principal components rather than the original dataset converge faster.

• It reduces the challenges associated with high-dimensional data. Regression-based algorithms overfit readily when dealing with high-dimensional data; we avoid overfitting prediction algorithms by utilizing PCA to reduce the size of the training dataset in advance.

2) Disadvantages
• The principal components have low interpretability. Principal components are linear combinations of the original data's features, but they're not easy to understand. For example, it's challenging to identify the dataset's most relevant properties after computing the principal components.

• There is a trade-off between dimensionality reduction and information loss. Reduced dimensionality is beneficial, but it comes at a cost: information loss is an unavoidable aspect of PCA [44].

3) Applications
• PCA is mainly utilized in artificial intelligence applications such as computer vision and image compression, and as a dimensionality reduction approach.

• If the dataset is large enough, it may also be utilized to find hidden patterns. Finance, data mining, psychology, and other areas employ PCA [45].

4. COMPARATIVE ANALYSIS
Here is a comparison of the most popular unsupervised classification algorithms. Several strategies have been created, some of which have been addressed in earlier sections. Based on available facts and theoretical studies, Table 1 compares various regularly used unsupervised algorithms. This comparison demonstrates that no single learning algorithm beats the others.

5. CONCLUSION
Unsupervised learning is one of the many types of machine learning. In unsupervised learning, the model is trained on an unlabeled dataset. Clustering, association, anomaly detection, and autoencoders are among its problem types. Various techniques for unsupervised learning have been presented throughout the last decade. Unsupervised learning has many applications, from intrusion detection to information retrieval, disease diagnosis, and protein sequence search. This literature review focuses on unsupervised learning methodologies and algorithms and the numerous assessment metrics used to evaluate the performance of unsupervised learning models. It also outlines the advantages and disadvantages of each study. This survey report will aid academics in determining which unsupervised learning algorithms or approaches to utilize for problem solving, and which study fields need greater attention. The scope of this research is confined to commonly used unsupervised learning techniques, and only research from the last five years is highlighted. In the future, we may examine more algorithms and methodologies to broaden this survey.
[11] N. Grira, M. Crucianu, and N. Boujemaa, "Unsupervised and semi-supervised clustering: a brief survey," A review of machine learning techniques for processing multimedia content, vol. 1, pp. 9–16, 2004.

[12] R. A. Bantan, A. Ali, S. Naeem, F. Jamal, M. Elgarhy, and C. Chesneau, "Discrimination of sunflower seeds using multispectral and texture dataset in combination with region selection and supervised classification methods," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 30, no. 11, p. 113142, 2020.

[25] M. A. Kabir and X. Luo, "Unsupervised learning for network flow based anomaly detection in the era of deep learning," in 2020 IEEE Sixth International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 2020, pp. 165–168.

[26] K. Kottmann, P. Huembeli, M. Lewenstein, and A. Acín, "Unsupervised phase discovery with deep anomaly detection," Physical Review Letters, vol. 125, no. 17, p. 170603, 2020.

[27] H. Choi, M. Kim, G. Lee, and W. Kim, "Unsupervised learning approach for network intrusion detection system using autoencoders," The Journal of Supercomputing, vol. 75, no. 9, pp. 5597–5621, 2019.

[28] J.-H. Seong and D.-H. Seo, "Selective unsupervised learning-based wi-fi fingerprint system using autoencoder and gan," IEEE Internet of Things Journal, vol. 7, no. 3, pp. 1898–1909, 2019.

[29] A. I. Károly, R. Fullér, and P. Galambos, "Unsupervised clustering for deep learning: A tutorial survey," Acta Polytechnica Hungarica, vol. 15, no. 8, pp. 29–53, 2018.

[30] N. Urs, S. Behpour, A. Georgaras, and M. V. Albert, "Unsupervised learning in images and audio to produce neural receptive fields: a primer and accessible notebook," Artificial Intelligence Review, vol. 55, no. 1, pp. 111–128, 2022.

[31] K. La Marca and H. Bedle, "Deepwater seismic facies and architectural element interpretation aided with unsupervised machine learning techniques: Taranaki basin, new zealand," Marine and Petroleum Geology, vol. 136, p. 105427, 2022.

[35] L. Jia, L. Xiang, and X. Liu, "An improved eclat algorithm based on tissue-like p system with active membranes," Processes, vol. 7, no. 9, p. 555, 2019.

[36] W. Mohamed and M. A. Abdel-Fattah, "A proposed hybrid algorithm for mining frequent patterns on spark," International Journal of Business Intelligence and Data Mining, vol. 20, no. 2, pp. 146–169, 2022.

[37] C.-H. Chee, J. Jaafar, I. A. Aziz, M. H. Hasan, and W. Yeoh, "Algorithms for frequent itemset mining: a literature review," Artificial Intelligence Review, vol. 52, no. 4, pp. 2603–2621, 2019.

[38] L. Gubu, D. Rosadi et al., "Robust mean–variance portfolio selection using cluster analysis: A comparison between kamila and weighted k-mean clustering," Asian Economic and Financial Review, vol. 10, no. 10, pp. 1169–1186, 2020.

mean clustering technique," International Journal of E-Health and Medical Communications (IJEHMC), vol. 10, no. 4, pp. 54–65, 2019.

[42] M. Mateen, J. Wen, S. Song, and Z. Huang, "Fundus image classification using vgg-19 architecture with pca and svd," Symmetry, vol. 11, no. 1, p. 1, 2018.

[43] Y. Dong and S. J. Qin, "A novel dynamic pca algorithm for dynamic data modeling and process monitoring," Journal of Process Control, vol. 67, pp. 1–11, 2018.

[44] S. Ghani, S. Kumari, and A. Bardhan, "A novel liquefaction study for fine-grained soil using pca-based hybrid soft computing models," Sādhanā, vol. 46, no. 3, pp. 1–17, 2021.

[45] D. H. Grossoehme, M. Brown, G. Richner, S. M. Zhou, and S. Friebert, "A retrospective examination of home pca use and parental satisfaction with pediatric palliative care patients," American Journal of Hospice and Palliative Medicine®, vol. 39, no. 3, pp. 295–307, 2022.

AQIB ALI received his Bachelor's degree in Computer (2017) and then completed his M.Phil. degree in Computer Science (2020) at The Islamia University of Bahawalpur, Pakistan. He works as a Lecturer in Computer Science and IT at reputed institutes in Pakistan, and he is now pursuing a Ph.D. degree at Southeast University, China.