Clustering of uninhabitable houses using the optimized apriori algorithm
Corresponding Author:
Al-Khowarizmi
Department of Information Technology, Faculty of CS & IT, Universitas Muhammadiyah Sumatera Utara
Jl. Kapt. Mukhtar Basri No 3, Medan 20238, Indonesia
Email: alkhowarizmi@umsu.ac.id
1. INTRODUCTION
Data mining is an essential technique for supporting artificial intelligence and data science [1], [2]. Data mining has five basic roles: association, clustering, classification, forecasting, and prediction [3], [4]. Each role must be based on a dataset and a model for learning from the data. The learning model is, of course, either supervised or unsupervised [5], [6]. In supervised learning, the data mining roles that can be performed are classification, forecasting, and prediction, while in unsupervised learning the roles that can be performed are association and clustering [7].
Focusing on clustering: clustering is a technique for grouping data based on underlying similarities and differences in the dataset [8]. Its purpose is to divide a dataset into groups whose members share similar characteristics and differ from those of other groups [9]. Clustering does not require training data on data objects [10], so many applications use it. For example, [11] optimized business data to increase the effectiveness and accuracy of smarter business data analysis services by applying clustering techniques, achieving grouping results above 80%. Meanwhile, the research in [12] used a fuzzy clustering algorithm to group student success results and their influencing factors, producing a student work ratio of 96.7%, a student engagement ratio of 97.5%, and a behavior ratio of 95.1%.
Clustering can also be viewed as forming patterns in data so that other methods can exploit them [13]. Many algorithms can solve clustering problems, one of which is Apriori [14]. Apriori is an unsupervised learning algorithm able to solve both association and clustering problems. The research in [14] performed clustering with the Apriori algorithm on 609 medical records of digestive diseases, aiming to explore drug-use rules; the clustering results showed confidence above 0.91 with support greater than 20%, information that would not be revealed without applying a data mining concept such as clustering. Meanwhile, [15] optimized the performance of the Apriori algorithm for clustering on Hadoop, and the results show that a MapReduce-based implementation is superior to the standard Apriori.
Various problems in everyday life can, of course, be solved with the Apriori algorithm [16], so its performance in clustering needs to be analyzed and optimized to produce new knowledge in the form of associations [17]. However, the optimization process must be tested on a dataset. A dataset that strongly reflects everyday problems is Uninhabitable Houses [18], [19]. Uninhabitable Houses are owned by the community but are not fit for habitation, so they need government support in the form of assistance to become Inhabitable Houses. It is therefore necessary to cluster Uninhabitable Houses using an Apriori algorithm optimized on the final result, namely new knowledge based on associations.
Support(A, B) = (T_{A∩B} / T_{Total}) x 100%   (2)

Support(A, B, C) = (T_{A∩B∩C} / T_{Total}) x 100%   (3)
Where,
T_A is the number of transactions containing A,
T_{A∩B} is the number of transactions containing both A and B,
T_{A∩B∩C} is the number of transactions containing A, B, and C,
T_{Total} is the total number of transactions.
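As a minimal sketch of the support calculations in (2) and (3), assuming transactions are represented as sets of item names (the data below is illustrative, not the paper's Table 1):

```python
def support(transactions, itemset):
    """Percentage of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions) * 100

# Four illustrative transactions, so T_Total = 4.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A"},
]
print(support(transactions, {"A", "B"}))       # T_{A∩B} / T_Total x 100% = 50.0
print(support(transactions, {"A", "B", "C"}))  # T_{A∩B∩C} / T_Total x 100% = 25.0
```

The same function covers both (2) and (3), since each is just the fraction of transactions containing the whole itemset.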
b. In calculating confidence, itemset exchange is carried out. For example, a combination of 2 itemsets, A → B, is reversed to become B → A. Likewise, a combination of 3 itemsets, A, B → C, is reversed to become A, C → B and B, C → A. Each itemset's support value may remain the same, but its confidence value will likely differ; the exchange is done to find the largest confidence value for each itemset. The confidence calculation for a combination of 2 itemsets is stated in (4), and for a combination of 3 itemsets in (5) [24].
Confidence(A, B) = T_{A∩B} / T_A   (4)

Confidence(A, B, C) = T_{A∩B∩C} / T_{A∩B}   (5)
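The confidence formulas (4) and (5) can be sketched the same way; the transaction list below is again illustrative, not the paper's data:

```python
def confidence(transactions, antecedent, consequent):
    """T_{antecedent ∪ consequent} / T_{antecedent}: the share of
    transactions containing the antecedent that also contain the
    consequent. With a 1-item antecedent this is (4); with a
    2-item antecedent it is (5)."""
    with_antecedent = [t for t in transactions if set(antecedent) <= t]
    with_both = [t for t in with_antecedent if set(consequent) <= t]
    return len(with_both) / len(with_antecedent)

transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A"},
]
print(confidence(transactions, {"A"}, {"B"}))       # T_{A∩B}/T_A = 3/4 = 0.75
print(confidence(transactions, {"A", "B"}, {"C"}))  # T_{A∩B∩C}/T_{A∩B} = 2/3
```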
The Apriori algorithm is defined as a data mining algorithm that is often used in the association rule method [25]. The Apriori algorithm plays a role in finding high-frequency patterns. High-frequency patterns are
patterns of items whose frequency is above a certain threshold in a database. The stages of Apriori include the following [23]:
– Formation of candidate itemsets. The combinations of (k-1)-itemsets obtained from the previous iteration form the candidate k-itemsets [26].
– Calculation of support for each candidate k-itemset. To measure the number of transactions that contain the items, the support of each candidate is obtained by scanning the database in use. Support is computed using the calculations in (1) and (2).
– High-frequency pattern analysis. High-frequency patterns are determined from the candidate k-itemsets that exceed the minimum support value.
– If no further high-frequency pattern is obtained, the entire process stops.
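The stages above can be sketched as a single loop; the grocery-style transactions and the 2/3 threshold are illustrative placeholders, not the paper's dataset:

```python
# Sketch of the Apriori stages: level-1 frequent items, candidate
# generation by joining (k-1)-itemsets, support counting, pruning
# against a minimum support, and stopping when no pattern survives.
def apriori(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join the previous level's itemsets into k-itemset candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Keep only candidates whose support clears the threshold.
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

txns = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
for itemset in sorted(apriori(txns, 2 / 3), key=sorted):
    print(sorted(itemset))
```

With a 2/3 threshold, the three single items and the pairs {bread, milk} and {bread, butter} survive, while {milk, butter} (support 1/3) and the triple are pruned.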
support and confidence value calculations. The RapidMiner tool is used to obtain the algorithm testing results: collecting item data, forming candidate itemsets, calculating the support of each candidate k-itemset, calculating the confidence values, and finally forming the associations.
The manual calculation using the Apriori algorithm starts from Table 1, which represents the input data recorded when goods transactions occur. The Apriori algorithm processes this data using the formulas above. Table 1 shows an example of the input data used for the data mining process, in the form of a transaction item table for selecting home renovation assistance.
From the representation in Table 1, the frequency pattern is derived from the support values when analyzing data on potential recipients of house renovation assistance; Table 1 also shows the home ownership items. From the patterns detailed in Table 1, the next step is to carry out the calculations of the Apriori algorithm and to test several itemset schemes, detailed as follows:
So,

Support(Gas Stove 3 Kg) = 8/9 x 100% = 88%
Support(Electricity Meter) = 8/9 x 100% = 88%
Support(One's own) = 6/9 x 100% = 66%
Support(Self-Employed) = 6/9 x 100% = 66%
Support(S. Permanent) = 5/9 x 100% = 55%
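These figures follow from truncating each percentage to a whole number (for example, 8/9 x 100 = 88.9 is reported as 88%), which can be checked directly:

```python
# Reproducing the 1-itemset support values above from their raw counts;
# int() truncates the percentage, matching the paper's reported figures.
def support_pct(count, total):
    return int(count / total * 100)

counts = {"Gas Stove 3 Kg": 8, "Electricity Meter": 8,
          "One's own": 6, "Self-Employed": 6, "S. Permanent": 5}
for item, c in counts.items():
    print(item, support_pct(c, 9))  # 88, 88, 66, 66, 55
```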
Table 2 shows the formation of the support values of the 1-itemsets.
So,

Support(Gas Stove 3 Kg, Electricity Meter) = 7/9 x 100% = 77%
Support(Gas Stove 3 Kg, One's own) = 5/9 x 100% = 55%
Support(Gas Stove 3 Kg, Self-Employed) = 6/9 x 100% = 66%
Support(Gas Stove 3 Kg, S. Permanent) = 4/9 x 100% = 44%
Table 3 shows the formation of the support values of the 2-itemsets.
So,

Support(Gas Stove 3 Kg, Electricity Meter, One's own) = 4/9 x 100% = 44%
Support(Gas Stove 3 Kg, Electricity Meter, Self-Employed) = 6/9 x 100% = 66%
Support(Gas Stove 3 Kg, Electricity Meter, S. Permanent) = 4/9 x 100% = 44%
Support(Gas Stove 3 Kg, Electricity Meter, Zinc) = 4/9 x 100% = 44%
Table 4 shows the formation of the support values of the 3-itemsets.
Confidence value (Cf): the association rule search is formed after obtaining the high-frequency patterns, by calculating the confidence value against a predetermined minimum confidence of 0.8 (80%).

Confidence = P(B | A)
Confidence(X, Y) = (number of transactions containing X and Y / number of transactions containing X) x 100%

Confidence(One's own, Gas Stove 3 Kg) = 5/6 x 100% = 83%
Confidence(One's own, Electricity Meter) = 5/6 x 100% = 83%
Confidence(Self-Employed, Wall) = 5/6 x 100% = 83%
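Each of these values comes from an antecedent count of 6 and a pair count of 5; the truncated percentage can be checked as:

```python
# Reproducing the confidence figures above from their raw counts:
# 5/6 x 100 = 83.3, reported as 83% after truncation.
def confidence_pct(pair_count, antecedent_count):
    return int(pair_count / antecedent_count * 100)

print(confidence_pct(5, 6))  # 83
```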
The process of forming association rules using pattern analysis is shown in Table 5.
The association rule search is formed after obtaining the high-frequency patterns from the combinations of 2 items. The confidence equation is applied with a user-determined minimum confidence of 80%; only the 2-itemset values are used to find the association rules. The clusters that form the association rules can be seen in Table 6.
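The filtering step described here can be sketched as follows; the first candidate rule uses support and confidence values computed earlier in the text, while the second row is a hypothetical rule added only to show the thresholds rejecting it:

```python
# Keep only candidate rules whose support and confidence clear the
# paper's thresholds (minimum support 40%, minimum confidence 80%).
MIN_SUPPORT, MIN_CONFIDENCE = 40, 80

# (antecedent, consequent, support %, confidence %)
candidate_rules = [
    ("One's own", "Gas Stove 3 Kg", 55, 83),   # from the worked example
    ("Gas Stove 3 Kg", "S. Permanent", 44, 50), # hypothetical: fails confidence
]
rules = [r for r in candidate_rules
         if r[2] >= MIN_SUPPORT and r[3] >= MIN_CONFIDENCE]
for antecedent, consequent, s, cf in rules:
    print(f"{antecedent} -> {consequent} (support {s}%, confidence {cf}%)")
```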
In this paper, optimization is, of course, carried out using the Apriori algorithm to find association rules for the itemset data patterns, with a minimum support of 40% and a minimum confidence of 80%. On the analysis process page, the next step displayed is the modeling step of the analysis carried out by the system using the attribute item data. The model display using the Apriori method determines the data items resulting from the analysis, as seen in Figure 2.
Figure 2 shows a visualization of the itemset association rules with high frequency values produced by RapidMiner testing, together with the confidence values of the itemset association rules. From Figure 2, an association rule is also formed and tested on the Uninhabitable Houses data shown in Figure 3.
Figure 3 shows the association rules formed after obtaining the high-frequency patterns from the combinations of 2 items, again using the confidence equation with a user-determined minimum confidence of 80% and only the 2-itemset values.
4. CONCLUSION
Based on the analysis of the calculation pattern using the Apriori algorithm, the combination of 2 itemsets tends toward the 3 kg gas stove fuel and the installed electricity meter as the attribute item criteria, with a resulting minimum support value of 77% and a minimum confidence value of 87%. In the data mining testing system for selecting and clustering Uninhabitable Houses, several forms are displayed to process the input attribute item data. Testing in this paper thus shows clustering of Uninhabitable Houses with an Apriori algorithm that is optimized by adding new knowledge in the form of associations for house renovation assistance, with the help of the RapidMiner testing tool.
REFERENCES
[1] Al-Khowarizmi and Suherman, “Classification of skin cancer images by applying simple evolving connectionist system,” IAES
International Journal of Artificial Intelligence, vol. 10, no. 2, pp. 421–429, Jun. 2021, doi: 10.11591/IJAI.V10.I2.PP421-429.
[2] M. E. Al Khowarizmi, Rahmad Syah, Mahyuddin K. M. Nasution, “Sensitivity of MAPE using detection rate for big data forecasting
crude palm oil on k-nearest neighbor,” International Journal of Electrical and Computer Engineering, vol. 11, no. 3, pp. 2696–
2703, 2021, doi: 10.11591/ijece.v11i3.pp2696-2703.
[3] A. Dogan and D. Birant, “Machine learning and data mining in manufacturing,” Expert Systems with Applications, vol. 166, p. 114060, 2021, doi: 10.1016/j.eswa.2020.114060.
[4] N. Maleki, Y. Zeinali, and S. T. A. Niaki, “A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection,” Expert Systems with Applications, vol. 164, p. 113981, 2021, doi: 10.1016/j.eswa.2020.113981.
[5] K. K. Hiran, R. K. Jain, K. Lakhwani, and R. Doshi, Machine Learning: Master Supervised and Unsupervised Learning Algorithms
with Real Examples. BPB Publications, 2021.
[6] S. Bashath, N. Perera, S. Tripathi, K. Manjang, M. Dehmer, and F. E. Streib, “A data-centric review of deep transfer learning with
applications to text data,” Information Sciences (Ny)., vol. 585, pp. 498–528, 2022.
[7] M. Alloghani, D. Al-Jumeily, J. Mustafina, A. Hussain, and A. J. Aljaaf, “A systematic review on supervised and unsupervised
machine learning algorithms for data science,” Supervised and Unsupervised Learning for Data Science, pp. 3–21, 2020.
[8] T. M. Ghazal, “Performances of K-means clustering algorithm with different distance metrics,” Intelligent Automation & Soft
Computing, vol. 30, no. 2, pp. 735–742, 2021.
[9] K. Bandara, C. Bergmeir, and S. Smyl, “Forecasting across time series databases using recurrent neural networks on groups of
similar series: A clustering approach,” Expert Systems with Applications, vol. 140, p. 112896, 2020.
[10] Y. Zhang, C. Song, and D. Zhang, “Deep learning-based object detection improvement for tomato disease,” IEEE access, vol. 8,
pp. 56607–56614, 2020.
[11] N. Wang and N. Wang, “Design of an intelligent processing system for business data design of an intelligent processing clustering
system for business data analysis based on improved algorithm analysis based on improved clustering algorithm,” Procedia
Computer Science, vol. 228, pp. 1215–1224, 2023, doi: 10.1016/j.procs.2023.11.105.
[12] H. Han, “Fuzzy clustering algorithm for university students’ psychological fitness and performance detection,” Heliyon, vol. 9, no.
8, p. e18550, 2023, doi: 10.1016/j.heliyon.2023.e18550.
[13] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, “Chapter 4 - Algorithms: The basic methods,” I. H. Witten, E. Frank, M. A. Hall,
and C. J. B. T.-D. M. (Fourth E. Pal, Eds. Morgan Kaufmann, 2017, pp. 91–160.
[14] J. Wu et al., “A study of TCM master Yan Zhenghua’s medication rule in prescriptions for digestive system diseases based on
Apriori and complex system entropy cluster,” Journal of Traditional Chinese Medical Sciences, vol. 2, no. 4, pp. 241–247, 2015,
doi: 10.1016/j.jtcms.2016.02.007.
[15] S. Singh, R. Garg, and P. K. Mishra, “Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster,”
Computers & Electrical Engineering, vol. 67, pp. 348–364, 2018, doi: 10.1016/j.compeleceng.2017.10.008.
[16] E. Kaya, B. Gorkemli, B. Akay, and D. Karaboga, “A review on the studies employing artificial bee colony algorithm to solve
combinatorial optimization problems,” Engineering Applications of Artificial Intelligence, vol. 115, p. 105311, 2022.
[17] H. Luo et al., “Associations of β-Fibrinogen Polymorphisms with the Risk of Ischemic Stroke: A Meta-analysis.,” Journal of Stroke
and Cerebrovascular Diseases, vol. 28, no. 2, pp. 243–250, Feb. 2019, doi: 10.1016/j.jstrokecerebrovasdis.2018.09.007.
[18] N. Shinohara, K. Hashimoto, H. Kim, and H. Yoshida-Ohuchi, “Fungi, mites/ticks, allergens, and endotoxins in different size
fractions of house dust from long-term uninhabited houses and inhabited houses,” Building and Environment, vol. 229, p. 109918,
2023.
[19] Y. Liu, F. Yu, J. Xu, and P. Xin, “Identification of dangerous rural houses using oblique photogrammetry and photo recognition
technology,” in 2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA),
2023, pp. 70–75.
[20] S. M. Berliana, A. W. Augustia, P. D. Rachmawati, R. Pradanie, F. Efendi, and G. E. Aurizki, “Factors associated with child neglect
in Indonesia: Findings from National Socio-Economic Survey,” Children and Youth Services Review, vol. 106, no. September, p.
104487, 2019, doi: 10.1016/j.childyouth.2019.104487.
[21] Y. Abe, K. Yamada, R. Tanaka, K. Ando, and M. Ueno, “Dynamic living space: toward a society where people can live anywhere
in 2050,” Food Bioprod. Process., p. 105151, 2023, doi: 10.1016/j.futures.2024.103363.
[22] M. Sornalakshmi et al., “Hybrid method for mining rules based on enhanced Apriori algorithm with sequential minimal optimization
in healthcare industry,” Neural Computing and Applications, pp. 1–14, 2020.
[23] R. Papi, S. Attarchi, A. Darvishi Boloorani, and N. Neysani Samany, “Knowledge discovery of Middle East dust sources using
Apriori spatial data mining algorithm,” Ecological Informatics, vol. 72, no. July, p. 101867, 2022, doi:
10.1016/j.ecoinf.2022.101867.
[24] X. Zhang and J. Zhang, “Analysis and research on library user behavior based on apriori algorithm,” Measurement: Sensors, vol.
27, no. April, p. 100802, 2023, doi: 10.1016/j.measen.2023.100802.
[25] E. V. Altay and B. Alatas, “Intelligent optimization algorithms for the problem of mining numerical association rules,” Physica A:
Statistical Mechanics and its Applications, vol. 540, p. 123142, 2020.
[26] C. Wang and X. Zheng, “Application of improved time series Apriori algorithm by frequent itemsets in association rule data mining
based on temporal constraint,” Evolutionary Intelligence, vol. 13, no. 1, pp. 39–49, 2020.
BIOGRAPHIES OF AUTHORS