6Vol99No14 2021 JATIT
ABSTRACT
Every transaction system accumulates junk or biased data, whether through error or intent. The volume of junk data grows day by day, particularly in public, free-to-use applications. Junk data disrupts decision making and can cause material or immaterial losses. This problem also occurs in Qasir.id, a POS application developed by PT. Solusi Teknologi Niaga for MSME entrepreneurs in Indonesia. In the company's case, junk data in POS transactions degrades the quality of GPV (Gross Payment Value) information. This article presents the results of a study on handling junk data in POS transactions. The handling is performed by validating three machine learning techniques and deploying the best model in the company's Business Intelligence (BI) system. Qualitative and quantitative evaluations show that the proposed approach contributes significantly to the company's decision-making process. The quantitative evaluation on a sample of operational data yields a precision of 0.96, a recall of 0.73, and an F1 score of 0.831. The qualitative evaluation, based on user feedback after two months of operation, indicates that users were greatly assisted in decision making regarding the GPV.
Keywords: Employee Appraisal, Additional Salary, Employee Performance, Decision Support System,
FIS, Fuzzy Logic
Journal of Theoretical and Applied Information Technology
31st July 2021. Vol.99. No 14
© 2021 Little Lion Scientific
The GPV indicator is calculated based on user behavior in using the POS application. There are three categories of user behavior, namely test, real active, and stop user/slipped away. However, since not all categories of behavior are used in the calculation of GPV, the user behaviors must be separated to prevent the GPV calculation from being biased.

This paper presents the results of research aimed at managing and analyzing junk data to separate the three user behaviors using a Machine Learning approach. Several open-source Python tools were used in data pre-processing, modeling, and model implementation [9]. The initial stage of the research compares the performance of several algorithms in order to find the best one. The results of this research have been implemented for 2 months on a cloud server using the open-source ETL pipeline from Apache. Based on the qualitative evaluation carried out by gathering user opinions, it is concluded that the implemented research results are feasible for use in the company's operations.

Research related to the management of “junk data” has been widely conducted by other researchers, using different methods and datasets. This section points out several studies related to managing “junk data”.

The first study, conducted by K. Hima Prasad et al. [5], concerns the management of a deodorant dataset and address data. In this study, the authors investigated a framework to standardize sentences entered by users. The first step is to investigate the dataset by identifying patterns in each row. Next, data segmentation and correction of wrong words were carried out. The RDR Framework is used to help correct words automatically in new data. In this RDR Framework, the researchers applied a rule-based method that resulted in quite good performance, with an average precision of 0.6 and recall of 0.6. Shiyang Xuan et al. [7] addressed the issue of fraudulent use of credit cards. The study starts with data labeling, performed using predetermined parameters based on the usage history of the user's credit card. The researchers applied 2 different Random Forest algorithms. RF1 provided a precision and recall of 90.27% and 67.89%, whereas RF2 provided a precision and recall of 89.46% and 95.27%.

The study on noise in images published by Lahiru Jayasinghe et al. [6] used a hygiene image dataset from restrooms, with the PCA (Principal Component Analysis) method for pre-processing and a CNN model as its algorithm. To manage the noise, a color augmentation method is deployed. Then the PCA method is applied to the images resulting from the color augmentation process. At the modeling stage the CNN algorithm is applied. This CNN-based model is used to predict dirty, average, or clean restroom categories. The only weakness of this study is that its images come from 1 type of restroom.

The next study about data noise was conducted by Etaiwi et al. [10], who used a dataset published by Myle Ott et al. [11]. The dataset contains many spam reviews and fake reviews, which have an impact on online marketplace behavior. In the study, the authors used bag-of-words and word-count methods to detect spam reviews. Four algorithms were compared, namely naïve Bayes, random forests, decision trees, and support vector machines. Accuracy, precision, and recall were used for the evaluation. There were two stages of evaluation, one for each feature selection method: bag-of-words and word counts. In the evaluation of word-count feature selection, the best accuracy, precision, and recall came from the naïve Bayes algorithm. In the evaluation of bag-of-words feature selection, the best accuracy and recall were shown by naïve Bayes, with 87.305% and 92.632% respectively, while the best precision was achieved by random forests, with 64.784%.

To the best of the authors' knowledge, no studies have dealt with data disruption in GPV-related POS transactions, although the issue occurs in the field. Therefore, in this study the authors propose and implement an approach for this purpose. The research is conducted through several stages, namely dataset labeling based on predetermined parameters [7], [12], comparison of several algorithms [10], implementation of the best model on the cloud server, and quantitative and qualitative evaluation of the model implementation results.

The rest of the paper is organized as follows. Section two discusses the related studies on the topic. The next section, section three, presents the material and method used in the study. The data processing and computational mechanism are also presented in section two. The experiment results, the deployment evaluation, and the discussion are presented in the third section. The last section presents the conclusion and the future work related to the subject.
merely experimental transactions. Classification is a technique in data mining or machine learning used to classify a dataset based on a label or target class. Hence, algorithms or methods for solving classification problems are categorized as supervised learning. In supervised learning, the label or target attribute acts as a ‘teacher’ or ‘supervisor’ that guides the machine learning process in order to achieve a certain level of accuracy or precision [13]. Algorithms or methods that can be used to solve classification problems include Random Forest, C4.5 (better known as Decision Tree), Logistic Regression, Naïve Bayes, Deep Learning, and others.

D’Urso et al., in publication [14], study MCDM (Multi Criteria Decision Making) in fuzzy logic to support decision making that is able to accommodate many complex criteria. The authors propose a fuzzy logic hierarchy method to overcome issues associated with the uncertainty and vagueness of specific decisions in very complex, multi-criteria frameworks. Based on the experiment results, the authors conclude that the method can be improved to get the optimum solution.

2.2. Decision Tree

Decision Tree is one of the most popular classification methods because it is easy for humans to interpret. It is a classification method that applies a tree structure representation, in which each node represents an attribute, each branch represents the value of an attribute, and each leaf represents a class or target. Despite its easy interpretation, the decision tree lacks efficiency in analysis and in its level of accuracy [13], [15]. The decision tree construction is based on the entropy, computed as in formula (1).

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \tag{1}$$

Remarks:
S : the set of cases
A : feature
n : the number of partitions of S
p_i : the proportion of S_i against S

2.3. Random Forest

Random Forest is a classification method like the Decision Tree. The basic concept of this method is to create a collection of trees by randomly selecting attributes. In developing and analysing each tree, random forest consumes less time because the tree created has only a few attributes. In some cases, the accuracy of this method is better than that of the Decision Tree method, as the classification results do not depend on one tree only but on many trees [16], [17]. Another interesting fuzzy variant applied in the human resources management area is ANFIS (Adaptive Neural Fuzzy Inference System), proposed by Krichevsky et al. [18]. To support decision making in employee candidate selection, the authors proposed a multi-layer decision-making system. The multi-layer configuration is a combination of NN and fuzzy logic. The intermediate output of this architecture is a regression equation that connects the candidate's quality with his/her characteristics. In publication [19], the authors present the results of using a Random Forest classifier to classify a text dataset in the fishery domain. By tuning its parameters, the best accuracy achieved is 0.95.
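As an illustration of the entropy in formula (1) of Section 2.2, the following minimal sketch computes the entropy of a list of class labels. The function name `entropy` and the toy label lists are assumptions made for this example; they do not come from the study.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy of a set of cases S, given the class label of each case.

    p_i is the proportion of cases that carry the i-th distinct label.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Toy label sets (illustrative only, not data from the study).
balanced = entropy([0, 0, 1, 1])   # two equally likely classes
pure = entropy([1, 1, 1, 1])       # a single class, no uncertainty
print(balanced, pure)
```

A split that produces purer partitions lowers this value, which is why tree construction favours attributes that reduce it.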
The class prediction is performed based on the tree votes, computed as in formula (2).

$$p(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid v) \tag{2}$$

Remarks:
p(c|v) : forest class
T : size of the forest
p_t(c|v) : the posterior yielded by the leaf of tree t
t : index of a tree in the forest

2.4. Logistic Regression

Logistic Regression is a classification algorithm used for probability prediction by fitting data to the logit function of a logistic curve. Unlike Linear Regression, which produces a target output in the form of continuous data, the output of Logistic Regression is categorical data, computed by formula (3) [20].

$$P(Y) = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n}} \tag{3}$$

Remarks:
P : probability of Y occurring
e : natural logarithm base
b_0 : intercept at the y-axis
b_1 : line gradient
b_n : regression coefficient of x_n
x_1 : predictor variable

2.5. Dataset

The dataset used in the study is merchant transaction data from the Qasir point-of-sales application. The dataset is obtained by querying a table in the production database, then exporting it to a CSV file. Each instance consists of 34 attributes, and 47,506 instances were collected. All attributes have a numeric type, as in table 1. The “day1” to “day31” attributes are the numbers of transactions carried out by the merchant on the first to the 31st day of the same month. “Month” is the current month and “Year” is the current year.

Table 1. Example of Dataset
merchant_id  month  year  day1  …  day31
213124       8      2019  100   …  12
192920       9      2019  200   …  1
828293       10     2019  0     …  0

the predetermined parameters [12] using the calculation of the number of transactions from ‘day 1’ to ‘day 31’. The second step is the pre-processing stage, which applies the Feature Selection method by eliminating some unnecessary attributes [21]. The third stage is the modeling, performed by comparing three different algorithms, namely Random Forest, Decision Tree, and Logistic Regression. The fourth stage is to validate those three algorithms. The algorithm which provided the best results is then implemented on the cloud server. The fifth stage is the implementation of the best model on the cloud server with the help of the open-source ETL pipeline from Apache. The last stage is the evaluation of the model after its first two months of implementation. In this final stage, quantitative and qualitative evaluations were used. The above stages are illustrated in figure 1.

Figure 1. Experiment Stages

2.6.1. Data Labelling

Data labeling in this study is carried out using the Python packages pandas and numpy to transform data based on predetermined parameters [12]. Three categories of labels are defined: Real Active User, Stop User/Slipped Away, and Testing User. The first step is to count how many of the columns day1 to day31 have the value 0. If the number of zeros is less than or equal to 20, the record is categorized as an active user; otherwise it is categorized as Testing User. The records labeled as active user are then separated into Real Active User and Stop User/Slipped Away by counting the zeros in columns day24 to day31. If the number of zeros is more than 7, the record is categorized as Stop User/Slipped Away, while the rest are Real Active Users.
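The labeling rules of Section 2.6.1 can be sketched as follows. The study itself used pandas and numpy; this dependency-free version only illustrates the two thresholds. The numeric label codes (0 for testing, 1 for real active, 2 for stop user) follow the coding given in Section 3.4, while the function names, the helper `make_row`, and the three toy merchants are assumptions made for this example.

```python
from typing import Dict, List

# Label codes as described in the evaluation section of the paper.
TESTING_USER = 0
REAL_ACTIVE_USER = 1
STOP_USER = 2

DAY_COLUMNS = [f"day{d}" for d in range(1, 32)]    # day1 .. day31
LAST_DAYS = [f"day{d}" for d in range(24, 32)]     # day24 .. day31


def label_merchant(row: Dict[str, int]) -> int:
    """Apply the two zero-count thresholds to one merchant-month record."""
    zero_days = sum(1 for col in DAY_COLUMNS if row[col] == 0)
    if zero_days > 20:
        # More than 20 days without transactions: treated as a testing user.
        return TESTING_USER
    zero_last_days = sum(1 for col in LAST_DAYS if row[col] == 0)
    if zero_last_days > 7:
        # No transactions on any of the last 8 days: the user slipped away.
        return STOP_USER
    return REAL_ACTIVE_USER


def make_row(counts: List[int]) -> Dict[str, int]:
    """Hypothetical helper: build a record from 31 daily transaction counts."""
    assert len(counts) == 31
    return dict(zip(DAY_COLUMNS, counts))


# Toy records (illustrative only, not data from the study).
busy = make_row([5] * 31)                 # transactions every day
slipped = make_row([5] * 23 + [0] * 8)    # active, then silent from day24 on
tester = make_row([3] * 5 + [0] * 26)     # only five days with transactions
labels = [label_merchant(r) for r in (busy, slipped, tester)]
print(labels)
```

Checking the 20-zero threshold first mirrors the order in the text: only records that pass as active users are examined for the day24 to day31 window.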
Figure 3. Feature Selection Process

2.6.3. Modelling
In this research, modeling is done by testing three classification algorithms, namely Random Forest, Decision Tree, and Logistic Regression. These three algorithms were tested using two test scenarios based on the separation of training data and testing data. The first data separation scenario is random splitting, which separated training data from testing data with the composition of training data: testing

a. Precision
Precision is the ratio of positive data correctly classified as positive to all data classified as positive, whether correctly or incorrectly [10], [24].

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{4}$$

b. Recall
Recall is the ratio of positive data correctly classified as positive to all actual positive data, that is, the positives classified correctly plus the positives incorrectly classified as negative [10], [24].

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{5}$$

c. F1-Score
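Formulas (4) and (5), together with the F1 score, can be sketched as small functions. The F1 definition used here is the standard harmonic mean of precision and recall; the paper's own formula is not reproduced in this part of the text, so it should be read as an assumption. The function names and the counts in the usage lines are illustrative only.

```python
def precision(tp: int, fp: int) -> float:
    """Formula (4): TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Formula (5): TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (standard definition, assumed here)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts (illustrative only): 90 true positives, 10 false positives,
# 30 false negatives.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
f1 = f1_score(p, r)
print(p, r, f1)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a model cannot reach a high F1 score by maximising only precision or only recall.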
MySQL [29]. The first process undertaken is to deploy the model into Airflow. Airflow withdrew data from the data source and loaded the data into the deployed model. The model is then loaded back into the data warehouse. The implementation scheme can be seen in figure 4.
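The extract, predict, and load flow described above can be sketched in plain Python. This is not the Airflow pipeline used in the study: in-memory SQLite databases stand in for the MySQL data source and the data warehouse, `toy_model` stands in for the deployed classifier, and the table names, the two-day schema, and all function names are assumptions made for this example.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, int, int]  # (merchant_id, day1, day2): a toy two-day schema


def extract(source: sqlite3.Connection) -> List[Row]:
    """Pull the raw transaction rows from the stand-in source database."""
    query = "SELECT merchant_id, day1, day2 FROM transactions ORDER BY merchant_id"
    return source.execute(query).fetchall()


def predict(rows: Sequence[Row],
            model: Callable[[Sequence[int]], int]) -> List[Tuple[int, int]]:
    """Feed each merchant's daily counts to the model and keep its label."""
    return [(merchant_id, model(days)) for merchant_id, *days in rows]


def load(warehouse: sqlite3.Connection,
         predictions: Sequence[Tuple[int, int]]) -> None:
    """Write the predictions back into the stand-in data warehouse."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS merchant_prediction "
        "(merchant_id INTEGER PRIMARY KEY, prediction INTEGER)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO merchant_prediction VALUES (?, ?)", predictions
    )
    warehouse.commit()


def toy_model(days: Sequence[int]) -> int:
    """Placeholder classifier: 1 if any transaction occurred, otherwise 0."""
    return 1 if any(days) else 0


# Wire the three steps together on toy data.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE transactions (merchant_id INTEGER, day1 INTEGER, day2 INTEGER)"
)
source.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(213124, 100, 12), (828293, 0, 0)],
)
warehouse = sqlite3.connect(":memory:")
load(warehouse, predict(extract(source), toy_model))
stored = warehouse.execute(
    "SELECT merchant_id, prediction FROM merchant_prediction ORDER BY merchant_id"
).fetchall()
print(stored)
```

In a scheduler such as Airflow, each of the three functions would typically become one task, with the scheduler handling the ordering and the periodic runs.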
Table 10. Output Data Example
merchant_id  month  year  day1  …  day31  prediction
213124       8      2019  100   …
192920       9      2019  200   …
828293       10     2019  0     …

3.4. Deployment Model Evaluation
Based on the model evaluation at the experimental stage, the Random Forest algorithm is selected to be implemented in a BI application with a model on the cloud server. A portion of the BI application interface is presented in figure 6. The x-axis of the graph in figure 6 represents the month of the transaction, while the y-axis represents the number of transactions. For each month, the BI application presents three types of transactions based on the predictions of the implemented model: Real Active User with a blue line, User Stop with a yellow line, and User Testing with a red line.
After the random forest model had been implemented for 2 months, quantitative and qualitative evaluations were carried out. The quantitative evaluation is conducted by comparing the prediction results with labeling performed manually, using transaction data from October. This evaluation takes about 10000 real data records analyzed by the model. The model's predictions for six thousand of the data were verified manually. The model performance based on the results of manual verification is presented in table 11. The quantitative evaluation uses the Confusion Matrix as the basis for the Precision and Recall calculations [30], [31]. In the Confusion Matrix in table 11, label 0 is the testing transaction, label 1 is the active transaction, and label 2 is the user stopped transaction. The model correctly predicted 6570 of the 7416 class 0 records, while 342 and 504 were predicted as class 1 and class 2, respectively. For class 1, the model correctly predicted 3232 records and predicted 32 as class 0. For class 2, the model predicted all 216 out of 216 records correctly. The results of the precision and recall computation are presented in table 12: the precision is still at 0.95, but the recall is down by around 10% to 0.734.

Table 11. Confusion Matrix Result
Label      0      1      2
0       6570    342    504
1         32   3237      0
2          0      0    216

Table 12. Quantitative Evaluation Result
Precision  Recall  F1-Score
0.958      0.734   0.831

To validate the operational performance of the BI system supported by the selected random forest model, we performed a qualitative analysis. The analysis is carried out by gathering feedback from users regarding the performance of the implemented model. Feedback is obtained by distributing questionnaires to BI users based on the
implemented model. A summary of the questionnaire results is presented in figure 7. In terms of BI user representation, the respondents represented the data custodian, product manager, top management, marketing staff, business staff, financial, and customer satisfaction divisions. The evaluation parameters included the intensity of BI use per day, the number of days per week on which the BI application is used, and the accuracy of the information presented by BI. Most users come from the Data Division and the Product or Development Division. Before the Machine Learning based model was used for the Dashboard Analysis of the BI application, BI was used for 63.15 minutes a day on average, and on at most 2 days a week. After the model was implemented, the use of BI increased to 120.4 minutes/day. The frequency of BI use per week also increased: 5 days was selected by 8 people, followed by 4 days, 3 days, and 2 days, selected by 6, 5, and 1 person, respectively. Finally, regarding the level of accuracy, many respondents selected “rather accurate” for the period before Machine Learning, but many selected “Accurate” for the period after it.
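The step from the confusion matrix in table 11 to the single precision, recall, and F1 figures in table 12 is not spelled out in the text. One reading that comes close to the reported values is macro averaging over the three classes, with the rows of table 11 taken as the predicted classes and the columns as the manually verified labels, and with F1 taken as the harmonic mean of the two averages. All three choices are assumptions made for this sketch, as are the function and variable names.

```python
from typing import List, Tuple


def macro_precision_recall(matrix: List[List[int]]) -> Tuple[float, float]:
    """Macro-averaged precision and recall of a square confusion matrix.

    Assumption: matrix[i][j] counts records predicted as class i whose
    verified label is class j (rows = predicted, columns = actual).
    """
    n = len(matrix)
    precisions, recalls = [], []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[k])                # everything predicted as k
        actual_k = sum(row[k] for row in matrix)    # everything truly of class k
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    return sum(precisions) / n, sum(recalls) / n


# Values copied from table 11.
table_11 = [
    [6570, 342, 504],
    [32, 3237, 0],
    [0, 0, 216],
]
macro_p, macro_r = macro_precision_recall(table_11)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Under these assumptions the sketch gives roughly 0.96 for precision, 0.73 for recall, and 0.83 for F1. If the rows are instead read as the verified labels, the precision and recall values swap while F1 stays the same.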
In the next research, we plan to further explore the data generated by POS transactions. Further prospects of data exploration include product/service recommendation, prediction of the quantity and quality of transactions at merchants, and analysis of merchant behavior at the company level.

ACKNOWLEDGMENTS
The authors would like to thank PT. Solusi Teknologi Niaga (Qasir.id) for permitting the use of its data and of the business process knowledge provided. The authors also appreciate the willingness of Mrs. Aulia Permata Sari and Mr. Heri Husaeri Achsan to review this article.

REFERENCES:
[1] S. Kumar and M. Singh, ‘A novel clustering technique for efficient clustering of big data in Hadoop Ecosystem’, Big Data Min. Anal., vol. 2, no. 4, pp. 240–247, 2019.
[2] S. Krishnan et al., ‘SampleClean: Fast and Reliable Analytics on Dirty Data’, Bull. IEEE Comput. Soc. Tech. Comm. Data Eng., pp. 59–75, 2015.
[3] S. Juddoo, ‘Overview of data quality challenges in the context of Big Data’, 2015 Int. Conf. Comput. Commun. Secur. ICCCS 2015, 2016.
[4] M. Zhou, Y. Wang, A. K. Srivastava, Y. Wu, and P. Banerjee, ‘Ensemble-Based Algorithm for Synchrophasor Data Anomaly Detection’, IEEE Trans. Smart Grid, vol. 10, no. 3, pp. 2979–2988, 2019.
[5] K. H. Prasad, T. A. Faruquie, S. Joshi, S. Chaturvedi, L. V. Subramaniam, and M. Mohania, ‘Data cleansing techniques for large enterprise datasets’, Proc. - 2011 Annu. SRII Glob. Conf. SRII 2011, pp. 135–144, 2011.
[6] L. Jayasinghe, N. Wijerathne, C. Yuen, and M. Zhang, ‘Feature Learning and Analysis for Cleanliness Classification in Restrooms’, IEEE Access, vol. 7, pp. 14871–14882, 2019.
[7] S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang, ‘Random forest for credit card fraud detection’, ICNSC 2018 - 15th IEEE Int. Conf. Networking, Sens. Control, pp. 1–6, 2018.
[8] S. Salloum, J. Z. Huang, and Y. He, ‘Exploring and cleaning big data with random sample data blocks’, J. Big Data, vol. 6, no. 1, p. 45, 2019.
[9] L. Thurner et al., ‘Pandapower - An Open-Source Python Tool for Convenient Modeling, Analysis, and Optimization of Electric Power Systems’, IEEE Trans. Power Syst., vol. 33, no. 6, pp. 6510–6521, 2018.
[10] W. Etaiwi and A. Awajan, ‘The Effects of Features Selection Methods on Spam Review Detection Performance’, Proc. - 2017 Int. Conf. New Trends Comput. Sci. ICTCS 2017, vol. 2018-Janua, no. 2, pp. 116–120, 2018.
[11] M. Ott, C. Cardie, and J. T. Hancock, ‘Negative deceptive opinion spam’, NAACL HLT 2013 - 2013 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. Proc. Main Conf., no. June, pp. 497–501, 2013.
[12] L. Zheng, G. Liu, C. Yan, and C. Jiang, ‘Transaction fraud detection based on total order relation and behavior diversity’, IEEE Trans. Comput. Soc. Syst., vol. 5, no. 3, pp. 796–806, 2018.
[13] M. Sadikin, F. Afiandi, and F. Alfiandi, ‘Comparative Study of Classification Method on Customer Candidate Data to Predict its Potential Risk’, Int. J. Electr. Comput. Eng., vol. 8, no. 6, 2018.
[14] M. G. D’Urso and D. Masi, ‘Multi-Criteria Decision-Making Methods and Their Applications for Human Resources’, in ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015, vol. XL-6/W1, no. June, pp. 31–37.
[15] N. Quadrianto and Z. Ghahramani, ‘A very simple safe-Bayesian random forest’, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 6, pp. 1297–1303, 2015.
[16] I. Ahmad, M. Basheri, M. J. Iqbal, and A. Rahim, ‘Performance Comparison of Support Vector Machine, Random Forest, and Extreme Learning Machine for Intrusion Detection’, IEEE Access, vol. 6, pp. 33789–33795, 2018.
[17] A. Criminisi, J. Shotton, and E. Konukoglu, ‘Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning’, Found. Trends Comput. Graph. Vis., vol. 7, no. 2–3, pp. 81–227, 2011.
[18] M. L. Krichevsky, J. Martunova, and V. Sirotkin, ‘Neuro-fuzzy recruitment system’, Espacios, vol. 38, no. 62, p. 15, 2017.
[19] D. Ramayanti and U. Salamah, ‘Text Classification on Dataset of Marine and Fisheries Sciences Domain using Random Forest Classifier’, Int. J. Comput. Tech., vol. 5, no. 5, pp. 1–7, 2018.
[20] H. Khurshid and M. F. Khan, ‘Segmentation and classification using logistic regression in remote sensing imagery’, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 8, no. 1, pp. 224–232, 2015.
[21] H. Liu, X. Li, and S. Zhang, ‘Learning Instance Correlation Functions for Multilabel Classification’, IEEE Trans. Cybern., vol. 47, no. 2, pp. 499–510, 2017.
[22] B. Vinzamuri, Y. Li, and C. K. Reddy, ‘Pre-processing censored survival data using inverse covariance matrix based calibration’, IEEE Trans. Knowl. Data Eng., vol. 29, no. 10, pp. 2111–2124, 2017.
[23] J. L. García-Balboa, M. V. Alba-Fernández, F. J. Ariza-López, and J. Rodríguez-Avi, ‘Homogeneity test for confusion matrices: A method and an example’, Int. Geosci. Remote Sens. Symp., vol. 2018-July, pp. 1203–1205, 2018.
[24] M. Ohsaki, P. Wang, K. Matsuda, S. Katagiri, H. Watanabe, and A. Ralescu, ‘Confusion-matrix-based kernel logistic regression for imbalanced data classification’, IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1806–1819, 2017.
[25] A. Aksjonov, P. Nedoma, V. Vodovozov, E. Petlenkov, and M. Herrmann, ‘Detection and Evaluation of Driver Distraction Using Machine Learning and Fuzzy Logic’, IEEE Trans. Intell. Transp. Syst., vol. 20, no. 6, pp. 2048–2059, 2019.
[26] Z. C. Lipton, C. Elkan, and B. Naryanaswamy, ‘Thresholding Classifiers to Maximize F1 Score’.
[27] Sally, ‘The Apache Software Foundation Announces Apache® Airflow™ as a Top-Level Project’, 2019.
[28] G. Cloud, ‘Compute Engine’, 2018. [Online]. Available: https://cloud.google.com/compute/. [Accessed: 09-Oct-2018].
[29] ‘Cloud SQL for MySQL documentation’. [Online]. Available: https://cloud.google.com/sql/docs/mysql/features. [Accessed: 14-Oct-2019].
[30] A. Tharwat, ‘Classification assessment methods’, Applied Computing and Informatics, 2018.
[31] B. H. Shekar and G. Dagnew, ‘A Multi-Classifier Approach on L1-Regulated Features of Microarray Cancer Data’, 2018 Int. Conf. Adv. Comput. Commun. Informatics, ICACCI 2018, pp. 1515–1522, 2018.