
Journal of Theoretical and Applied Information Technology

31st July 2021. Vol.99. No 14


© 2021 Little Lion Scientific

ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

THE APPLICATION OF MACHINE LEARNING APPROACH
TO ADDRESS THE GPV BIAS ON POS TRANSACTION

1MUJIONO SADIKIN, 2PURWANTO SK, 3LUTHFIR RAHMAN BAGASKARA

1Faculty Member, Universitas Mercu Buana, Computer Science Faculty, Indonesia
2Faculty Member, Universitas Esa Unggul, Economic & Business Faculty, Indonesia
3BI Engineer, PT. Solusi Teknologi Niaga, Data Analytics Division, Indonesia

E-mail: 1mujiono.sadikin@mercubuana.ac.id, 2purwanto@esaunggul.ac.id, 3Luthfirrahman04@gmail.com

ABSTRACT

Every transaction system produces junk or biased data, whether through error or intent. The volume of junk data grows day by day, especially in public, free-to-use applications. Junk data disrupts decision making and can cause material and immaterial losses. This problem also occurs in the Qasir.id application, a POS application developed by PT. Solusi Teknologi Niaga for MSME entrepreneurs in Indonesia. In the company's case, junk data in POS transactions degrades the quality of GPV (Gross Payment Value) information. This article presents the results of a study on handling POS transaction junk data. The junk data handling is performed by validating three machine learning techniques and deploying the best model in the company's Business Intelligence (BI) system. The qualitative and quantitative evaluations show that the proposed approach provides a significant contribution to the company's decision-making process. The evaluation applied to an operational data sample yields an accuracy in handling junk data of 0.96 precision, 0.73 recall, and an F1-score of 0.831. The qualitative evaluation, based on user feedback after two months of operation, indicates that users were greatly assisted in decision making regarding the GPV.
Keywords: Junk Data, GPV Bias, POS Transaction, Machine Learning, Random Forest, Business Intelligence

1. INTRODUCTION

Apart from its high complexity and large volume, one of the problems of the current Big Data era is junk data and data bias (noise). Junk data is data that contains anomalies and is therefore non-standardized or inconsistent [1]-[4]. Examples of such anomalies are blur in images, non-standardized vocabulary or unnecessary words in text, and background noise in voice data. Junk data is a topic of concern and discussion among researchers owing to its effect on data quality and the accuracy of decision making. Publications discussing junk data address data that does not meet standards [5] and the management of biased image data [6]. In the second study, the image data is used to identify the cleanliness of restrooms to help allocate cleaning service personnel. Junk data on credit card transactions is the subject of [7].

Junk data or noise can arise from any data recording system. The quantity of junk data grows fastest in open data recording systems such as e-commerce and similar platforms. The presence of junk data disrupts the various analyses and reporting needs of an organization [8]: a company's business, employee performance reports, and even the gross value of a start-up company.

These junk data issues also occur at PT Solusi Teknologi Niaga (Qasir.id), a start-up company that develops a point-of-sale application named "Qasir". The point-of-sale (POS) application was developed to assist MSMEs (Micro, Small and Medium Enterprises) in recording their online and offline transactions, managing products, and monitoring transaction reports without paying for application services. Because the application is free, many "trial" transactions are made by merchants, which affects one of the performance indicators at PT Solusi Teknologi Niaga, namely GPV (Gross Payment Value). GPV is the merchant transaction value recorded in the system, but it is not counted as company profit.


The GPV indicator is calculated based on user behavior in using the POS application. There are three categories of user behavior: test, real active, and stop user/slipped away. Since not all behavior categories are used in the calculation of GPV, the user behaviors must be separated to prevent the GPV calculation from being biased.

This paper presents the results of research on managing and analyzing junk data to separate the three user behaviors using a Machine Learning approach. Several open-source Python tools were used in data pre-processing, modeling, and model implementation [9]. The initial stage of the research compares the performance of several algorithms in order to select the best one. The results of this research have been running for two months on a cloud server using the open-source ETL pipeline from Apache. Based on a qualitative evaluation carried out by gathering user opinions, it is concluded that the implementation of the research results is feasible for company operations.

Research related to the management of "junk data" has been widely conducted with different methods and datasets. This section points out several related studies.

The first study concerns the management of a deodorant-and-address dataset, conducted by K. Hima Prasad et al. [5]. The authors investigated a framework to standardize sentences entered by users. The first step is to investigate the dataset by identifying patterns in each row. Next, data segmentation and correction of wrong words are carried out. The RDR framework is used to help correct words automatically in new data. Within this framework, the researchers applied a rule-based method that achieved fairly good performance, with an average precision of 0.6 and recall of 0.6. Shiyang Xuan et al. [7] discussed research on the fraudulent use of credit cards. Their study starts with data labeling performed using predetermined parameters based on the usage history of the user's credit card. The researchers applied two different Random Forest algorithms: RF1 achieved precision and recall of 90.27% and 67.89%, whereas RF2 achieved 89.46% and 95.27%.

The study on noise in images published by Lahiru Jayasinghe et al. [6] used a hygiene image dataset from restrooms, with the PCA (Principal Component Analysis) method for pre-processing and a CNN model as the algorithm. To manage the noise, a color augmentation method is deployed; the PCA method is then applied to the images resulting from the color augmentation process. At the modeling stage the CNN algorithm is applied, and the resulting model predicts dirty, average, or clean restroom categories. The only weakness of this study is that the images come from a single type of restroom.

The next study about data noise was conducted by Etaiwi et al. [10], who applied a dataset published by Myle Ott et al. [11]. The dataset contains many spam and fake reviews, which had an impact on online marketplace behavior. The authors used bag-of-words and word-count methods to detect spam reviews. Four algorithms were compared: naive Bayes, random forests, decision trees, and support vector machines. Accuracy, precision, and recall were used for evaluation, in two stages corresponding to the two feature selections. With word-count feature selection, the best accuracy, precision, and recall come from the naive Bayes algorithm. With bag-of-words feature selection, the best accuracy and recall are shown by naive Bayes, 87.305% and 92.632% respectively, while the best precision is achieved by random forests, 64.784%.

To the best of the authors' knowledge, no studies have dealt with the data disruption of GPV-related POS transactions, even though the issue occurs in the field. Therefore, in this study the authors propose and implement a solution for this purpose. The research is conducted in several stages: dataset labeling based on predetermined parameters [7], [12], comparison of several algorithms [10], implementation of the best model on a cloud server, and quantitative and qualitative evaluation of the model implementation results.

The rest of the paper is organized as follows. Section two presents the material and methods used in the study, including the data processing and computational mechanism. Section three presents the experiment results, the deployment evaluation, and the discussion. The last section presents the conclusion and future work related to the subject.


2. RESEARCH METHOD

This research applied experimental methods, system implementation, and surveys to obtain feedback from users. The experiments were conducted to find the best model, which is then implemented in the BI system. In the experimental stage, three classification algorithms are validated: Random Forests, Decision Trees, and Logistic Regression. The best algorithm from the validation results is then implemented in the BI system. After the BI system has been used for two months, a survey is conducted to obtain feedback on its use.

2.1. Classification

Classification in this research is used to sort out the types of transactions: active user transactions, unsustainable transactions, and merely experimental transactions. Classification is a technique in data mining or machine learning used to classify a dataset based on a label or target class. Algorithms for solving classification problems are therefore categorized as supervised learning, in which the label or target attribute acts as a 'teacher' or 'supervisor' that guides the machine learning process towards a certain level of accuracy or precision [13]. Algorithms that can be used to solve classification problems include Random Forest, C4.5 (better known as Decision Tree), Logistic Regression, Naive Bayes, Deep Learning, and others. D'Urso et al., as presented in publication [14], study MCDM (Multi-Criteria Decision Making) in fuzzy logic to support decision making that can accommodate many complex criteria. The authors propose a fuzzy logic hierarchy method to overcome issues associated with the uncertainty and vagueness of specific decisions in very complex, multi-criteria frameworks. Based on the experiment results, the authors conclude that the method can be improved to reach the optimum solution.

2.2. Decision Tree

Decision Tree is one of the most popular classification methods, as it is easy for humans to interpret. It is a classification method that applies a tree-structure representation: each node represents an attribute, each branch the value of an attribute, and each leaf a class or target. Despite its easy interpretation, the decision tree lacks efficiency in analysis and in its level of accuracy [13], [15]. The decision tree construction is based on the selection of the dataset attribute used as the node at each stage of tree development. The node is chosen based on the information gain computed by formulas (1) and (2):

Gain(S, A) = Entropy(S) - sum_{i=1}^{n} (|S_i| / |S|) * Entropy(S_i)    (1)

Remarks:
S : the set of cases
A : attribute
n : the number of partitions of attribute A
|S_i| : the number of cases in the i-th partition
|S| : the number of cases in S

The entropy is calculated as follows:

Entropy(S) = sum_{i=1}^{n} -p_i * log2(p_i)    (2)

Remarks:
n : the number of partitions of S
p_i : the proportion of S_i against S

2.3. Random Forest

Random Forest is a classification method based on the Decision Tree. The basic concept of this method is to create a collection of trees by randomly selecting attributes. In developing and analyzing the trees, random forest consumes less time per tree because each tree is built from only a few attributes. In many cases, the accuracy of this method is better than that of the Decision Tree method, since the classification result does not depend on one tree but on many trees [16], [17]. Another interesting fuzzy variant, applied in the human resources management area, is ANFIS (Adaptive Neuro-Fuzzy Inference System), proposed by Krichevky et al. [18]. To support decision making in employee candidate selection, the authors proposed a multi-layer decision-making system that combines a neural network with fuzzy logic. The intermediate output of this architecture is a regression equation connecting candidate quality with his/her characteristics. In their publication [19], the authors present the results of using a Random Forest classifier to classify a text dataset in the fishery domain; by tuning its parameters, the best accuracy achieved is 0.95.
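Formulas (1) and (2) can be checked in a few lines of Python. The toy rows, labels, and attribute name below are illustrative only, not from the paper's dataset:

```python
import math

def entropy(labels):
    """Entropy(S) = sum_i -p_i * log2(p_i), with p_i the share of class i in S."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_i (|S_i| / |S|) * Entropy(S_i),
    where the S_i are the subsets of S sharing one value of attribute A."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy data: a perfectly informative attribute recovers the full entropy (1 bit).
rows = [{"a": 0}, {"a": 0}, {"a": 1}, {"a": 1}]
labels = ["test", "test", "active", "active"]
print(entropy(labels))                      # 1.0
print(information_gain(rows, labels, "a"))  # 1.0
```

An attribute that splits the cases into pure subsets yields a gain equal to the full entropy, which is why it would be chosen as the node.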


The class prediction is performed based on the tree votes, computed as in formula (3):

p(c | v) = (1/T) * sum_{t=1}^{T} p_t(c | v)    (3)

Remarks:
p(c|v) : the forest class posterior
T : the size of the forest (number of trees)
p_t(c|v) : the posterior yielded by the leaf of tree t

2.4. Logistic Regression

Logistic Regression is a classification algorithm used for probability prediction by mapping data onto the logit function of the logistic curve. Unlike Linear Regression, which produces a continuous target output, Logistic Regression produces categorical output, computed by formula (4) [20]:

P(Y) = e^(b0 + b1*X1 + b2*X2 + ... + bn*Xn) / (1 + e^(b0 + b1*X1 + b2*X2 + ... + bn*Xn))    (4)

Remarks:
P : the probability of Y occurring
e : the base of the natural logarithm
b0 : the intercept at the y-axis
b1 : the line gradient
bn : the regression coefficient of Xn
X1 : predictor variable

2.5. Dataset

The dataset used in the study is merchant transaction data from the Qasir point-of-sale application. The dataset was obtained by querying a table in the production database and exporting it to a CSV file. Each data instance consists of 34 attributes, and 47,506 instances were collected. All attributes have a numeric type, as in table 1. The "day1" to "day31" attributes are the number of transactions carried out by the merchant on the first to the 31st day of the same month; "month" describes the current month and "year" the current year.

Table 1. Example of Dataset
merchant_id  month  year  day1  ...  day31
213124       8      2019  100   ...  12
192920       9      2019  200   ...  1
828293       10     2019  0     ...  0

2.6. Research Stages

The research stages were divided into six steps. The first is dataset labeling based on predetermined parameters [12], using the count of transactions from 'day 1' to 'day 31'. The second step is the pre-processing stage, applying the Feature Selection method to eliminate unnecessary attributes [21]. The third stage is modeling, performed by comparing three different algorithms: Random Forest, Decision Tree, and Logistic Regression. The fourth stage is the validation of those three algorithms. The fifth stage is the implementation of the best model on the cloud server with the help of the open-source ETL pipeline from Apache. The last stage is the evaluation of the model after its first two months in operation; in this final stage, quantitative and qualitative evaluations were used. These stages are illustrated in figure 1.

Figure 1. Experiment Stages

2.6.1. Data Labelling

Data labeling in this study is carried out using the Python packages pandas and numpy to transform the data based on predetermined parameters [12]. Three label categories are defined: Real Active User, Stop User/Slipped Away, and Testing User. The first step is to count the number of zero values from column day1 to day31. If the count of zeros is less than or equal to 20, the merchant is categorized as an active user; otherwise it is categorized as a Testing User. The data labeled as active users are then separated into real active users and stop users/slipped away by counting the zero values from column day24 to day31: if the count of zeros is more than 7, the merchant is categorized as Stop User/Slipped Away, while the rest are Real Active Users.
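The labelling rule just described (count the zero-valued days in day1-day31, then re-check day24-day31 for the active merchants) can be sketched with pandas. The column names follow Table 1 and the 0/1/2 codes follow the paper's labels, but the function name and demo frame are illustrative:

```python
import pandas as pd

# Hypothetical frame with the Table 1 layout: one row per merchant,
# daily transaction counts in columns day1 ... day31.
day_cols = [f"day{i}" for i in range(1, 32)]
tail_cols = [f"day{i}" for i in range(24, 32)]

def label_merchants(df: pd.DataFrame) -> pd.Series:
    """0 = Testing User, 1 = Real Active User, 2 = Stop User/Slipped Away."""
    zeros_month = (df[day_cols] == 0).sum(axis=1)
    zeros_tail = (df[tail_cols] == 0).sum(axis=1)
    labels = pd.Series(0, index=df.index)       # default: Testing User
    active = zeros_month <= 20                   # at most 20 idle days in the month
    labels[active & (zeros_tail > 7)] = 2        # idle through day24..day31: slipped away
    labels[active & (zeros_tail <= 7)] = 1       # otherwise: real active
    return labels

demo = pd.DataFrame([[0] * 31, [5] * 31, [5] * 23 + [0] * 8], columns=day_cols)
print(list(label_merchants(demo)))  # [0, 1, 2]
```

The three demo rows cover the three cases: a merchant with no transactions at all, a merchant active every day, and a merchant who stops after day 23.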


Figure 2. Data Labelling Composition

The result of dataset labelling is depicted in figure 2. Labels 0, 1, and 2 mean Testing User, Real Active User, and Stop User respectively. The number of dataset instances per label is, consecutively, 28,577 merchants (60.2%) for label 0 and 17,370 merchants (36.6%) for label 1, while the rest are the merchants of label 2.

2.6.2. Data Pre-Processing

Data pre-processing is carried out to improve the quality of the classification results. This process removes unnecessary columns or changes values or objects in the data instances [22]. The pre-processing conducted in this research is Feature Selection, which selects the attributes used for modeling. The attributes removed from the dataset are merchant_id, month, and year, so that the research focuses on the patterns existing in the day1 to day31 attributes, as presented in figure 3.

Figure 3. Feature Selection Process

2.6.3. Modelling

In this research, modeling is done by testing three classification algorithms: Random Forest, Decision Tree, and Logistic Regression. These three algorithms were tested using two scenarios based on the separation of training and testing data. The first scenario is random splitting, which sorts training data from testing data with training:testing compositions of 70:30, 80:20, and 90:10 respectively. The second scenario uses k-fold cross-validation with k values of 5, 10, 15, and 20.

2.6.4. Model Validation

The next stage is model validation, carried out to measure the performance of the three classification algorithms. The accuracy performance of each model is validated using the precision, recall, and F1-score parameters. The validation schemes ensure that the model really performs well in predicting new data. The calculation of the three parameters is based on the confusion matrix [23]-[25], from which the performance indicators TP, FP, TN, and FN are computed, as shown in table 2.

Table 2. Confusion Matrix Definition
Name                 Definition
TP (True Positive)   The number of positive data correctly predicted as positive
FP (False Positive)  The number of negative data incorrectly predicted as positive
TN (True Negative)   The number of negative data correctly predicted as negative
FN (False Negative)  The number of positive data incorrectly predicted as negative

a. Precision
Precision is the ratio of positive data correctly classified as positive to all data classified as positive [10], [24].

precision = TP / (TP + FP)    (5)

b. Recall
Recall is the ratio of positive data correctly classified as positive to all actually positive data [10], [24].

recall = TP / (TP + FN)    (6)
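The two validation scenarios (random 70:30, 80:20, and 90:10 splits, and k-fold cross-validation with k = 5, 10, 15, 20) can be sketched with scikit-learn. The synthetic data below only stands in for the day1-day31 matrix, and the estimator settings are illustrative, not the paper's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 31-day transaction matrix and its 3 labels.
X, y = make_classification(n_samples=600, n_features=31, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

# Scenario 1: random splitting with 70:30, 80:20 and 90:10 compositions.
for test_size in (0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(name, test_size, round(f1_score(y_te, pred, average="macro"), 3))

# Scenario 2: k-fold cross-validation with k = 5, 10, 15, 20.
for k in (5, 10, 15, 20):
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=k, scoring="f1_macro")
        print(name, k, round(scores.mean(), 3))
```

On real data, the loop would be fed the labelled day1-day31 matrix instead of the synthetic one; the macro-averaged F1 matches the multi-class setting with three labels.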


c. F1-Score
F1-Score is twice the product of precision and recall divided by their sum [26].

F1 = 2 * (Precision * Recall) / (Precision + Recall)    (7)

2.6.5. Model Evaluation

The best model obtained from the modeling stage is selected for implementation in the BI system. Evaluation of the model is performed after the BI system has been operating for two months, using two methods: quantitative evaluation and qualitative evaluation. The quantitative evaluation compares the BI predictions against manually performed data labeling; the transaction data used for this evaluation was collected in October 2019. The qualitative evaluation uses survey methods to obtain feedback from the respondents who use BI in daily operation.

3. RESULTS AND DISCUSSION

3.1. Random Splitting Validation Results

The validation results of the random splitting experiments are presented in tables 3, 4, and 5. In all random splitting schemes, Random Forest achieved the best F1-scores of the three algorithms, with scores of 0.893, 0.877, and 0.877 respectively. In terms of processing time, RF performed worst, requiring the longest time in all three splitting schemes.

Table 3. Random Splitting 70:30
Classifier  Precision  Recall  F1     Time(s)
DT          0.876      0.861   0.868  0.25
RF          0.955      0.84    0.893  2.97
LR          0.809      0.876   0.841  2.46

Table 4. Random Splitting 80:20
Classifier  Precision  Recall  F1     Time(s)
DT          0.875      0.868   0.871  0.31
RF          0.942      0.822   0.877  3.55
LR          0.868      0.798   0.831  3.15

Table 5. Random Splitting 90:10
Classifier  Precision  Recall  F1     Time(s)
DT          0.875      0.861   0.867  0.34
RF          0.946      0.818   0.877  4
LR          0.867      0.794   0.828  3.79

3.2. Cross Validation Results

The results of the cross-validation experiments with various k-fold values are presented in tables 6, 7, 8, and 9. The performance results confirm that RF provided the best F1-score. For k-fold 5, 10, and 15, Random Forest achieved F1-scores of 0.836, 0.858, and 0.860 respectively, higher than the other two algorithms. Although for k-fold 20 Random Forest and Decision Tree had the same F1-score of 0.866, on average across all schemes RF remained the best.

Table 6. K-Fold 5 Results
Classifier  Precision  Recall  F1     Time(s)
DT          0.879      0.818   0.832  1.33
RF          0.916      0.796   0.836  13.95
LR          0.878      0.794   0.804  10.33

Table 7. K-Fold 10 Results
Classifier  Precision  Recall  F1     Time(s)
DT          0.898      0.840   0.856  3.10
RF          0.932      0.822   0.858  31.31
LR          0.901      0.817   0.831  23.87

Table 8. K-Fold 15 Results
Classifier  Precision  Recall  F1     Time(s)
DT          0.902      0.841   0.858  6.97
RF          0.933      0.823   0.860  56.33
LR          0.907      0.820   0.836  40.01

Table 9. K-Fold 20 Results
Classifier  Precision  Recall  F1     Time(s)
DT          0.910      0.846   0.866  11.21
RF          0.940      0.827   0.866  81.85
LR          0.914      0.826   0.844  58.06

3.3. Model Implementation

Based on the experiment results, it is concluded that RF on average gave the best F1-score accuracy performance, indicating that RF is more suitable for the characteristics of the POS transaction data. Therefore, the Random Forest model is selected for implementation in a Business Intelligence (BI) application. The model implementation for BI in the operational environment is developed using the open-source ETL pipeline from Apache called Airflow. Airflow is an ETL pipeline with a batching process that uses the Python programming language [27]. For server deployment, a compute engine from the Google Cloud Platform is used [28]. The data warehouse used is Cloud SQL, based on


MySQL [29]. The first process undertaken is to deploy the model into Airflow. Airflow withdraws data from the data source and loads it into the deployed model; the model output is then loaded back into the data warehouse. The implementation scheme can be seen in figure 4.

Figure 4. Model Implementation Schema

At the stage of importing the real data into the model, a Feature Selection process is carried out. Feature selection keeps only the features needed to run the model, while the attributes not used by the model are retained for visualization. The scheme of sorting and combining attributes in the real dataset is presented in figure 5, whereas table 10 presents the dataset output results.

Figure 5. Feature Selection on Deployment Model
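The deployment loop just described (pull a batch, keep only the day1-day31 features for the model, attach the predictions, and keep the other columns for visualization) amounts to a small batch-scoring step. A plain-Python sketch, with the model stub and frame below as illustrative stand-ins for the deployed Random Forest and the warehouse extract:

```python
import pandas as pd

DAY_COLS = [f"day{i}" for i in range(1, 32)]

def score_batch(df: pd.DataFrame, model) -> pd.DataFrame:
    """Feature-select for the model, predict, and keep the remaining columns
    (merchant_id, month, year) for visualization, as in figure 5."""
    features = df[DAY_COLS]              # only what the model was trained on
    out = df.copy()
    out["prediction"] = model.predict(features)
    return out

# Stand-in model: anything exposing predict() works (the deployed Random
# Forest in the paper); here a trivial stub keeps the sketch self-contained.
class ZeroRule:
    def predict(self, X):
        return (X.sum(axis=1) > 0).astype(int)  # 1 if any transactions, else 0

df = pd.DataFrame([[100] * 31, [0] * 31], columns=DAY_COLS)
df.insert(0, "merchant_id", [213124, 828293])
print(score_batch(df, ZeroRule())[["merchant_id", "prediction"]])
```

In the actual pipeline this function body would run inside an Airflow task, with the extract and load steps on either side of it.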


Table 10. Output Data Example
merchant_id  month  year  day1  ...  day31  prediction
213124       8      2019  100   ...  12     1
192920       9      2019  200   ...  1      2
828293       10     2019  0     ...  0      0

3.4. Deployment Model Evaluation

Based on the model evaluation at the experimental stage, the Random Forest algorithm is selected for implementation in a BI application with a model on the cloud server. A portion of the BI application interface is presented in figure 6. The x-axis of the graph in figure 6 represents the month of the transaction, while the y-axis represents the number of transactions. For each month, the BI application presents the three types of transactions according to the implemented model's predictions: Real Active User with a blue line, User Stop with a yellow line, and User Testing with a red line.

Figure 6. Merchants Status Graph

After the random forest model had been in operation for two months, quantitative and qualitative evaluations were carried out. The quantitative evaluation compares the model's predictions with labeling performed manually on transaction data from October. The evaluation takes about 10,000 real data instances analyzed by the model; the prediction results for six thousand of these were verified manually. The model performance based on the manual verification is presented in table 11. The quantitative evaluation applies the confusion matrix as the basis for the precision and recall calculations [30], [31]. In the confusion matrix in table 11, label 0 is the testing transaction, label 1 the active transaction, and label 2 the user-stopped transaction. The model correctly predicted 6570 of class 0 out of 7416, while 342 and 504 were predicted as class 1 and class 2 respectively. The model also correctly predicted 3237 of class 1, with 32 predicted as class 0. For class 2, the model predicted 100% accurately, 216 out of 216. The results of the precision and recall computation are presented in table 12: the precision is still at 0.95, but the recall is down by around 10% to 0.734.

Table 11. Confusion Matrix Result
        Confusion Matrix
Label   0      1      2
0       6570   342    504
1       32     3237   0
2       0      0      216

Table 12. Quantitative Evaluation Result
Precision  Recall  F1-Score
0.958      0.734   0.831

To validate the operational performance of the BI system supported by the selected random forest model, we performed a qualitative analysis. The analysis is carried out by gathering feedback from users regarding the performance of the implemented model. Feedback is obtained by distributing questionnaires to BI users based on the


implemented model. A summary of the questionnaire results is presented in figure 7. In terms of BI user representation, the respondents came from the data custodian, product manager, top management, marketing staff, business staff, financial, and customer satisfaction divisions; most users come from the Data Division and the Product or Development Division. The evaluation parameters include the intensity of BI use per day, the frequency of days using the BI application per week, and the accuracy of the information presented by BI. Before the Machine Learning-based model was used for the dashboard analysis of the BI application, BI was used for an average of 63.15 minutes per day and on at most 2 days per week. After the model was implemented, BI use increased to 120.4 minutes/day. The frequency of BI use per week also increased: 5 days was selected by 8 people, followed by 4 days, 3 days, and 2 days, selected by 6, 5, and 1 person respectively. Finally, regarding the level of accuracy, prior to Machine Learning many respondents selected "rather accurate", but after Machine Learning many selected "Accurate".
Figure 7. Questionnaire Result
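The scores in Table 12 can be reproduced from Table 11 if its rows are read as predicted labels and its columns as manual labels, with precision and recall macro-averaged over the three classes and the F1-score computed from those macro averages. This reading is an assumption, since the paper does not state the averaging explicitly, but it matches the reported values up to rounding:

```python
# Rows: predicted label 0/1/2; columns: manual label 0/1/2 (Table 11).
M = [
    [6570, 342, 504],
    [32, 3237, 0],
    [0, 0, 216],
]

classes = range(3)
row_sums = [sum(M[i]) for i in classes]
col_sums = [sum(M[i][j] for i in classes) for j in classes]

precision = sum(M[i][i] / row_sums[i] for i in classes) / 3
recall = sum(M[j][j] / col_sums[j] for j in classes) / 3
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.959 0.733 0.831
```

The paper reports 0.958 and 0.734, which agrees with this computation to within rounding.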


3.5. Prior and This Work Analysis

GPV bias is an essential issue that has to be addressed so that decision making, especially in terms of investment, can be carried out more precisely. However, to the best of our knowledge, a computational solution to this problem had not been attempted at all, even at the experimental level. In this study we not only conducted experiments to find the best technique for this issue, but also implemented the best Random Forest model in a real operational environment. Based on observations of the model's use in the BI system over a three-month period, the proposed solution is proven able to assist better decision making. The benefit of our proposed solution is confirmed by the feedback collected from respondents in various user roles.

4. CONCLUSION

The research contribution presented in this paper is the creation of a Machine Learning-based BI application that helps improve the quality of decision making in organizations. This research addresses the issue of junk/biased data in POS transactions, which hindered decision making because the GPV information became biased. The GPV bias is caused by noise in the form of trial transactions. The issue is managed by selecting an appropriate machine learning technique as the core engine of the company's BI application. From the model development experiments, it was found that the Random Forest algorithm had the best performance, and the random-forest-based model was then implemented in a BI application. Qualitative and quantitative evaluations of the implementation and use of the BI application showed that the research results provide significant benefits in improving the quality of decision making. This is indicated by user feedback pointing out a positive increase in frequency of use, intensity of use, and speed of decision making.

In future research, we plan to further explore the data generated by POS transactions. Prospects for this data exploration include product/service recommendations, prediction of the quantity and quality of transactions at merchants, and analysis of merchant behavior at the company level.

ACKNOWLEDGMENTS

The authors would like to thank PT. Solusi Teknologi Niaga (Qasir.id) for permitting the use of its data and for the business process knowledge provided. The authors also appreciate the willingness of Mrs. Aulia Permata Sari and Mr. Heri Husaeri Achsan to review this article.