Software Defect Prediction Using ML
Abstract: Software defect prediction gives development teams observable outcomes and contributes to industrial results; predicting defective code areas helps developers identify bugs and organize their testing activities. The percentage of classifications that yield a correct prediction is essential for early identification. Moreover, software defect data sets are only partially recognized because of their enormous dimensionality. This problem is handled by a hybridized approach that includes PCA, random forest, naïve Bayes, and SVM. Five datasets, PC3, MW1, KC1, PC4, and CM1, are analyzed using the Weka simulation tool. A systematic analysis is conducted in which the confusion matrix, precision, recall, recognition accuracy, etc. are measured and compared with the prevailing schemes. The analysis indicates that the proposed approach provides more useful solutions for defect prediction.

Keywords: Software defect prediction; machine learning methods; software metrics; defect prediction model; software quality

I. INTRODUCTION
In the past decade, people have come to rely increasingly on software-based systems, in which software quality is regarded as the most critical element of user functionality. Because of the vast production of application software, software quality remains an unresolved problem that yields inadequate output for industrial and private applications. Defect prediction models are commonly used in industry; such models help in predicting faults, estimating testing effort, assessing software reliability, analyzing hazards, etc. during the development stage. A supervised machine learning algorithm is fed a predefined collection of training data. The algorithm then learns from the training dataset and produces rules for predicting the class label of a new data set. The learning phase uses mathematical algorithms to generate and refine the predictor function. The training data used in this process has input attribute values and a defined output value. The output predicted by the ML algorithm is compared with the known output. This is repeated over many iterations of the training data until the optimal prediction accuracy is reached or the maximum number of iterations is exhausted. In unsupervised learning, the class label output value is not known in the data. Instead, a collection of data is loaded into the software, and the algorithm identifies patterns and relationships within it. The main emphasis is on the relationship among data attributes. An example could be identifying a group of friends on a social networking website.

Software quality can be enhanced by predicting defective modules. Defect prediction is the method of designing models that are used in the initial stages of the process to detect defective components such as units or classes. This can be achieved by classifying the modules as defect-prone or not. Different methods are used for this classification, the most common of which are the support vector classifier (SVC), random forest, naive Bayes, decision trees (DT), and neural networks (NN). The modules detected as defect-prone are given high priority in the subsequent testing phases, and the non-defect-prone modules are examined as time and cost permit. Classification establishes the relationship between the attributes and the class label of the training dataset according to the classifier method, and expresses it as rules for categorizing the targets. Those rules are also needed to define the class labels of future datasets. Thus, unknown datasets can be categorized using the classification patterns and a classifier. Defining, finding, and identifying software defects is repetitive work for researchers because of the massive deployment of software. The main goal of a bug prediction model is to categorize the software dataset into defective and non-defective parts. Under this method, the input software dataset is given to the classifier where the user knows the actual class values. Requirement-based and design-based metric methods demonstrated considerable results before this scheme, but the design of algorithms and the accuracy of predictions remain a challenging task.

RELATED WORK
Machine learning is a powerful methodology for prediction; a software defect prediction model was proposed by Wang et al. [3] for the increasing quantity of application software systems. Databases of defective software comprise unbalanced data, which produces random patterns. This problem encourages the creation of an effective and reliable classifier for academic and industrial applications. Xu et al. [4] researched “software defect prediction strategies and hypothesized that traditional techniques use vectorization and feature selection” framework to minimize trivial features, but still exclude other essential features, resulting in degraded performance of the defect prediction strategy. A piece of maximum information, data
Authorized licensed use limited to: UNIVERSITY OF BIRMINGHAM. Downloaded on July 23,2020 at 04:36:29 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Fourth International Conference on Trends in Electronics and Informatics (ICOEI 2020)
IEEE Xplore Part Number: CFP20J32-ART; ISBN: 978-1-7281-5518-0
correlation-based technique is proposed to tackle this problem. Recently, Ryu et al. [5] discussed the unbalanced nature of software defect data, in which very few instances display attributes that belong to the defective class during the prediction process. This imbalance reduces efficiency in the software industry and therefore calls for a specific classification scheme. To resolve this problem, the whole issue is transformed into a multi-objective optimization problem, where a multi-objective learning system is implemented by analyzing a varied cross-project environment. Shan et al. [6] “utilized a well-known methodology in machine learning, i.e. SVM (support vector machine). Besides, predictability in attributes is discussed through the diligence of a locally linear embedding strategy with a support vector classifier. SVM constraints are indeed configured with a tenfold cross-validation process and grid search scheme according to this approach”. Experimental analysis reveals that the LLE-SVM works well for detecting defects.

Yang et al. [7] “proposed the Predicting Software Deficiencies using a neural network method in which the neural network concept is incorporated along with the Bayesian approach as a radial basis. The efficiency of the radial neural network can be enhanced by optimizing the weight update framework, using a single Gaussian and two Gaussian structures, while the motivation-minimization scheme is often employed for weight realization”. Han et al. [8] “as stated in the proposed software development based stable program quality estimation model. Our approach involves an advanced software reliability template, a system building forecast model, a Rayleigh model, and a computer-assisted software safety estimate to boost predictive results”. Parthipan et al. [9] “have presented an analytical model describing the signs of design uncertainty using an aspect-oriented approach for measuring uncertainty. Noticed that defect prediction models are mainly developed in the design phase and code level either to differentiate between unreliable and non-faulty (binary classification) or to estimate the number of defects (regression analysis)”. Panichella et al. [10] “enhanced the recognition of defect-prone instances in software projects through a unified predictor of defects that brings into consideration the clusters provided by different approaches of machine learning”.

Felix et al. [2] have “proposed a study for the prediction of software defects using a machine learning method focused on the neural network. GitHub databases are regarded in this work for the study of defect prediction. A NN is applied with the aid of registry relationships between software codes and their faults, and to obtain classification and prediction. In classification and prediction strategy based on machine learning, feature selection and reduction will increase performance. By referring to that as a significant aspect”. Lu et al. [12] “used a version of the algorithm for self-study, to examine the implementation of a semi-supervised learning technique for software defect prediction. The research concluded that trust fitting could be used as a replacement for existing supervised algorithms. In conjunction with dimensional reduction, the semi-supervised algorithm behaved significantly better than a random forest model when training modules with typical defects were used”.

Shepperd et al. [13] “carried out a meta-study of all the factors affecting output in predictions. As calculated based on the Matthews correlation coefficient, they checked the efficiency of their defect prediction system by evaluating the conditions that strongly impact the predictive results of the software defect classifiers. They noticed that classifier choice affects the output only marginally, while model building factors (i.e. factors specific to the study group) produce a major impact. This is since the group of researchers is in charge of pre-processing the data”. Jayanthi et al. [1] “established a feature selection scheme for the unbalanced dataset of software defects. Next, the selection with attributes built on the wrapper is implemented, culminating in the collection of subsets of attributes. In the next process, random sampling is implemented to help reduce the negative impact of the unbalanced dataset”.

II. RESEARCH METHODOLOGY
The selection was made to include the most common and frequently used machine learning methods. The techniques are listed below together with textual explanations.

A. Naïve Bayes
Naive Bayes is a supervised learning technique used for statistical classification. As the name implies, the approach naively assumes that the attributes of a given class are independent of one another; that is, the features are assumed to be strongly (naively) independent. It is a model that assigns class labels to problem instances, which are represented as vectors of feature values, with the class labels drawn from a finite set. Despite their simplistic nature and assumptions, naive Bayes classifiers are applied to complex real-world problems.

B. Random Forest
The purpose of the algorithm is to create a model capable of predicting the value of a target variable based on several input variables. Each internal node corresponds to one of the input variables, and there are edges to child nodes for each possible value of that variable. Each leaf in the tree represents a value of the target variable given the values of the input variables along the path from the root to the leaf. The learning strategy uses the random forest as a predictive model that maps observations about an item to conclusions about the item's target value. This is a predictive technique used in data mining, statistics, and machine learning.

C. SVC
The learning method analyzes data used for classification and regression analysis. An SVM model represents the samples as points in space, mapped so that the samples of the separate classes are divided by a gap that is as wide as possible. New samples are mapped into the same space and classified according to the side of the gap on which they fall.

An SVM constructs a set of hyperplanes in a high-dimensional space, which can be used for classification and regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training points of any class, termed the functional margin. In general, the larger the functional margin, the lower the generalization error.
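To make the margin idea in the SVC description concrete, the following is a minimal sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is an illustration under stated assumptions, not the configuration used in this study: the two-feature toy data, the function names (`train_linear_svm`, `predict`, `geometric_margin`), and the hyperparameter values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(
    X: Sequence[Vector],
    y: Sequence[int],
    lam: float = 0.01,   # assumed regularization strength
    lr: float = 0.01,    # assumed learning rate
    epochs: int = 2000,  # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:
                # Sample is inside the margin or misclassified:
                # the hinge term contributes to the subgradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely outside the margin: only shrink w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Classify x by the side of the hyperplane on which it falls."""
    return 1 if dot(w, x) + b >= 0.0 else -1


def geometric_margin(w: Vector, b: float, X: Sequence[Vector], y: Sequence[int]) -> float:
    """Smallest signed distance from any training sample to the hyperplane."""
    norm = math.sqrt(dot(w, w))
    return min(yi * (dot(w, xi) + b) for xi, yi in zip(X, y)) / norm


# Hypothetical, linearly separable toy data: two scaled module metrics per row,
# label +1 for defect-prone modules and -1 for non-defect-prone modules.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
margin = geometric_margin(w, b, X, y)
print("weights:", w, "bias:", b)
print("predictions:", predictions, "margin:", margin)
```

A new module is classified by calling `predict(w, b, x)` with its metric vector. The sketch uses a plain linear kernel and a fixed learning rate for brevity; a practical study would more likely rely on a library implementation with tuned parameters.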
Fig 1. Machine learning techniques classification
Fig 2. Software defect prediction model
Data retrieval (referred to as AT in this survey) and numeric defect level KC1 classes were utilized.

This study uses the PROMISE datasets available on Kaggle, named KC1, CM1, PC3, MW1, and PC4, each of which has different attributes. The tables show various metrics for the datasets considered: attribute count, usable components, faulty components, and defective percentage. Through these datasets, a software deficiency prediction ... PC4, MW1, and CM1. The algorithms chosen for analysis are Naive Bayes, Random Forest, SVC (Linear Regression), and Neural Network.

The data sets are collected in ARFF form from the Kaggle database, assisted by the RStudio tool. Therefore, the data preprocessing step must be run to render the data sets consistent with it. The test outcome summary is shown in the table below. It shows the accuracy value of each method (percentage agreement). The highest value for each dataset is highlighted to distinguish it from the others.

TABLE I. PERFORMANCE EVALUATION FOR SVC
Dataset  Precision  Recall  F1-measure  Support  Acc
The table above describes the standard deviation of the same test from the predicted expected defect. The neural network method has the lowest failure rate. The failure rate helps to break ties: when two algorithms tie on defect prediction accuracy, the one with the lower error is considered the more accurate.

Dataset  Precision  Recall  F1-measure  Support  Acc
KC1      T-0.99     F-1.00  F-1.00      F-117    0.96
         F-1.00     T-0.88  T-0.93      T-8
CM1      T-0.98     N-0.96  N-0.97      N-104    0.97
         F-0.86     Y-0.93  Y-0.89      Y-27
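The precision, recall, F1-measure, support, and accuracy columns in a table like the one above can be derived from the true and predicted class labels. The following minimal Python sketch shows one way to compute them per class. The labels and predictions are a small hypothetical example and do not reproduce the reported results; `class_metrics` and `accuracy` are illustrative helper names.

```python
from dataclasses import dataclass
from typing import Hashable, Sequence


@dataclass
class ClassMetrics:
    precision: float
    recall: float
    f1: float
    support: int


def class_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable
) -> ClassMetrics:
    """Per-class metrics, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Support is the number of true instances of the class.
    return ClassMetrics(precision, recall, f1, support=tp + fn)


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical example: "T" marks a defective module, "F" a non-defective one.
y_true = ["T"] * 4 + ["F"] * 6
y_pred = ["T", "T", "T", "F"] + ["F"] * 5 + ["T"]

report = {label: class_metrics(y_true, y_pred, label) for label in ("T", "F")}
acc = accuracy(y_true, y_pred)

for label, m in report.items():
    print(label, round(m.precision, 2), round(m.recall, 2), round(m.f1, 2), m.support)
print("accuracy:", acc)
```

Reporting the metrics per class matters for defect data because the defective class is usually the minority: a classifier can reach high accuracy while still missing most defective modules, which shows up as low recall for that class.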
Fig 6. Overall algorithm classification on dataset.

The results indicate that the neural network has the lowest failure rate in the study, followed by random forest. The greatest detection rate, however, is the dimensional classification. When two methods tie on prediction accuracy, the failure rate parameter can be used to determine the correct outcome.

REFERENCES
[1] Jayanthi, R. and Florence, L., 2019. Software defect prediction techniques using metrics based on neural network classifiers. Cluster Computing, 22(1), pp. 77-88.
[2] Felix, E.A. and Lee, S.P., 2017. Integrated approach to software defect prediction. IEEE Access, 5, pp. 21524-21547.
[3] Wang, T., Zhang, Z., Jing, X., Zhang, L.: Multiple kernel ensemble learning for software defect prediction. Autom. Softw. Eng. 23, 569–590 (2015).
[4] Xu, Z., Xuan, J., Liu, J., Cui, X.: MICHAC: defect prediction via feature selection based on maximal information coefficient with hierarchical agglomerative clustering. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Suita, pp. 370–381 (2016).
[5] Ryu, D., Baik, J.: Effective multi-objective naïve Bayes learning for cross-project defect prediction. Appl. Soft Comput. 49, 1062 (2016).
[6] Shan, C., Chen, B., Hu, C., Xue, J., Li, N.: Software defect prediction model based on LLE and SVM. In: Proceedings of the Communications Security Conference (CSC ’14), pp. 1–5 (2014).
[7] Yang, Z.R.: A novel radial basis function neural network for discriminant analysis. IEEE Trans. Neural Netw. 17(3), 604–612 (2006).
[8] K. Han, J.-H. Cao, S.-H. Chen, and W.-W. Liu, “A software reliability prediction method based on software development process,” in Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE), 2013 International Conference on. IEEE, 2013, pp. 280–283.
[9] S. Parthipan, S. Senthil Velan, and C. Babu, “Design level metrics to measure the complexity across versions of ao software,” in Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on. IEEE, 2014, pp. 1708–1714.
[10] A. Panichella, R. Oliveto, and A. De Lucia, “Cross-project defect prediction models: L’union fait la force,” in Software Maintenance,
[11] Bautista, A.M., Feliu, T.S.: Defect prediction in software repositories with artificial neural networks. In: Mejia, J., Munoz, M., Rocha, Á., Calvo-Manzano, J. (eds.) Trends and Applications in Software Engineering. Advances in Intelligent Systems and Computing, vol. 405. Springer, Cham (2016).
[12] H. Lu, B. Cukic, and M. Culp, “Software defect prediction using semisupervised learning with dimension reduction,” in Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on. IEEE, 2012, pp. 314–317.