4.1. Experimental Results on Scenario A
In this section, experimental results based on Scenario A as defined in
Section 3.5 are presented and discussed. Scenario A is based on assessing and comparing the prediction performances of NB and DT models based on proposed AREMFFS and baseline FS (CS, IG, REF, and NoFS) methods.
Figure 2,
Figure 3 and
Figure 4 display box-plot representations of the accuracy, AUC, and f-measure values of NB and DT classifiers with the NoFS method, the baseline FFS (IG, REF, CS) methods, and the proposed AREMFFS method. Specifically, the accuracy values of NB and DT classifiers with the FS (IG, CS, REF, and AREMFFS) and NoFS methods are presented in
Figure 2. The results indicate that NB and DT had good accuracy values on the software defect datasets. Nonetheless, the deployment of baseline FFS methods (IG, CS, and REF) further enhanced the accuracy values of NB and DT classifiers. This can be seen in their respective average accuracy values as depicted in
Figure 2. Particularly, NB and DT classifiers with the NoFS method had average accuracy values of 76% and 80.89%, respectively. Concerning baseline FFS methods, CS with NB and DT classifiers recorded average accuracy values of 78.93% and 81.97%, which indicated increments of +3.85% and +1.34%, respectively. Similar improvements were observed in models with the IG (NB: 78.51%, DT: 81.9%) and REF (NB: 78.99%, DT: 81.17%) FS methods, with increments in average accuracy values of (+3.32%, +1.25%) for IG and (+3.93%, +0.3%) for REF, respectively. However, models based on AREMFFS with NB and DT classifiers had superior average accuracy values over the NB and DT models with baseline FFS (NoFS, CS, IG, REF) methods. As presented in
Table 4, concerning models based on the NB classifier, AREMFFS had increments of +7.46%, +3.47%, +4.02%, and +3.39% in the average accuracy values over models based on the NoFS, CS, IG, and REF methods, respectively. Also, concerning models based on the DT classifier as shown in
Table 5, AREMFFS had increments of +3%, +1.63%, +1.72%, and +2.64% in average accuracy values over models based on the NoFS, CS, IG, and REF methods, respectively. These results showed that, concerning accuracy values, models based on AREMFFS outperformed models based on baseline FFS (CS, IG, REF) methods. That is, AREMFFS had a superior positive impact on the prediction accuracy values of NB and DT models over the CS, IG, and REF FS methods.
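The increments quoted above are consistent with a relative change in the average metric value. The following is a minimal sketch under that assumption; the function name `relative_increment` is hypothetical, and the inputs are the average NB and DT accuracy values reported above for the NoFS and AREMFFS methods.

```python
def relative_increment(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Average accuracy values (%) quoted in the text.
nb_nofs, nb_aremffs = 76.0, 81.67
dt_nofs, dt_aremffs = 80.89, 83.31

nb_gain = relative_increment(nb_nofs, nb_aremffs)
dt_gain = relative_increment(dt_nofs, dt_aremffs)
print(f"NB: {nb_gain:+.2f}%  DT: {dt_gain:+.2f}%")
```

Under this assumption, the NB value reproduces the reported +7.46% and the DT value rounds to the reported +3%.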
In terms of AUC values,
Figure 3 displays box-plot representations of models based on NB and DT classifiers with baseline FFS and proposed AREMFFS methods. Similar to observations on accuracy values, NB and DT models based on baseline FFS (CS, IG, and REF) had superior AUC values when compared with NB and DT models with the NoFS method. Specifically, CS had increments of +4.4% and +2.55% in AUC values for models based on NB (0.762) and DT (0.682) over the NoFS (NB: 0.73, DT: 0.665) method. Correspondingly, NB and DT models based on IG had increments of +4.25% and +1.65% in average AUC values. Also, models based on REF recorded increments of +3.15% and +0.6% in average AUC values for NB and DT classifiers, respectively, when compared with models with the NoFS method. Nonetheless, similar to the observations on accuracy values, models based on AREMFFS had superior average AUC values over models with baseline FFS (NoFS, CS, IG, REF) methods. As shown in
Table 4, concerning models based on the NB classifier, AREMFFS had increments of +7.4%, +2.88%, +3.02%, and +4.12% in average AUC values over models based on the NoFS, CS, IG, and REF methods, respectively. A similar case is observed with models based on the DT classifier as shown in
Table 5. AREMFFS had increments of +8.72%, +6.01%, +6.95%, and +8.07% in average AUC values over models based on the NoFS, CS, IG, and REF methods, respectively.
Also,
Figure 4 presents the f-measure values for NB and DT models based on the baseline FFS and proposed AREMFFS methods. Models based on the NB classifier with the CS (0.779), IG (0.778), and REF (0.776) methods recorded average f-measure value increases of +3.04%, +2.91%, and +2.64%, respectively, over the NB model with the NoFS method. As for the DT models, IG and CS recorded increments of +1.38% and +1.13% in average f-measure values, but the DT model with the REF method performed poorly, with a −0.5% decrease in the f-measure value. In summary, models based on FS methods generally had better prediction performance than models with the NoFS method, the DT model with REF being the exception. However, models based on AREMFFS with NB and DT classifiers recorded better average f-measure values than NB and DT models with baseline FFS (CS, IG, REF) methods. From
Table 4, the NB model based on AREMFFS had increments of +5.42%, +2.31%, +2.44%, and +2.71% in average f-measure values over models based on the NoFS, CS, IG, and REF methods, respectively. Also, the DT model with AREMFFS had increments of +3.51%, +2.1%, +2.36%, and +4.04% in average f-measure values over models based on the NoFS, CS, IG, and REF methods, respectively. These results further indicate the superiority of models (NB and DT) based on AREMFFS over models based on baseline FFS (NoFS, CS, IG, REF) methods.
Summarily, experimental results, as displayed in
Figure 2,
Figure 3 and
Figure 4, showed that the deployment of FS methods in SDP further enhances the prediction performances of SDP models. This finding is supported by observations in existing studies where FS methods are applied in SDP [19, 20, 21, 22]. Nonetheless, it was also observed that the effect of FS methods varies and depends on the classifiers selected in this study. Also, there are no clear-cut differences in the performances of each of the FFS (CS, IG, and REF) methods, even though the selected FFS methods have different underlying computational characteristics. Thus, the selection of an appropriate FFS method to be used in SDP processes becomes a problem that can be termed a filter rank selection problem. This observation from the experimental results strengthens the aim of our study, which proposes a rank aggregation-based multi-filter FS method for SDP. As shown in
Figure 2,
Figure 3 and
Figure 4, the proposed AREMFFS method not only had a superior positive impact on NB and DT models, but also had a more positive impact than the individual CS, IG, and REF FS methods. Particularly,
Table 4 and
Table 5 present the prediction performances (average accuracy, average AUC, and average f-measure) of NB and DT models with the proposed AREMFFS methods and the experimented baseline FFS (NoFS, CS, IG, REF) methods, respectively.
Figure 5,
Figure 6 and
Figure 7 present statistical rank tests of the models (NB and DT) tested based on accuracy, AUC, and f-measure values, respectively. Specifically, the Scott–KnottESD statistical rank test, a mean comparison approach that uses hierarchical clustering to separate mean values into statistically distinct clusters with non-negligible mean differences, was conducted [62, 80] to show significant statistical differences in the mean values of methods and results used. As depicted in
Figure 5,
Figure 6 and
Figure 7, models with different colours show that there are statistically significant differences amongst their values; hence, they are grouped into a different category. Similarly, models with the same colour indicate that there are no statistically significant differences in their values.
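To make the grouping idea concrete, the sketch below shows a simplified, single-level illustration of the two ingredients named above: splitting mean-ordered methods where the between-group sum of squares is largest, and merging the two sides again when their effect size is negligible. It is not the full Scott–KnottESD procedure (it omits the significance test and the recursive splitting). The function names, the 0.2 threshold for a negligible Cohen's d, and the per-dataset accuracy values are all illustrative assumptions, not results from this study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Absolute standardized mean difference using the pooled standard deviation.

    Assumes each sample has at least two values and non-zero pooled variance.
    """
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    return abs(mean(a) - mean(b)) / pooled


def split_once(results, negligible=0.2):
    """Split methods (ordered by mean, best first) into at most two clusters."""
    names = sorted(results, key=lambda n: mean(results[n]), reverse=True)
    if len(names) < 2:
        return [names]
    grand = mean([v for n in names for v in results[n]])

    def values(group):
        return [v for n in group for v in results[n]]

    def between_ss(i):
        left, right = values(names[:i]), values(names[i:])
        return (len(left) * (mean(left) - grand) ** 2
                + len(right) * (mean(right) - grand) ** 2)

    # Cut point that maximizes the between-group sum of squares.
    cut = max(range(1, len(names)), key=between_ss)
    # Merge the two sides if their difference is negligible.
    if cohens_d(values(names[:cut]), values(names[cut:])) < negligible:
        return [names]
    return [names[:cut], names[cut:]]


# Hypothetical accuracy values of three methods on three datasets.
toy = {
    "AREMFFS": [0.84, 0.82, 0.83],
    "CS": [0.79, 0.78, 0.80],
    "NoFS": [0.76, 0.75, 0.77],
}
clusters = split_once(toy)
print(clusters)
```

In this toy input the first method is placed in its own cluster and the other two share a cluster, which corresponds to two different colours in a rank plot.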
As presented in
Figure 5, there are statistically significant differences in the average accuracy values of NB and DT models with the proposed AREMFFS method when compared with other FS methods. In particular, for NB models, AREMFFS ranks highest (first), followed by REF, CS, and IG, which are in the same category, while the NoFS method ranks last. In the case of DT models, AREMFFS still ranks highest followed by other FS methods (CS, IG, REF, and NoFS). It should be noted that the arrangements of models from the statistical rank test are vital as models that appear first (from left to right) are superior to the other models, irrespective of their category. This observation indicates that models based on AREMFFS have superior accuracy values over models based on CS, IG, REF, and NoFS methods. Also, similar observations were recorded from statistical rank tests based on AUC values.
Figure 6 presents the Scott–KnottESD statistical rank tests based on AUC values, and there too models based on AREMFFS rank highest. In terms of NB models, AREMFFS ranks highest followed by CS, IG, and REF, which are in the same category, while the NoFS method ranks last. As for DT models, AREMFFS still ranks highest, followed by CS, IG, REF, and NoFS FS methods. Lastly, statistical rank tests based on f-measure values, as shown in
Figure 7, followed the same pattern as that of accuracy and AUC values, with models based on AREMFFS being statistically superior to the other experimented baseline FS methods. A summary of the Scott–KnottESD statistical rank tests of the proposed AREMFFS and baseline FS methods with NB and DT classifiers is presented in
Table 6.
In summary, from the experimental and statistical test results, the proposed AREMFFS method recorded a superior positive impact on the prediction performances of SDP models (NB and DT) in comparison with individual FFS (CS, IG, REF, and NoFS) methods on the defect datasets that were studied.
4.2. Experimental Results on Scenario B
This section presents and discusses experimental results based on Scenario B (see
Section 3.5). Scenario B is defined by evaluating and comparing the prediction performances of NB and DT models based on the proposed AREMFFS method and the existing (Min, Max, Mean, Range, GMean, HMean) rank aggregation-based multi-filter FS methods.
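As a reminder of what the six existing aggregators compute, the sketch below applies Min, Max, Mean, Range, GMean, and HMean to per-feature ranks produced by three filters. The feature names and rank values are hypothetical, and treating a lower aggregated value as better (with ties broken by name) is an assumption made for illustration; the exact definitions used in the experiments are those given earlier in the paper.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical ranks (1 = best) given to each feature by three filter methods.
ranks = {
    "loc": [1, 2, 1],
    "cyclomatic": [2, 1, 3],
    "fan_out": [3, 4, 2],
    "comment_ratio": [4, 3, 4],
}

AGGREGATORS = {
    "Min": min,
    "Max": max,
    "Mean": mean,
    "Range": lambda r: max(r) - min(r),
    "GMean": geometric_mean,
    "HMean": harmonic_mean,
}


def aggregate(feature_ranks, how):
    """Order features by their aggregated rank (lower first, ties by name)."""
    fn = AGGREGATORS[how]
    scores = {f: fn(r) for f, r in feature_ranks.items()}
    return sorted(scores, key=lambda f: (scores[f], f))


for name in AGGREGATORS:
    print(f"{name:>5}: {aggregate(ranks, name)}")
```

Even on this small input the aggregators disagree: Mean puts `loc` first, whereas Min ties `loc` with `cyclomatic`, and Range orders features by how much the filters disagree rather than by how highly they rank a feature.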
Figure 8,
Figure 9 and
Figure 10 show box-plot representations of the accuracy, AUC, and f-measure values of NB and DT classifiers with proposed AREMFFS and existing rank aggregation-based multi-filter FS methods. In particular,
Figure 8 presents the accuracy values of NB and DT models with AREMFFS and existing rank aggregation-based multi-filter FS methods. It can be observed that models based on AREMFFS (NB: 81.67%, DT: 83.31%) had superior average accuracy values compared to existing Min (NB: 79.72%, DT: 82.60%), Max (NB: 79.88%, DT: 82.34%), Mean (NB: 79.36%, DT: 82.53%), Range (NB: 77.99%, DT: 81.87%), GMean (NB: 79.48%, DT: 82.70%), and HMean (NB: 79.66%, DT: 82.61%) rank aggregation-based multi-filter FS methods. Specifically, based on NB models, AREMFFS had increments of +2.91%, +2.45%, +2.24%, +4.72%, +2.76%, and +2.52% in average accuracy value over the existing Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. Likewise for DT models, AREMFFS had increments of +0.95%, +0.86%, +1.18%, +1.76%, +0.74%, and +0.85% in average accuracy value over the existing Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. As observed, the experimental results indicate that models based on AREMFFS outperformed models based on existing rank aggregation-based multi-filter FS methods on accuracy values. In other words, AREMFFS had a superior positive impact on the prediction accuracy values of NB and DT models over the Min, Max, Mean, Range, GMean, and HMean rank aggregation-based multi-filter FS methods.
Concerning AUC values,
Figure 9 presents box-plot representations of models based on NB and DT classifiers with the proposed AREMFFS and existing rank aggregation-based multi-filter FS methods. Similar to observations on accuracy values, models based on AREMFFS (NB: 0.784, DT: 0.723) had superior average AUC values over models with existing Min (NB: 0.769, DT: 0.697), Max (NB: 0.770, DT: 0.688), Mean (NB: 0.767, DT: 0.687), Range (NB: 0.748, DT: 0.677), GMean (NB: 0.768, DT: 0.694), and HMean (NB: 0.769, DT: 0.696) rank aggregation-based multi-filter FS methods. As presented in
Table 7, concerning models based on the NB classifier, AREMFFS had increments of +2.22%, +1.95%, +1.82%, +4.81%, +2.08%, and +1.95% in average AUC values over models based on Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. A similar case is observed with models based on the DT classifier as depicted in
Table 8. AREMFFS had increments of +5.24%, +3.73%, +5.09%, +6.79%, +4.18%, and +3.88% in average AUC values over models based on Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. As observed, the experimental results indicated that models based on AREMFFS outperformed models based on existing rank aggregation-based multi-filter FS methods on AUC values. In other words, AREMFFS had a superior positive impact on the AUC values of NB and DT models over Min, Max, Mean, Range, GMean, and HMean rank aggregation-based multi-filter FS methods.
Furthermore, concerning f-measure values,
Figure 10 presents box-plot representations of models with NB and DT classifiers with proposed AREMFFS and existing rank aggregation-based multi-filter FS methods. Models based on AREMFFS (NB: 0.797, DT: 0.825) had superior average f-measure values over models with existing Min (NB: 0.777, DT: 0.815), Max (NB: 0.781, DT: 0.814), Mean (NB: 0.775, DT: 0.813), Range (NB: 0.789, DT: 0.802), GMean (NB: 0.775, DT: 0.814), and HMean (NB: 0.776, DT: 0.815) rank aggregation-based multi-filter FS methods. Specifically, based on the NB classifier, AREMFFS had increments of +2.84%, +2.57%, +2.05%, +5.00%, +2.84%, and +2.71% in average f-measure values over models based on Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. Also, based on the DT classifier, AREMFFS had increments of +1.48%, +1.23%, +1.35%, +2.87%, +1.35%, and +1.23% in average f-measure values over models based on Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods, respectively. As observed, the experimental results indicated that models based on AREMFFS outperformed models based on existing rank aggregation-based multi-filter FS methods on f-measure values. In other words, AREMFFS had a superior positive impact on the f-measure values of NB and DT models over Min, Max, Mean, Range, GMean, and HMean rank aggregation-based multi-filter FS methods.
In summary, the findings from the experimental results, as shown in
Figure 8,
Figure 9 and
Figure 10, indicate the superiority of the proposed AREMFFS over existing rank aggregation-based multi-filter FS methods. That is, NB and DT models based on AREMFFS outperformed NB and DT models based on existing Mean, Min, Max, Range, GMean, and HMean rank aggregation-based multi-filter FS methods. The superior performance of the proposed AREMFFS can be attributed to a combination of the robust strategy it deploys for aggregating multiple rank lists based on majority voting, and its backtracking ability which further removes irrelevant features from the generated optimal feature list.
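The two mechanisms credited above, majority voting across rank lists and a backtracking step that removes irrelevant features, can be illustrated generically. The sketch below is not the authors' algorithm: the top-`k` voting rule, the greedy drop-if-not-worse backtracking rule, the function names, and the toy scoring function (standing in for classifier performance on a feature subset) are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(rank_lists, k):
    """Keep features that appear in the top-k of more than half of the rank lists."""
    votes = Counter(f for ranking in rank_lists for f in ranking[:k])
    needed = len(rank_lists) // 2 + 1
    return sorted(f for f, v in votes.items() if v >= needed)


def backtrack(features, score):
    """Greedily drop any feature whose removal does not lower the score."""
    kept = list(features)
    best = score(kept)
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if trial:
            trial_score = score(trial)
            if trial_score >= best:
                kept, best = trial, trial_score
    return kept


# Hypothetical rank lists from three filters (best feature first).
rank_lists = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]


def toy_score(subset):
    """Toy stand-in for model performance: 'a' and 'b' help, others add noise."""
    useful = {"a", "b"}
    chosen = set(subset)
    return len(chosen & useful) - 0.1 * len(chosen - useful)


voted = majority_vote(rank_lists, k=3)
selected = backtrack(voted, toy_score)
print(voted, selected)
```

Here voting keeps `a`, `b`, and `c`, and the backtracking pass then drops `c` because removing it does not reduce the toy score.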
For further analysis, the performances of the proposed AREMFFS and the existing rank aggregation-based multi-filter FS methods were subjected to Scott–KnottESD statistical rank tests to determine the statistically significant differences in their respective performances.
Figure 11,
Figure 12 and
Figure 13 present statistical rank tests of the proposed AREMFFS and existing rank aggregation-based multi-filter FS methods on NB and DT classifiers based on accuracy, AUC, and f-measure values, respectively. As depicted in
Figure 11A, it can be observed that there are statistically significant differences in the average accuracy values of NB models using the proposed AREMFFS when compared with NB models based on existing rank aggregation-based multi-filter FS methods.
Specifically, AREMFFS ranks highest (first), followed by Max, Min, HMean, GMean, and Mean, which are in the same category, while the Range aggregation method ranks last. In the case of DT models as presented in
Figure 11B, although there is no significant statistical difference in the prediction accuracy values, AREMFFS still ranks highest, followed by the GMean, HMean, Min, Mean, Max, and Range aggregation methods. In this case, the order of superiority/arrangements (left to right) of the models from the statistical test was considered and AREMFFS appeared first. These findings indicate that models based on AREMFFS have superior accuracy values over models based on Min, Max, Mean, Range, GMean, and HMean rank aggregation-based multi-filter FS methods. Also, similar observations were recorded from statistical rank tests based on AUC values.
Figure 12 presents the Scott–KnottESD statistical rank tests based on AUC values, and models based on AREMFFS rank highest. In terms of NB models, AREMFFS ranks highest followed by Max, Min, HMean, GMean, and Mean, which are in the same category, while the Range rank aggregation method ranks last. A similar situation can be observed in the case of DT models as AREMFFS ranks highest followed by the Min, HMean, GMean, Max, Mean, and Range rank aggregation methods. In addition, statistical rank tests based on f-measure values, as presented in
Figure 13, indicated findings similar to those for the accuracy and AUC values, with models based on AREMFFS ranking ahead of models based on the Min, Max, Mean, Range, GMean, and HMean rank aggregation-based multi-filter FS methods.
Table 9 summarizes and presents the Scott–KnottESD statistical rank tests of proposed AREMFFS and existing rank aggregation-based multi-filter FS methods with NB and DT classifiers.
In summary, based on the experimental and statistical test results, the proposed AREMFFS method recorded a superior positive impact on the prediction performances of SDP models (NB and DT) over existing rank aggregation-based multi-filter FS (Min, Max, Mean, Range, GMean, HMean) methods on the defect datasets studied.
4.3. Experimental Results on Scenario C
In this section, experimental results based on Scenario C (see
Section 3.5) are presented and discussed. Scenario C is based on assessing and comparing the prediction performances of NB and DT models based on the proposed AREMFFS and its variant (REMFFS: rank aggregation-based ensemble multi-filter feature selection) as proposed in this study. The REMFFS method is based on the same working principle as AREMFFS but without the backtracking function included. The results of this analysis will allow for a fair comparison between the two and empirically validate the effectiveness of the proposed AREMFFS method.
Figure 14,
Figure 15 and
Figure 16 present box-plot representations of the accuracy, AUC, and f-measure values of NB and DT classifiers with the proposed AREMFFS and REMFFS methods. Correspondingly,
Figure 14 presents the accuracy values of NB and DT models with the AREMFFS and REMFFS methods. It can be observed that models based on AREMFFS (NB: 81.67%, DT: 83.31%) had superior average accuracy values when compared with the REMFFS (NB: 80.62%, DT: 82.75%) methods. In particular, based on NB and DT models, AREMFFS had increments of +1.3% and +0.67% in average accuracy values, respectively, over the REMFFS method. As observed, the experimental results indicated that models based on AREMFFS outperformed models based on REMFFS on accuracy values. That is, AREMFFS had a superior positive impact on the prediction accuracy values of NB and DT models over the REMFFS method.
Regarding AUC values,
Figure 15 shows box-plot representations of models based on NB and DT classifiers with the proposed AREMFFS and REMFFS methods. As presented in
Table 10 and
Table 11, models based on AREMFFS (NB: 0.784, DT: 0.723) had superior average AUC values over models with REMFFS (NB: 0.771, DT: 0.699). Specifically, NB and DT models with AREMFFS had increments of +1.69% and +3.43% in average AUC values, respectively, over models based on the REMFFS method. As observed, the experimental results indicated that models based on AREMFFS outperformed models based on REMFFS. In other words, AREMFFS had a superior positive impact on the AUC values of NB and DT models over the REMFFS method. Also, in terms of f-measure values,
Figure 16 presents box-plot representations of models with NB and DT classifiers with proposed AREMFFS and REMFFS methods. Models based on AREMFFS (NB: 0.797, DT: 0.825) had superior average f-measure values over models with the REMFFS (NB: 0.778, DT: 0.813) method. In particular, NB and DT models with AREMFFS had increments of +2.44% and +1.48% in average f-measure values, respectively, over models based on the REMFFS method. Similarly, the experimental results showed the superiority of models based on the proposed AREMFFS as it outperformed models based on REMFFS on f-measure values. That is, AREMFFS had a superior positive impact on the f-measure values of NB and DT models over the REMFFS method.
Based on the preceding experimental results as presented in
Figure 14,
Figure 15 and
Figure 16, the superiority of the proposed AREMFFS over REMFFS can be observed. Correspondingly, the observed superior performance of the proposed AREMFFS over REMFFS can be attributed to its backtracking ability to further remove irrelevant features from the generated optimal feature list. As established, the removal of irrelevant features will further improve the performance of the proposed AREMFFS method. However, it should be noted that REMFFS, which is a variant of AREMFFS, generated a good and competitive prediction performance.
Figure 17,
Figure 18 and
Figure 19 further analysed the performance of AREMFFS and REMFFS methods statistically. In other words, the Scott–KnottESD statistical rank test was used to determine the statistically significant differences in their respective performances based on accuracy, AUC, and f-measure values, respectively.
As presented in
Figure 17, it can be observed that there are no statistically significant differences in the average accuracy values of NB and DT models with the proposed AREMFFS and REMFFS methods. Nonetheless, AREMFFS still ranks higher than the REMFFS method when the order of superiority/arrangement (left to right) of the models from the statistical test is considered; that is, AREMFFS appeared first.
From
Figure 19, similar observations were recorded from statistical rank tests based on f-measure values, as there is no statistically significant difference in the average f-measure values of models based on the two methods (AREMFFS and REMFFS), although AREMFFS ranks first. The case is slightly different for Scott–KnottESD statistical rank tests based on AUC values as shown in
Figure 18. In terms of NB models (
Figure 18A), AREMFFS and REMFFS fall into the same grouping. That is, the difference between their respective AUC values is statistically insignificant. However, for DT models (
Figure 18B), AREMFFS is statistically superior to REMFFS, as there is a significant difference in their AUC values. In addition, a summary of the statistical test analyses on the performance of AREMFFS and REMFFS methods is presented in
Table 12. These findings indicate that models based on AREMFFS have superior performance over models based on the REMFFS method.
Based on the experimental and statistical test results, the proposed AREMFFS method recorded a superior positive impact on the prediction performances of SDP models (NB and DT) over its variant (REMFFS method) for the defect datasets that were studied.
In summary, from the experimental results and statistical test analyses, the proposed AREMFFS method had a superior positive effect on the prediction performances of SDP models (NB and DT) compared to the individual filter FS methods (CS, IG, REF, and NoFS), existing rank aggregation-based multi-filter FS methods (Min, Max, Mean, Range, GMean, and HMean), and its variant (REMFFS) on the defect datasets studied. These findings, therefore, answer RQ1 and RQ2 (see
Section 3.6) as presented in
Table 13. Furthermore, the efficacy of AREMFFS addresses the filter rank selection problem in SDP by integrating the power of individual filter FS methods. As a result, combining filter (multi-filter) methods is suggested as a viable choice for harnessing the power of the respective FFS methods and the strengths of filter–filter relationships in selecting germane features, as conducted in this study.