We first introduce the metrics used in the experiments. Then, we present the experiments that we designed to answer the research questions of our study. For each experiment, we state the objective, overview the execution details, and present the results.
4.1 [RQ-1: Similarity Measurements for Buggy and Patched Code Using Embeddings]
Objective: We investigate the capability of different learned embeddings to capture the (dis)similarity between buggy code fragments and (in)correctly-patched ones. The experiments are designed to provide answers to two sub-questions:
– RQ-1.1 Is correctly-patched code actually similar to buggy code based on learned embeddings?
– RQ-1.2 To what extent is buggy code more similar to correctly-patched code than to incorrectly-patched code?
Experimental Design for RQ-1.1: Using the four embedding models considered in our study (cf. Section 3.4), we produce the learned embeddings for the buggy and patched code fragments associated with 36k patches from the five repair benchmarks shown in Table 2. In this case, the patched code fragment is the correctly-patched one, since it comes from labeled benchmark data (generally representing human-written patches). Given those learned embeddings (i.e., deep learned representation vectors of code), we compute the cosine similarity between the vectors representing the buggy and correctly-patched code fragments.
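For illustration, the following is a minimal sketch of the similarity computation, assuming the embedding vectors of the buggy and patched fragments have already been produced by one of the models (BERT, CC2Vec, Doc2Vec, or code2vec); the `embed` helper and variable names are hypothetical.

```python
import numpy as np

def cosine_similarity(buggy_vec: np.ndarray, patched_vec: np.ndarray) -> float:
    """Cosine similarity between the embedding of a buggy code fragment
    and the embedding of its patched counterpart."""
    denom = np.linalg.norm(buggy_vec) * np.linalg.norm(patched_vec)
    if denom == 0.0:
        return 0.0  # degenerate (all-zero) embedding
    return float(np.dot(buggy_vec, patched_vec) / denom)

# Hypothetical usage: one similarity score per (buggy, patched) pair.
# scores = [cosine_similarity(embed(buggy), embed(patched))
#           for buggy, patched in fragment_pairs]
```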
Results for RQ-1.1: Figure 7 presents the boxplots of the similarity distributions obtained with the different embedding models for the samples of each dataset. The Doc2Vec and code2vec models appear to yield lower similarity values than BERT and CC2Vec.
Figure 8 zooms in on the boxplot region of each embedding model to highlight the differences across benchmarks. We observe that, when embedding the patches with BERT, the similarity distribution for the Defects4J patches is similar to those of Bugs.jar and Bears, but differs from those of ManySStuBs4J and QuixBugs. Mann–Whitney–Wilcoxon (MWW) tests [43, 64] confirm that the median scores of Defects4J, Bugs.jar, and Bears are indeed comparable, and further confirm the statistical significance of the difference between the Defects4J and ManySStuBs4J/QuixBugs scores.
Defects4J, Bugs.jar, and Bears include diverse human-written patches for a large spectrum of bugs from real-world open-source Java projects. In contrast, ManySStuBs4J only contains patches for single-statement bugs. The QuixBugs dataset is further limited by its size and by the fact that its patches are built by simply mutating the code of a small Java implementation of 40 algorithms (quicksort, levenshtein, etc.).
While CC2Vec and Doc2Vec exhibit performance patterns roughly similar to BERT's (although at different scales), the experimental results with code2vec present different patterns across datasets. Note that, due to parsing failures of code2vec, we eventually considered only 118 Bears patches, 123 Bugs.jar patches, 46 Defects4J patches, 20,840 ManySStuBs4J patches, and 8 QuixBugs patches. This reduction in dataset sizes could explain the difference with respect to the other embedding models.
Experimental Design for RQ-1.2: To compare the similarity scores of correctly-patched vs. incorrectly-patched code fragments against the buggy one, we combine datasets of correct patches with datasets of incorrect patches. Note that all patches in our experiments are plausible, since we focus on correctness: plausibility is straightforward to decide based on test suites. Correct patches are provided in the benchmarks; however, none of the benchmarks in our study contains incorrect patches. Therefore, we rely on the dataset released by Liu et al. [38]: 674 plausible but incorrect patches generated by 16 repair tools for 184 Defects4J bugs are considered from this dataset. Those 674 incorrect patches were selected within a larger set of incorrect patches by adding the constraint that an incorrect patch must change the same code location as the developer-provided patch in the benchmark: such incorrect patches may indeed be the most challenging to identify with heuristics.
We consider three scenarios for selecting the correct patches used in the comparison of similarity scores. (1) Imbalanced-all: following a first intuition, we compare the 674 incorrect patches against all correct patches from the five benchmarks. (2) Imbalanced-Defects4J: we use only the correct patches from Defects4J; we design this second scenario because correct patches from other benchmarks may introduce a sampling bias. (3) Balanced-Defects4J: we use the correct patches of the 184 Defects4J bugs targeted by the 674 incorrect patches; in this scenario, the correct and incorrect sets contain the same number of patches, which avoids the bias of imbalanced sets. The three scenarios are specified in Table 3.
Results for RQ-1.2: In this experiment, we further assess whether incorrectly-patched code exhibits similarity score distributions different from those of correctly-patched code. Figure 9 shows the distributions of cosine similarity scores for correct patches (i.e., similarity between buggy code fragments and correctly-patched ones) and incorrect patches (i.e., similarity between buggy code fragments and incorrectly-patched ones). The comparison is done under the different scenarios specified in Table 3.
The comparisons do not include the case of learned embeddings for code2vec. Indeed, unlike the previous experiment where code2vec was able to parse enough code fragments, for the considered 184 correct patches of Defects4J, code2vec failed to parse most of the relevant code fragments. Hence, we focus the comparison on the other three embedding models (pre-trained BERT, trained Doc2Vec, and pre-trained CC2Vec). Overall, we observe that the distribution of cosine similarity scores is substantially different for correctly-patched and incorrectly-patched code fragments.
We observe that the similarity distributions between buggy code and patched code are significantly different for incorrect patches compared to correct patches. The difference of median values is confirmed to be statistically significant by an MWW test. Note that the difference remains high for BERT, Doc2Vec, and CC2Vec, whether the correctly-patched code is the counterpart of the incorrectly-patched one (i.e., the Balanced-Defects4J scenario) or comes from a larger dataset (i.e., the Imbalanced-Defects4J scenario). As for the comparison under the Imbalanced-all scenario, the heuristic remains valid, although it may be affected by the other benchmarks, i.e., results may be driven by the different sets of bugs.
4.2 [RQ-2: Filtering of Incorrect Patches Based on Similarity Thresholds]
Objective: Following up on the findings related to the first research question, we investigate the selection of cut-off similarity scores to decide which APR-generated patches are likely incorrect. Results from this investigation will provide insights to guide the exploitation of code learned embeddings in program repair pipelines.
Experimental Design: To select threshold values, we consider the distributions of similarity scores from the above experiments (cf. Section 4.1). Table 4 summarizes relevant statistics on the distribution of similarity scores for correct patches. Given the differences exhibited with incorrect patches in previous experiments, we use, for example, the 1st quartile value as an inferred threshold.
Given our previous findings that different datasets exhibit different similarity score distributions, we also consider inferring a specific threshold for the QuixBugs dataset (cf. statistics in Table 5).
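As an illustration of how such threshold statistics can be derived, the sketch below computes the 1st quartile and mean of the similarity scores of correct patches per benchmark; the DataFrame layout and column names are assumptions made for this example.

```python
import pandas as pd

# Assumed layout: one row per correct patch, with the benchmark it comes from
# and the cosine similarity between its buggy and patched embeddings.
scores = pd.DataFrame({
    "benchmark": ["Defects4J", "Defects4J", "QuixBugs", "QuixBugs"],
    "similarity": [0.92, 0.85, 0.71, 0.64],
})

# Candidate cut-off values per benchmark: 1st quartile and mean.
thresholds = scores.groupby("benchmark")["similarity"].agg(
    q1=lambda s: s.quantile(0.25),
    mean="mean",
)
print(thresholds)
```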
Our test data consists of 64,293 patches generated by 11 APR tools in the empirical study of Durieux et al. [11]. First, we use the four embedding models to generate learned embeddings of the buggy and patched code fragments and compute their cosine similarity scores. Second, for each bug, we rank all generated patches based on the similarity scores between the patched code and the buggy code, considering that the higher the score, the more likely the patch is correct. Finally, to filter incorrect candidates, we consider two experiments (see the sketch after the list):
(1) Patches whose similarity scores are lower than the inferred threshold (i.e., the 1st quartile in previous experimental data) are considered incorrect. Patches whose patched code exhibits similarity scores higher than the threshold are considered correct.
(2) Alternatively, for each bug, only the top-1 patch with the highest similarity score is considered correct; all other patches are considered incorrect.
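The sketch below illustrates the two filtering strategies under simple assumptions: each candidate patch generated for a bug comes with a precomputed cosine similarity score, and the identifiers and threshold value are hypothetical.

```python
from typing import Dict, List, Tuple

def filter_by_threshold(scored: List[Tuple[str, float]], threshold: float) -> Dict[str, bool]:
    """Strategy (1): a patch is considered correct iff its similarity score
    reaches the inferred threshold (e.g., the 1st quartile of the similarity
    scores of correct patches in previous experiments)."""
    return {patch_id: score >= threshold for patch_id, score in scored}

def keep_top1(scored: List[Tuple[str, float]]) -> Dict[str, bool]:
    """Strategy (2): only the patch with the highest similarity to the buggy
    code is considered correct; all other candidates are rejected."""
    best = max(scored, key=lambda p: p[1])[0] if scored else None
    return {patch_id: patch_id == best for patch_id, _ in scored}

# Hypothetical candidates generated by APR tools for one bug.
candidates = [("patch_1", 0.91), ("patch_2", 0.78), ("patch_3", 0.95)]
print(filter_by_threshold(candidates, threshold=0.85))  # patch_2 filtered out
print(keep_top1(candidates))                            # only patch_3 kept
```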
In all cases, we systematically validate the correctness of all 64,293 patches to obtain correctness labels, which the dataset authors did not provide (all plausible patches having been considered valid). First, if the file(s) modified by a patch are not the same as the buggy files in the benchmark, we systematically consider it incorrect: with this simple scheme, 33,489 patches are found incorrect. Second, within the same file, if the patch does not make changes at the same code locations, we consider it incorrect: 26,386 patches are further tagged as incorrect with this decision (cf. Threats to validity in Section 5). Finally, for the remaining 4,418 plausible patches in the dataset, we manually validate correctness by following the strict criteria enumerated by Liu et al. [38] to enable reproducibility. Overall, we could label 900 correct patches; the remaining patches are considered incorrect.
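The systematic part of this labeling can be pictured with the following sketch, where a patch is represented as a mapping from modified files to changed line numbers; this representation, the overlap criterion, and the function names are simplifications for illustration only.

```python
from typing import Dict, Set

Locations = Dict[str, Set[int]]  # modified file -> changed line numbers

def pre_label(generated: Locations, developer: Locations) -> str:
    """Systematic pre-labeling of a plausible APR-generated patch against the
    developer (ground-truth) patch of the same bug."""
    if set(generated) != set(developer):
        return "incorrect"        # step 1: different files are modified
    for file, lines in generated.items():
        if not lines & developer[file]:
            return "incorrect"    # step 2: same file, but disjoint locations
    return "manual-check"         # remaining patches are validated manually
```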
Results: By considering the patch with the highest (top-1) similarity score between the patched code and the buggy code as correct, we were able to identify a correct patch for 10% (with BERT), 9% (with CC2Vec), and 10% (with Doc2Vec) of the bug cases. Overall, we also misclassified 96% of correct patches as incorrect. However, only 1.5% of incorrect patches were misclassified as correct.
Given that a bug can be fixed with several correct patches, the top-1 criterion may not be adequate. Furthermore, this criterion assumes that a correct patch indeed exists among the patch candidates. By using filtering thresholds inferred from previous experiments (which do not include the test dataset of this experiment), we can attempt to filter all incorrect patches generated by APR tools. Filtering results presented in Table 6 show the recall scores that can be reached. We provide experimental results when using the 1st quartile and mean values of the similarity scores in the "training" set as threshold values. The thresholds are also applied with respect to the datasets: thresholds learned on the QuixBugs benchmark are applied to generated patches for QuixBugs bugs.
4.3 [RQ-3: Classification of Correct Patches with Supervised Learning]
Objective: Cosine similarity between learned embeddings (which was used in the previous experiments) gives every deep learned feature the same weight as the others in the embedding vector. We investigate the feasibility of inferring, using machine learning, the weights that different features carry with respect to patch correctness. To this end, we build a patch correctness prediction framework, Leopard (LEarn tO Predict pAtch coRrectness with embeDdings), with the embedding models and machine learning algorithms. We compare the prediction results of Leopard against those of related approaches in the literature. The experiments are performed to provide insights for three sub-questions:
– RQ-3.1 Can Leopard learn to predict patch correctness by training classifiers based on the learned embeddings of code?
– RQ-3.2 Can Leopard be as reliable as a dynamic state-of-the-art approach such as PATCH-SIM in the patch correctness identification task?
– RQ-3.3 To what extent do the learned embeddings of Leopard provide different prediction results than engineered features?
Experimental Design for RQ-3.1: To perform our machine learning experiments, we first require a ground-truth dataset. To that end, we rely on labeled datasets in the literature. Since incorrect patches generated by APR tools are only available for the Defects4J bugs, we focus on labeled patches provided by three independent teams (Liu et al. [38], Ye et al. [73], and Xiong et al. [66]) and other patches generated by APR tools. Very few patches generated by the different tools are actually labeled as correct, which leads to an imbalanced dataset. To reduce this imbalance, we supplement the dataset with the developer (correct) patches supplied in the Defects4J benchmark. Note that one developer patch can include multiple fixing hunks across different files, while the extraction of engineered features only works on patches that change a single file. Thus, we split such patches into sub-patches by their changed files, so that each sub-patch involves only one code file. In total, we collected 2,687 patches. After removing duplicates, 2,244 patches remained. For 97 patches, the engineered features could not be extracted. Eventually, the ground-truth dataset is built with 2,147 patches, as shown in Table 7.
Our ground-truth dataset patches are then fed to the embedding models in Leopard to produce embedding vectors. As in previous experiments, the parsability of Defects4J patch code fragments prevented the application of code2vec: Leopard uses pre-trained models of BERT (trained on natural language text) and CC2Vec (trained on code changes) as well as a retrained model of Doc2Vec (trained on patches). Since the representation learning models are applied to code fragments inferred from patches (and not to the patches themselves), Leopard collects the embeddings of both the buggy and the patched code fragments for each patch. Then Leopard must merge these vectors into a single input vector for the classification algorithm. We follow an approach demonstrated by Hoang et al. [15] in recent work on bug-fix patch prediction: the classification model performs best when the features of the patched and buggy code fragments are crossed together.
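A minimal sketch of such a crossing is shown below; the exact operators used in Leopard (following Hoang et al. [15]) are not reproduced here, so the element-wise difference and product are illustrative choices.

```python
import numpy as np

def cross_features(buggy_vec: np.ndarray, patched_vec: np.ndarray) -> np.ndarray:
    """Merge the buggy and patched fragment embeddings into a single
    classifier input by concatenating the raw vectors with simple
    element-wise interaction terms."""
    return np.concatenate([
        buggy_vec,
        patched_vec,
        np.abs(patched_vec - buggy_vec),  # what changed, per dimension
        patched_vec * buggy_vec,          # agreement between the two fragments
    ])
```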
At first, and following related works in the literature, we used a 10-fold cross validation scheme to evaluate and compare our approach against the state-of-the-art. However, we found that, with this scheme, a patch set generated for the same bug can be split into both the training and testing sets. Such a scenario is actually unrealistic (and biased) since we should not train the model with some labeled patches of a bug that we intend to repair (test set). To address this bias, we propose instead a 10-group cross validation scheme: First, we randomly distribute all bugs into 10 groups. Every group contains unique bugs and their associated patches. Then, we use nine groups as train data and the remaining group as the test data. Finally, we repeat the selection of train and test groups for ten rounds and obtain the average score of the metrics.
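For illustration, the sketch below realizes the bug-level grouping with scikit-learn's GroupKFold, which keeps all patches of a given bug in a single fold; unlike the random grouping described above, GroupKFold is deterministic, and the data here is synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-ins: X - patch feature vectors, y - correctness labels,
# bug_ids - identifier of the bug each patch was generated for.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)
bug_ids = rng.integers(0, 40, size=200)  # 40 hypothetical bugs

gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(X, y, groups=bug_ids):
    # No bug contributes patches to both the training and the test sets.
    assert set(bug_ids[train_idx]).isdisjoint(set(bug_ids[test_idx]))
```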
Results for RQ-3.1: We compare the performance of the different embedding models using different classification algorithms. Table 8 presents the results with a 10-group cross validation setup. All classical metrics used for assessing predictors are reported: Accuracy, Precision, Recall, F1-measure, and Area Under Curve (AUC). XGBoost applied to BERT embeddings yields the best performance on most metrics (e.g., AUC of 0.803 and F1-measure of 0.765), while the DNN achieves the best precision at 0.744.
Our previous work [57] was conducted with 5-fold cross validation. To evaluate how performance changes on the newly augmented dataset, we re-run a 5-fold cross validation experiment. The results show that, after increasing the number of training examples (1,147 more patches), the performance of the Decision Tree, Logistic Regression, and Naive Bayes classifiers improves. For instance, when applying the three classifiers with BERT embeddings, their accuracy, precision, recall, and F1-measure improve by 3 to 23.6 percentage points (except the recall of Naive Bayes + BERT embeddings, which decreases). Their AUC values increase by 0.067, 0.060, and 0.126, respectively. These results suggest that patch correctness identification can be further improved through dataset augmentation. Note that, for the following experiments, we focus on 10-group cross validation because it better reflects how the approaches would be evaluated in practice.
Experimental Design for RQ-3.2: PATCH-SIM [66] is the state-of-the-art work on predicting patch correctness for APR tools. It is a dynamic approach that generates execution traces of patched programs with newly generated tests, and compares the execution traces across test cases to assess the correctness of APR-generated patches. We propose to apply PATCH-SIM to our collected patches (cf. Table 7). Unfortunately, PATCH-SIM is implemented to run on Defects4J-v1.2.0. Therefore, it failed to process 476 patches generated for some bugs (e.g., JSoup bugs) from the latest version of Defects4J (i.e., Defects4J-v2.0.0). Furthermore, even when PATCH-SIM can run, it does not yield any prediction output for 1,022 patches. Eventually, we were able to assess the performance of PATCH-SIM on 649 patches. To avoid a potential bias in the comparison, we also conduct the ML-based classification experiments for Leopard on these 649 patches.
Results for RQ-3.2: Table 9 compares the results of predicting patch correctness. In terms of recall, PATCH-SIM achieves 78.9%, which is slightly higher than BERT embeddings + Random Forest in Leopard, and demonstrates its ability to recall correct patches among plausible patches, as reported by its authors [66]. However, its accuracy, precision, and AUC are only 38.8%, 24.7%, and 52.8%, respectively. These results underperform the three ML classifiers of Leopard, indicating that many incorrect patches are wrongly identified as correct by PATCH-SIM. Figure 10 further compares the BERT embedding + XGBoost classifier of Leopard against PATCH-SIM in terms of the number of (in)correct patches correctly identified by each. The XGBoost classifier of Leopard recalls more correct and incorrect patches than PATCH-SIM, with 24 correct patches and 124 incorrect patches exclusively predicted correctly by it.
Time cost. We recorded that, on average, PATCH-SIM takes ∼17.5 minutes to predict the correctness of each patch. In contrast, each of the ML classifiers of Leopard takes less than one minute for prediction. However, note that training Leopard requires the learned embeddings of patches produced by pre-trained models (e.g., BERT). Such models, which are available off-the-shelf, have been trained using hundreds of TPUs running for several hours on a large corpus.
Experimental Design for RQ-3.3: As reported by Ye et al. [71] in a recent study, post-processing APR-generated patches with engineered features achieves promising results. Therefore, in this study, we also use some of the engineered features (Prophet features and repair patterns) from [71] to predict correct patches on a larger dataset: our study is based on 2,147 patches while Ye et al. used only 713 patches. Results are reported based on 10-group cross validation.
Results for RQ-3.3: Table 10 presents the results of predicting patch correctness with the engineered features. The Naive Bayes learner achieves an unusual performance compared to the other five learners: it yields the highest precision but a much lower recall, which suggests that only a very small number of correct patches can be recalled with this learner. The Random Forest and XGBoost learners achieve similarly high performance (e.g., F1-measure at 74.7%/74.1% and AUC at 76.9%/77.6%), followed by the DNN learner. Overall, the performance achieved with engineered features is generally comparable (in terms of global metrics) to that yielded by Leopard using learned embeddings, except when using the Naive Bayes and Decision Tree learning algorithms.
Figure 11 further illustrates the differences between the XGBoost classifier trained with the BERT embeddings and the one trained with the engineered features, in terms of the number of identified (in)correct patches. The XGBoost classifier correctly identifies more (in)correct patches in both scenarios. Nevertheless, the two feature sets remain largely complementary in identifying patch correctness.
4.4 [RQ-4: Combining Learned Embeddings and Engineered Features for More Accurate Classification of Correct Patches]
Objective: Following up on the insights from the previous research question, which compared engineered features against learned embeddings, we investigate the potential of leveraging both feature sets to improve the classification of correct patches.
Experimental Design: Leveraging different feature sets can be achieved in several ways, e.g., by concatenating feature vectors or by performing ensemble learning. In this study, we investigate three different methods, which are implemented in the upgraded version of Leopard, Panther (Predict pAtch correctNess wiTH the learned Embeddings and engineeRed features), as illustrated in Figure 12 (a sketch of the first two methods follows the list):
(1) Ensemble learning. We rely on the six learning algorithms (cf. Tables 8 and 10) to predict the correctness of patches based either on the learned embeddings or on the engineered features. To combine both, we simply compute the average prediction probability of a pair of classifiers (one trained with learned embeddings and the other with engineered features), and use this probability to decide on patch correctness.
(2) Naïve Vector Concatenation. In the second method, we ignore the fact that learned embedding vectors and engineered feature vectors do not come from the same space and naïvely concatenate them into a single representation. Our intuition is that both representations capture different properties of patches and can therefore, together, offer a better representation. The concatenated vectors are then used to train the classifiers (with the usual learning algorithms).
(3) Deep Combination. In the last method, we consider that learned embeddings and engineered features come from different spaces. Therefore, we must learn their respective weights as well as a common representation before concatenation. We thus resort to DNNs to attempt a deep combination of feature sets before classification.
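The first two combination methods can be sketched as follows, assuming the learned embeddings and the engineered features are available as two matrices for the same patches; the data, dimensions, and classifier settings are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier

# Illustrative inputs for the same 500 patches: learned embeddings (E),
# engineered features (F), and correctness labels (y).
rng = np.random.default_rng(0)
E = rng.normal(size=(500, 64))   # e.g., crossed BERT embeddings
F = rng.normal(size=(500, 20))   # e.g., ODS-style engineered features
y = rng.integers(0, 2, size=500)

# (1) Ensemble learning: one classifier per feature set, then average the
# predicted probabilities of correctness.
clf_e = XGBClassifier(n_estimators=100).fit(E, y)
clf_f = XGBClassifier(n_estimators=100).fit(F, y)
p_ensemble = (clf_e.predict_proba(E)[:, 1] + clf_f.predict_proba(F)[:, 1]) / 2

# (2) Naive Vector Concatenation: a single classifier trained on the
# concatenation of both representations.
X_concat = np.concatenate([E, F], axis=1)
clf_concat = XGBClassifier(n_estimators=100).fit(X_concat, y)
p_concat = clf_concat.predict_proba(X_concat)[:, 1]
# (Predictions are shown in-sample only for brevity.)
```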
In this RQ, given the performance of BERT in previous experiments (cf. Table 8), we focus on the BERT embedding model to produce the learned embeddings of patches. Similarly, we only consider Random Forest and XGBoost, the best-performing learners (cf. Tables 8 and 10). The Deep Combination method is based on the work of Cheng et al. [6], who proposed a deep learning fusion structure that combines layers specialized to explore memorization and generalization of features. Following this idea of fusion, we design a Double-DNN-fusion structure (sketched below) where learned embeddings are considered useful for generalization and engineered features for memorization. Eventually, we conduct 10-group cross validation for the experimental assessment.
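A minimal PyTorch sketch of such a fusion structure is given below, loosely following the wide-and-deep idea of Cheng et al. [6]: a deep branch processes the learned embeddings (generalization) and a shallow branch processes the engineered features (memorization) before the two are concatenated. Layer sizes and the exact architecture are assumptions, not the actual Panther implementation.

```python
import torch
import torch.nn as nn

class DoubleDNNFusion(nn.Module):
    """Illustrative fusion: a deep branch for learned embeddings and a
    shallow (wide) branch for engineered features, concatenated before
    the final patch-correctness prediction."""
    def __init__(self, emb_dim: int, eng_dim: int):
        super().__init__()
        self.deep = nn.Sequential(            # generalization branch
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        self.wide = nn.Linear(eng_dim, 16)    # memorization branch
        self.head = nn.Linear(64 + 16, 1)     # fused correctness score

    def forward(self, emb: torch.Tensor, eng: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.deep(emb), torch.relu(self.wide(eng))], dim=1)
        return torch.sigmoid(self.head(fused))

model = DoubleDNNFusion(emb_dim=64, eng_dim=20)
scores = model(torch.randn(8, 64), torch.randn(8, 20))  # batch of 8 patches
```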
Results: Table 11 presents the performance comparison for correctness identification when using combined features vs. single feature sets. The comparison is done in terms of three main metrics: +Recall (to what extent correct patches can be identified), -Recall (to what extent incorrect patches can be filtered out), and AUC (area under the ROC curve, i.e., overall performance of the predictor). Overall, the performance of classifying correct patches improves with each of the three combination strategies (except the -Recall of the Random Forest classifier with Naïve Vector Concatenation) for the learned (BERT) and engineered (ODS) features. With respect to +Recall (i.e., recalling the correct patches), the Random Forest and XGBoost-based classifiers with Ensemble Learning achieve the highest value at 83.7%, improving by 1 to 6 percentage points over the performance with single feature sets. With respect to -Recall (i.e., filtering out the incorrect patches), the best classifier is the DNN with the Deep Combination of features: it correctly excludes 69.6% of the incorrect patches. With respect to AUC, the XGBoost-based classifier with Ensemble Learning presents the best performance at 82.2%, improving by 2 to 5 percentage points over the performance with single feature sets. To sum up, combining the BERT embeddings of patches with their ODS features does improve the performance of identifying patch correctness. Note that, in general, Ensemble Learning applied to independently trained classifiers yields the highest performance gains. McNemar's statistical hypothesis test [10] further confirms that the gains are statistically significant for Ensemble Learning and Deep Combination, while this is not the case for Naïve Vector Concatenation. This suggests that the features (learned and engineered) come from different spaces and are best exploited when applied standalone to model patch correctness, while still complementing each other in terms of prediction.
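For reference, McNemar's test compares two classifiers on the same test patches via the 2x2 table of cases where each classifier's prediction was right or wrong; the sketch below shows one way to apply it with statsmodels, using illustrative prediction vectors.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Illustrative outcomes on the same test patches: 1 = classifier predicted
# the patch's correctness label correctly, 0 = it did not.
right_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # e.g., combined features
right_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])  # e.g., single feature set

# 2x2 contingency table over (A right/wrong) x (B right/wrong); the test
# focuses on the discordant cells (A right & B wrong, A wrong & B right).
table = [
    [np.sum((right_a == 1) & (right_b == 1)), np.sum((right_a == 1) & (right_b == 0))],
    [np.sum((right_a == 0) & (right_b == 1)), np.sum((right_a == 0) & (right_b == 0))],
]
print(mcnemar(table, exact=True).pvalue)
```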
Figure 13 further highlights the number of (in)correct patches identified based on BERT embeddings, engineered features, and the combined features, respectively. Since the Random Forest learner presents performance similar to XGBoost, Figure 13 focuses on the latter.
From a qualitative point of view, with Ensemble Learning, more (in)correct patches can be identified than with each single feature set (i.e., BERT embeddings or engineered features). However, this combination does not help to identify patches that were not identified by at least one feature set. In contrast, with Naïve Vector Concatenation and Deep Combination, which combine features before classification, we can identify some (in)correct patches that could not be identified using either feature set alone.
From a quantitative point of view, Naïve Vector Concatenation helps to identify slightly more correct patches (among those that could not be identified by each feature set alone) than Deep Combination, while for newly identified incorrect patches the two methods perform equally. Nevertheless, overall, the Ensemble Learning method helps to identify more correct patches while Deep Combination helps to identify more incorrect patches.
4.5 [RQ-5: Explanation of Improvements of Combination]
Objective: The experimental results for the previous RQs show that ML classifiers built on learned embeddings, on engineered features, or on both, yield promising performance in predicting patch correctness. The fact remains, however, that the classifier is a black-box model for practitioners. In particular, when leveraging combined feature sets, it may be helpful to investigate the impact of different features on the identification of patch correctness. To that end, we build on explainable ML techniques to explore how the models reach their decisions. In this work, we focus on Shapley values, which quantify the contribution of each feature to a given prediction. Shapley values originate from the field of game theory and are implemented in the SHAP framework [40], which is widely used in the AI community.
Experimental Design: Our experiments focus on the classifier obtained with the Naïve Vector Concatenation method, since it managed to recall more correct patches by combining learned embeddings and engineered features (cf. RQ-4 in Section 4.4). We consider the case where the classifier is trained with the XGBoost learning algorithm. Using SHAP values as a metric of feature importance, we investigate the most important features contributing to the combined model's predictions. We further compare those important features against the most contributing features when the classifier is trained only with learned embeddings or only with engineered features. Finally, we present three specific patches that are identified by different feature sets to observe the contribution of the features to the predictions.
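The sketch below shows how such a SHAP analysis can be run for a tree-based classifier; the data is synthetic and the feature names (B-i dimensions plus two engineered features) are placeholders for the actual combined feature vector.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Synthetic stand-in for the concatenated feature vectors: 64 learned
# dimensions (B-0..B-63) followed by two engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 66))
y = rng.integers(0, 2, size=500)
feature_names = [f"B-{i}" for i in range(64)] + ["singleLine", "codeMove"]

clf = XGBClassifier(n_estimators=100).fit(X, y)

# TreeExplainer computes Shapley values for tree ensembles; the summary plot
# ranks features by their mean absolute contribution to the predictions.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, feature_names=feature_names, max_display=10)
```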
Results: Figure 14 illustrates the top-10 most contributing features: a feature named B-i refers to the i-th feature learned with BERT, while others (e.g., singleLine and codeMove) refer to engineered features. The appearance of features from both the learned and engineered feature sets among the most contributing features suggests that both types of features are not only relevant but also exploited by the yielded classifier.
Reading a SHAP explanation graph: In a given SHAP graph, each row shows the distribution of SHAP values for a given feature (Y-axis), where each data point corresponds to one input sample (i.e., a patch in our case). The color indicates the (normalized) feature value: the redder the point, the higher the value. The X-axis represents the SHAP values, which indicate to what extent a given feature impacted the model output for a given patch. For example, most patches with a high value (red) for the feature singleLine are located on the left (negative SHAP value), which suggests a negative impact of singleLine on correctness prediction. Note that, eventually, the contributions of the different features are aggregated to yield the final prediction for each sample.
In Figure 14, we note that singleLine and codeMove are the top contributing engineered features among the combined feature set. As shown in the figure, their red (high value) and blue (low value) points are clearly separated on the two sides, which shows that their values have clear positive or negative effects on the model output. In Figure 15, when leveraging only engineered features, singleLine and codeMove also contribute significantly, appearing in the 1st and 4th positions among the top contributing features. This indicates that engineered features must already be high contributors to the decision (e.g., in terms of information gain), as shown in Figure 15, in order to obtain an efficient combination with learned features. In practice, we therefore suggest that the research community focus on devising a few effective engineered features rather than many ineffective ones to improve model performance.
Overall, the SHAP explanations suggest that engineered features have an important effect on model prediction (since they appear among the top contributing features) but are complementary to the learned feature set. Indeed, the combination with Naïve Vector Concatenation enables classifiers to identify correct patches that could not be identified when each feature set was used without the other. We therefore conclude that it is the interaction among the features that yields such a performance improvement, and we propose to further investigate the interaction among pairs of features (one from the engineered feature set and the other from the learned feature set).
Figure 16 illustrates the interaction information provided by SHAP among singleLine, codeMove, and B-1530. As can be seen in Figure 16(a), when the feature value of singleLine is 0, higher (redder) feature values of B-1530 lead to a more negative SHAP value for singleLine (i.e., a negative impact on patch correctness prediction). In contrast, when the feature value of singleLine is 1, the same higher feature values of B-1530 tend to draw a positive SHAP value (i.e., a positive impact). This example illustrates how learned and engineered features can interact to balance their contributions to the final predictions based on their respective feature values. Figures 16(b) and (d) exhibit effective interactions, while Figure 16(c) does not because not enough of the test data reach both feature nodes in the tree-based boosting classifier. For the same reason, we cannot present the SHAP interaction between singleLine and codeMove. Overall, Figure 16 provides evidence of the impact of the interaction between learned and engineered features on the model prediction. In contrast, merging classifiers through Ensemble Learning does not allow for feature interaction and thus fails to identify patches that were not identified using one feature set. This motivates model trainers to combine different types of features through tree-based classifiers or DNNs to exploit deep interaction information for identifying previously unidentified correct patches.
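Pairwise interaction contributions such as those shown in Figure 16 can be extracted for tree-based classifiers as sketched below; the setup is synthetic and the feature indices merely stand in for pairs like (singleLine, B-1530).

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Synthetic stand-in for the combined feature space: 8 learned dimensions
# followed by two engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, size=300)
names = [f"B-{i}" for i in range(8)] + ["singleLine", "codeMove"]

clf = XGBClassifier(n_estimators=50).fit(X, y)
explainer = shap.TreeExplainer(clf)

# Entry [n, i, j] quantifies how features i and j jointly contribute to the
# prediction for sample n, beyond their individual SHAP values.
interactions = explainer.shap_interaction_values(X)
i, j = names.index("singleLine"), names.index("B-0")  # B-0 stands in for B-1530
print(np.abs(interactions[:, i, j]).mean())           # average interaction strength
```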
Finally, Figure 17 presents the SHAP analyses of three patches that are exclusively identified by classifiers built on the learned feature set (a), the engineered feature set (b), or the combined feature set (c). We note that the contribution of each learned feature is small, and it is the sum of contributions that leads to a prediction. In contrast, the contributions of several engineered features are significantly larger. When the sets are combined, engineered features appear at the top and their contributions are impactful, while the learned features each still contribute to a lesser extent. Overall, a few engineered features make most of the contributions to good predictions, which unsurprisingly implies that the quality and relevance of engineered features matter more than their number.