Our experimental evaluation is subject to several threats to validity, and the approach itself has limitations that future work can address.
6.1 Threats to Validity
Internal Validity—Dataset. Our dataset selection is limited in terms of size and diversity of functionalities. For the clone detection variant in particular, even though the number of clone pairs is rather high, the number of underlying code fragments is still very small, since the clone and non-clone pairs are formed from these base fragments by pairwise combination. This lack of diversity might negatively affect the generalizability of the evaluation results. Nevertheless, we mitigate this threat in the comparison experiment (RQ1) by using the same datasets as the state of the art (i.e., ASTNN).
External Validity—Dataset. Even though the BCB is widely used throughout the literature, the judgment of whether or not a pair of code fragments forms a clone remains biased and purely based on the benchmark authors' intuition. Further, there is no single or precise notion of what semantic similarity is. Thus, the semantic boundaries of the functionality classes might not be consistent across all the represented functionalities. Finally, the program semantics of a code fragment might be obscured by the usage of external libraries that are not included in the dataset, in which case the decision task is technically infeasible.
External Validity—Neural Network Architecture. In this article, we have worked with the ResNet neural network architecture (cf. Section 2.2.4). However, we have also conducted the experiments with the DenseNet [22] architecture. Although DenseNet is more sophisticated in its design, the results (Figure 13) show that it performs very similarly to ResNet50.
External Validity—Classification Algorithms. For the final learning and prediction stage of the code-clone detection task, we used a simple binary classification neural network as well as two classical classification algorithms among the several existing ones. The experimental results for this task may thus be influenced by the choice of classification algorithms. To reduce this potential bias, we selected simple and widely used algorithms (k-NN and SVM).
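As an illustration, a minimal k-NN clone classifier over pairs of extracted feature vectors could look as follows. This is only a sketch under assumed conventions: fragments are already embedded as fixed-length vectors, and the absolute-difference pair encoding and function names are illustrative choices, not the exact implementation used in WySiWiM.

```python
from math import dist

def pair_feature(v1, v2):
    """Encode a candidate clone pair as the element-wise absolute
    difference of the two fragments' feature vectors."""
    return [abs(a - b) for a, b in zip(v1, v2)]

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest labeled pairs.
    train: list of (pair_feature, label), label 1 = clone, 0 = non-clone."""
    neighbours = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0
```

The intuition is that clone pairs yield small element-wise differences between feature vectors, whereas non-clone pairs yield large ones.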
External Validity—Presence of Clone Duplicates in BCB. During development, we noticed the existence of conceptually duplicated clone pairs in the BCB. This showed up in the form of identical visual representations of code for different code fragment IDs. It turned out that those fragments stem from type-1 clone pairs, which are technically the same code. When each of those fragments is combined with another code fragment to form clone pairs, the resulting pairs are conceptually duplicated. Although we cannot provide precise statistics on the extent of clone duplicates in the BCB, we can approximate, based on the code fragments used in type-1 and type-4 clone pairs, an upper bound of approximately 30% clone duplicates. If we consider that type-2 and type-3 code snippets can also form clone duplicates, this estimate rises to as much as 60%. In our case, we experienced performance drops of about 10% on small development examples. Hence, we conclude that clone duplicates should not be disregarded if precise and valid evaluations are desired. This conclusion is consistent with recent empirical results reported by Allamanis on the adverse effects of code duplication in machine learning models of code [1].
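To give an idea of how such a duplicate rate can be approximated, the following sketch fingerprints each pair with a whitespace- and comment-insensitive hash, a crude stand-in for type-1 normalization. All helper names are hypothetical; the statistics above were derived from the BCB's own clone-type annotations, not from this procedure.

```python
import hashlib
import re

def normalize(code):
    """Crude type-1 normalization: strip comments, drop all whitespace."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
    code = re.sub(r"//[^\n]*", "", code)               # line comments
    return re.sub(r"\s+", "", code)

def duplicate_pair_ratio(pairs):
    """pairs: list of (code_a, code_b). Two pairs count as conceptual
    duplicates if their normalized fingerprints coincide, regardless
    of fragment order within the pair."""
    seen, dup = set(), 0
    for a, b in pairs:
        key = tuple(sorted(hashlib.sha1(normalize(c).encode()).hexdigest()
                           for c in (a, b)))
        if key in seen:
            dup += 1
        else:
            seen.add(key)
    return dup / len(pairs)
```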
To explore the impact of code duplicates on the performance of WySiWiM, we built a dataset (based on the same three functionalities and numbers of clone/non-clone pairs) in which we do not use any clone pairs containing code fragments that also appear in type-1 clone pairs. The results in Figure 14 show that avoiding clone duplicates slightly degrades the overall results. This makes sense, since the presence of clone duplicates makes the task easier and allows one to achieve a higher score. This finding holds both for SVM with a linear kernel (Figure 14(a)) and for k-NN with the ball tree algorithm (Figure 14(b)).
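The filtering step for building such a deduplicated dataset can be sketched as follows, assuming clone and non-clone pairs are given as tuples of BCB fragment IDs (the function name is illustrative):

```python
def drop_type1_overlap(pairs, type1_pairs):
    """Remove every clone/non-clone pair that reuses a code fragment
    (identified by its fragment ID) appearing in some type-1 pair."""
    type1_ids = {fid for pair in type1_pairs for fid in pair}
    return [(a, b) for a, b in pairs
            if a not in type1_ids and b not in type1_ids]
```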
Construct Validity—Dataset. A recurrent construct validity issue in the machine learning literature is class imbalance. In clone detection, one must ensure that all functionalities are balanced in the dataset of clone and non-clone pairs; otherwise, approaches may overfit to specific (and heavily represented) classes. To check for this issue, we built a balanced dataset (with and without duplicates) and compared the performance of WySiWiM clone detection on this dataset as well as on the imbalanced dataset provided in the ASTNN artifacts. Indeed, the ASTNN dataset is randomly sampled from the BCB (using a fixed random seed) and hence more or less retains the imbalance present in the BCB itself. The comparison results in Table 11 on the balanced and the imbalanced (i.e., ASTNN) datasets suggest that WySiWiM maintains its performance.
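Such a balanced dataset can be obtained, for instance, by downsampling every functionality class to the size of the smallest one. This is a sketch with hypothetical names; the fixed seed mirrors the reproducibility concern mentioned above.

```python
import random
from collections import defaultdict

def balance(labeled_pairs, seed=0):
    """labeled_pairs: list of (functionality_id, pair). Downsample every
    functionality class to the size of the smallest one."""
    by_class = defaultdict(list)
    for func, pair in labeled_pairs:
        by_class[func].append(pair)
    n = min(len(v) for v in by_class.values())
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    return {func: rng.sample(v, n) for func, v in by_class.items()}
```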
Construct Validity—Geometric Visual Representation. Due to the lack of sufficiently many distinct geometric forms to map the entire vocabulary of terms in our datasets, we only replace language keywords with geometric forms; other terms are left as is. This mix of words and geometric forms may, however, degrade the learning performance.
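At the token level, this partial substitution can be sketched as follows. The keyword-to-shape mapping is purely illustrative (the actual implementation renders the chosen forms into the image):

```python
# Hypothetical mapping: only language keywords receive a geometric form;
# every other token is kept verbatim, as in the paper's variant.
KEYWORD_SHAPES = {"if": "triangle", "for": "circle",
                  "while": "square", "return": "diamond"}

def to_visual_tokens(tokens):
    """Replace keyword tokens by shape names; leave other tokens as is."""
    return [KEYWORD_SHAPES.get(t, t) for t in tokens]
```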
Construct Validity—Cross validation. We did not perform any cross validation, as our goal was to convey the concepts behind the approach rather than to achieve the highest possible results. It is probable that the exact results vary to a certain extent across different splits of the dataset, especially since the code fragments are probably not "semantically equally diverse" with respect to each other, without further specifying what that could mean.
Conclusion Validity—Lack of definitions of semantic similarity. The software engineering community faces a crucial challenge in defining what semantic similarity means. Since we do not attempt to define it explicitly, we have to rely on the semantic value embedded in our dataset as implied by the creators of the BCB. Consequently, selecting a specific subset of the dataset may even influence the overall semantics it carries. Moreover, even when two approaches are applied to the same dataset, they might still view semantic similarity differently. These facts make it hard to evaluate, and especially to compare, semantic approaches of any sort.
6.2 Limitations
Input size of ResNets. Image classification networks have, by construction, a strong limitation on their input size. This is problematic, as it introduces loss and distortion of our input data. In consequence, we may completely lose the fine-grained lexical information contained in the visual representations of our code fragments.
Code fragment granularity. The approach as presented is mainly designed to work with method-granularity code fragments. Image classification networks are designed to assign a single most suitable label to a whole single input. This is consistent with generally accepted good coding style rules, which state that a single method should always implement a single functionality (known as the single responsibility principle) [42]. To enlarge the scope of the granularity, our core concept of code visualization could be lifted to full programs by applying object localization techniques instead, to detect which functionalities a program is composed of. This could also explain why our approach works slightly less well on the OJ dataset, which consists of whole programs, whereas the BCB provides method-level granularity.
Colors in visualizations. Our visualizations apply colors only sparsely (in the syntax highlighting variant) or not at all (in the other visual representations). The current implementation of WySiWiM thus does not fully leverage the potential of ResNet, which is designed to operate on all three color channels.
Traditional classification algorithms. For the clone detection task, we apply very basic binary classification algorithms. These algorithms probably do not exploit all the semantics learned by the ResNet deep feature extractor.
Scope of the clone datasets. The datasets are not only a threat to validity but also a major limitation. Our hypothesis is that, due to the limited variety and size of the datasets available today, it is not possible yet to learn general semantic knowledge that can be applied to all possible data.
6.3 Lessons Learned and Future Work
As the current implementation of WySiWiM represents only a proof of concept with limited goals, it offers a lot of potential for extensions and improvements. Furthermore, the general concept of visualizing code and learning on those visual representations could also be interesting for other software engineering tasks, or could be combined with existing approaches. Beyond our approach, we identified some general limitations of the task of semantic code-clone detection, such as the lack of suitable datasets and benchmarks, but also the lack of more precise and actionable definitions of semantics or semantic similarity.
Visual vs. semantics. It is intuitively acceptable that the visual representation of code works well for tasks such as code classification. However, our experiments also show that WySiWiM neural networks yield features that help to identify semantically equivalent code fragments that are actually visually different (type-4 clones). This shows the power of the generic features extracted from raw (straightforward) visual representations, which should be further investigated in other tasks that deal with semantics.
Mitigating the image classifier input limitation. As mentioned in the previous subsection, a major limitation of our approach is the fixed input image size of the ResNet classifier. One way to mitigate this limitation could be to slice the image into multiple images of the required input size. Those slices could then be used to generate a larger feature vector representing the whole image, which would also allow one to capture more fine-grained information. Of course, it might be necessary to additionally apply scaled versions of the images to capture large-scale structural information.
Visualizations. As our visualizations showed, the use of colors can have a positive effect on the results. However, as our condensed AST visualization was among those yielding the best overall results, it might be interesting to apply color coding to the ASTs as well, to make better use of the full potential of image classification neural networks.
Datasets and benchmarks. A future work that is important beyond our approach is the development of datasets and benchmarks that are more suitable for semantic code-clone detection and for semantic approaches in general. These should include a high number of different functionalities and a high number of diverse code examples per functionality. Especially datasets providing a multitude of basic functionalities that do not depend on external libraries would be desirable. They would allow one to train models the way humans learn the semantics of computing languages: by starting very small.
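The image-slicing idea mentioned above for mitigating the classifier's input-size limitation can be sketched as follows, in a simplified grayscale form with zero padding at the borders. In practice, each tile would be fed to the feature extractor and the resulting features concatenated into one larger vector.

```python
def slice_image(pixels, tile_h, tile_w):
    """pixels: 2-D list (rows of grayscale values). Cut the image into
    fixed-size tiles, zero-padding the borders, so that each tile
    matches the classifier's required input size."""
    h, w = len(pixels), len(pixels[0])
    tiles = []
    for top in range(0, h, tile_h):
        for left in range(0, w, tile_w):
            tile = [[pixels[r][c] if r < h and c < w else 0
                     for c in range(left, left + tile_w)]
                    for r in range(top, top + tile_h)]
            tiles.append(tile)
    return tiles
```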
Data augmentation. Similarly to data augmentation in image classification, where variant images are generated through rotation, cropping, and so on, we could envision applying data augmentation at the source-code level, for example by mutating the code in semantically equivalent ways to increase the size of our dataset.
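A simple instance of such a semantics-preserving mutation is consistent variable renaming. This is only a sketch: a realistic implementation would operate on the parse tree rather than on raw text, to avoid renaming occurrences inside string literals or comments.

```python
import re

def rename_variable(code, old, new):
    """Semantics-preserving mutation: consistently rename one variable.
    Word boundaries avoid touching identifiers that merely contain
    `old` as a substring."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def augment(code, renames):
    """Generate one variant per rename, mimicking rotation/cropping-style
    augmentation at the source-code level."""
    return [rename_variable(code, o, n) for o, n in renames]
```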
Actionable definitions of semantics (similarity). Another important avenue for future work would be efforts toward actionable definitions of semantics, or semantic similarity. A possible approach could be to define semantics through software tests. As software tests represent an executable variant of software specifications, they give a good notion of the requirements we put on our semantics. Of course, this approach has a few problematic aspects. One is that each application may require different abstractions of a certain functionality. Another is that the code snippets for a certain functionality would all have to be accompanied by test suites.
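A test-based notion of semantic similarity could be operationalized roughly as follows. This is a sketch under simplifying assumptions: both candidate implementations accept the same single argument, and agreement on raised exception types is one of several possible design choices.

```python
def semantically_similar(f, g, inputs):
    """Treat a shared input set as an executable specification: two
    implementations are deemed semantically similar if they agree
    (including on the type of raised exceptions) on every probe input."""
    def observe(func, x):
        try:
            return ("ok", func(x))
        except Exception as e:
            return ("err", type(e).__name__)
    return all(observe(f, x) == observe(g, x) for x in inputs)
```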