1 Introduction
Code embedding refers to the process of transforming program elements into continuous vectors [5, 24, 59]. This transformation is important for deep learning, as the subsequent model training and inference are performed on the embedding vectors [10, 17, 40, 52, 54]. Despite much progress in this area [5, 18, 20, 24, 29, 36, 48, 59], the effectiveness and advantages of different embedding designs remain unclear. A side-by-side comparison would help one better design neural network-based methodologies and harness their power for embedding-based applications.
Our work uncovers the impacts of multiple embedding design choices on the API completion task, a foundational problem in AI-based software engineering, through comprehensive comparative experiments. API completion aims to predict the next API method given the previous code sequence. It is a basic building block for many software engineering tasks, including code repair and code generation. In our experiments, we choose a specific application scenario: cryptographic API completion. Cryptographic APIs are widely known to be error-prone [1, 30, 42, 55, 57]. Misuses, such as predictable random numbers and insecure hash algorithms, severely threaten software security. Thus, this task is more challenging and not well handled by existing solutions because, beyond correctness, security is also required. By experimenting on these challenging APIs, we observe and report the accuracy impacts of different embedding choices.
There are usually three key steps in training code embedding vectors. First, programs are preprocessed into certain representations (e.g., bytecode, control flow graphs) that contain meaningful features. This is usually achieved by program analysis techniques. Based on the preprocessed representations, a basic embedding training vectorizes every single token by gathering its context information across the entire corpus, which is referred to as token-level embedding. Beyond embedding a single token, an extra step can be conducted to produce embedding vectors for a given sequence, which is called sequence-level embedding in our article. It requires an extra sequence model pretraining step compared with the basic token-level embedding. Therefore, we identify design choices in three main aspects to compare: (i) program analysis preprocessing, (ii) token-level embedding, and (iii) sequence-level embedding, as shown in Table 1. Such a comparison is missing in the literature and needs to be performed systematically.
Our first comparison group focuses on the impacts of program analysis preprocessing. Program analysis is often used to process programs before embedding [4, 9, 23, 58]. This preprocessing is important, as it decides what information is used for embedding training. For example, Henkel et al. [24] extract symbolic traces for embedding, while state-of-the-art code embeddings (e.g., GraphCodeBERT [22], inst2vec [7]) leverage data flows from graph representations to embed program elements. In our work, we compare three program representations obtained with different program analysis strategies for embedding: bytecode, program slices, and API dependence paths. We explain why these three representations are selected in Section 3.1.
Our second comparison group examines the impacts of token-level embedding. We make comparisons between token-level embedding and the one-hot encoding baseline. One-hot encoding is a basic vectorization approach that indexes N tokens and represents the ith token by an N-dimensional vector with a single 1 at the ith dimension and 0s elsewhere. In contrast, token-level embedding, such as word2vec [31, 32, 33], is expected to result in low-dimensional, semantics-aware vectors that benefit the downstream task training. Through this experimental comparison, we observe how much accuracy improvement token-level embedding can gain.
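The contrast between the two vectorizations can be sketched in a few lines. This is an illustrative example only: the vocabulary and the 8-dimensional random table are stand-ins (our actual embeddings are 300-dimensional and learned with skip-gram training).

```python
import random

# Hypothetical 5-token vocabulary for illustration only.
vocab = ["Cipher.getInstance", "SecretKeySpec.<init>", "Cipher.init",
         "Cipher.doFinal", "AES/GCM/NoPadding"]
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """N-dimensional vector with a single 1 at the token's index, 0s elsewhere."""
    vec = [0.0] * len(vocab)
    vec[index[token]] = 1.0
    return vec

# A token-level embedding replaces the sparse N-dimensional one-hot vector
# with a dense, low-dimensional vector from a learned lookup table (here the
# entries are random stand-ins, 8-dimensional instead of the paper's 300).
rng = random.Random(0)
embedding_table = {tok: [rng.gauss(0, 1) for _ in range(8)] for tok in vocab}

def embed(token):
    return embedding_table[token]

print(one_hot("Cipher.init"))     # [0.0, 0.0, 1.0, 0.0, 0.0]
print(len(embed("Cipher.init")))  # 8
```

The one-hot dimension grows with the vocabulary size, while the embedding dimension stays fixed and encodes token similarity after training.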
Our third comparison group studies the impacts of sequence-level embedding (also called contextualized embedding). We make comparisons between sequence-level embeddings and token-level embeddings. Compared with token-level embedding, sequence-level embedding is more advanced, because the polysemy issue is handled by assigning different vectors to different occurrences of a token. However, achieving this requires an extra, expensive sequence language model and pretraining process. For example, the state-of-the-art natural language sequence-level embedding BERT [17] is obtained by pretraining the Transformer [51] neural network. Our experimental comparisons aim to quantify the advantage of sequence-level embedding over token-level embedding. Figure 2 summarizes the workflow of how we generate the one-hot vectors, token-level embeddings, and sequence-level embeddings.
To evaluate embeddings with different design choices, we perform API completion tasks on our Java cryptographic API benchmark. Our benchmark is composed of Java cryptographic code collected from 79,887 Android apps. To ensure verifiability and reproducibility, our Java cryptographic API benchmark is publicly available on GitHub.
Next, we explain our research questions along with the comparative experiments designed to answer them.
RQ1: What are the accuracy impacts of token-level embeddings obtained from bytecode, slices, and API dependence paths in cryptographic API completion? To answer this question, we pretrain three token-level embeddings, byte2vec, slice2vec, and dep2vec, on bytecode, slices, and API dependence paths, respectively. Bytecode, program slices, and API dependence paths are the outcomes of different program analysis preprocessing. The obtained embeddings are compared with the basic setting, one-hot encoding, with the corresponding program analysis preprocessing.
RQ2: What are the accuracy impacts of sequence-level embeddings obtained from bytecode, slices, and API dependence paths in cryptographic API completion? To answer this question, we pretrain three sequence-level embeddings, byteBERT, sliceBERT, and depBERT, on bytecode, slices, and API dependence paths, respectively. They are fine-tuned for cryptographic API completion and compared with an identical Transformer neural network without the pretraining knowledge.
RQ3: Are our embeddings effective for cryptographic API completion on new apps? To answer this, we perform the experiments not only under the basic within-app setting, but also under the cross-app setting. In the within-app setting, sequences are extracted from Android apps and randomly split for training and testing. In the cross-app setting, new Android apps are used to test the model.
RQ4: How well do state-of-the-art general-purpose code embeddings work for cryptographic API completion? Besides the program analysis and embedding choices covered in Table 1, we further evaluate two state-of-the-art code embeddings, GraphCodeBERT [22] and CodeBERT [20], for cryptographic API completion. They are general-purpose source code embedding models pretrained by Microsoft on six programming languages paired with natural language. We fine-tune the two pretrained models for our API completion task to form an end-to-end comparison.
Our major findings include:
— Our findings show that program analysis preprocessing plays a significant role in cryptographic API embedding and completion. For both token-level and sequence-level embedding, API dependence paths produce higher prediction accuracy than slices and bytecode. With program analysis, the token-level embedding dep2vec achieves an accuracy 36% higher than byte2vec, and the sequence-level embedding depBERT achieves an accuracy 45.86% higher than byteBERT, which lacks program analysis preprocessing.
— Our findings show that applying embeddings with program analysis significantly improves task accuracy compared with the one-hot baseline (no embedding). On dependence paths, the token-level embedding dep2vec and the sequence-level embedding depBERT outperform the one-hot encoding baseline by accuracy boosts of 6% and 7%, respectively; sequence-level embedding is only slightly (0.55%) better than token-level embedding in our experiments. Considering the expensive cost of sequence-level embedding, token-level embedding is more desirable.
— Our findings show that the improvements derived from program analysis and embedding carry over to cryptographic API completion on new apps. In the cross-app learning scenario, the program analysis guided embeddings depBERT and dep2vec still achieve good accuracy, at 95.75% and 93.58%, respectively. Another observation is that the advantage of depBERT over dep2vec is slightly more pronounced, with a 2.17% accuracy boost compared with 0.55% in the basic setting. The sequence-level embedding depBERT is most recommended in data-scarce situations, as the largest improvement of depBERT over dep2vec (5.10%) is observed on the smallest task dataset, with 26,357 dependence paths.
— The state-of-the-art general-purpose source code embedding solutions GraphCodeBERT and CodeBERT are insufficient for our cryptographic API completion tasks, with a low accuracy of 59.94%. Experiments still show the advantage of applying program analysis preprocessing in their embedding solutions: GraphCodeBERT substantially outperforms its non-program-analysis counterpart CodeBERT by an accuracy boost of 20.07%, on average. The experiments also suggest that method-level context is preferable to class-level context for cryptographic API completion.
Significance of research contributions. Our work provides the first quantitative and systematic comparison of the prediction accuracy of multiple API embedding approaches for neural network-based code completion. Our rigorous experiments provide new empirical results that have not been previously reported, including how various domain-specific program analyses improve data-driven predictions. These quantitative findings help guide the design of more powerful and accurate code completion solutions, leading to higher-quality, less vulnerable software projects in practice. As cryptographic API completion is more difficult and requires a deeper understanding of the code context, we expect our observations to be valid and useful for general code completion tasks as well. We leave the general evaluation as future work. We also publish our new cryptographic API benchmark along with our deep learning models to help future research.
3 Our Measurement Setting
We perform comparative experiments to answer our research questions. As shown in Table 1, we compare different design choices of program analysis preprocessing, token-level embedding, and sequence-level embedding.
3.1 Program Analysis Preprocessing Strategies
We examine the impacts of using program analysis to guide the embedding. There are countless possible program analysis strategies for extracting different program sequences. Specifically, we compare three types of program sequences: (i) bytecode, (ii) program slices, and (iii) API dependence paths. The bytecode is taken from Android apps without program analysis. The program slices are obtained by conducting interprocedural backward slicing on the bytecode. The API dependence paths are extracted from API dependence graphs that we construct on program slices using the dataflow dependences between API calls. We select these three because they embody increasing levels of program analysis guidance.
Bytecode sequences. We extract the API sequences directly from the Android bytecode. For each method implementation, we extract the API methods and constants used in it into one sequence. There is no ordering between sequences collected from different method implementations. Based on our observation, the order of the API methods and constants in these sequences is close to their order in the source code. We cover the bytecode option because it reflects the effect of embedding without program analysis guidance.
Program slices. We apply a program analysis strategy, interprocedural backward slicing, to obtain program slices. The slicing starts from the variables used with a cryptographic API invocation. By backwardly tracing the data flows reaching these variables, all the code statements influencing the API invocation are kept, while irrelevant code statements are excluded. When reaching the entry point of the current method, we jump to its callers to continue the backward tracing until the tracked data facts are empty or no caller is found. In this way, the influencing code context beyond a local method is also collected. When meeting a call to a self-defined method (i.e., a method written by the developer rather than provided by Java libraries), we replace it with its implementation code if available. An example of a program slice is shown in Figure 1(b). A major difference between program slices and bytecode is that irrelevant predecessors are removed by program analysis.
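The core of backward slicing can be sketched with a toy, intraprocedural worklist over (defined variable, used variables) statements. All statement names below are illustrative; our actual analysis is interprocedural, context- and field-sensitive, and operates on Jimple IR via Soot.

```python
# Toy statements: (defined_variable, used_variables). Names are hypothetical.
stmts = [
    ("key",    {"pwd"}),          # influences the crypto call via "key"
    ("noise",  set()),            # irrelevant to the API invocation
    ("iv",     {"rand"}),         # influences the crypto call via "iv"
    ("cipher", {"key", "iv"}),    # the cryptographic API invocation (seed)
]

def backward_slice(stmts, seed_index):
    """Keep statements whose definitions flow into the tracked variables,
    tracing the data flows backward from the seed statement."""
    tracked = set(stmts[seed_index][1])   # variables used at the API call
    kept = [seed_index]
    for i in range(seed_index - 1, -1, -1):
        defined, used = stmts[i]
        if defined in tracked:
            kept.append(i)
            tracked.discard(defined)
            tracked |= used               # continue tracing reaching flows
    return sorted(kept)

print(backward_slice(stmts, 3))  # [0, 2, 3] -- "noise" is sliced away
```

The interprocedural version additionally jumps to callers when the tracked facts reach a method entry, as described above.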
API dependence paths. With program analysis, code semantic information, such as program dependencies, can be extracted. We perform API dependence graph construction and extract the API dependence paths for embedding. The API dependence graphs are built through dataflow analysis: we add data dependence edges between API calls on slices. An example of our API dependence graph is shown in Figure 1(c). Each node is an API or a constant; two nodes with a data dependence (def-use) relationship are connected directly. The API dependence paths are covered in our measurement as a representative of the state-of-the-art code semantics-based approaches [7, 22].
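The construction of def-use edges and the extraction of dependence paths can be sketched as follows. The API calls and their def-use relations here are a hypothetical toy slice, not output of our actual Soot-based analysis.

```python
from collections import defaultdict

# Toy slice: each API call defines one value and may use earlier ones.
calls = [
    ("SecureRandom.nextBytes", "seed", set()),
    ("SecretKeySpec.<init>",   "key",  {"seed"}),
    ("Cipher.getInstance",     "c",    set()),
    ("Cipher.init",            "c2",   {"c", "key"}),
    ("Cipher.doFinal",         "out",  {"c2"}),
]

# Connect two nodes directly when they have a def-use relationship.
edges = defaultdict(list)
last_def = {}
for i, (api, defined, used) in enumerate(calls):
    for var in used:
        edges[last_def[var]].append(i)
    last_def[defined] = i

def paths_from(i):
    """Enumerate API dependence paths by walking the def-use edges."""
    if not edges[i]:
        return [[calls[i][0]]]
    return [[calls[i][0]] + rest for j in edges[i] for rest in paths_from(j)]

print(paths_from(0))
# one path: SecureRandom.nextBytes -> SecretKeySpec.<init> -> Cipher.init -> Cipher.doFinal
```

Branches in the graph yield multiple paths from one node, which is why a prediction can have several correct answers on dependence paths.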
Experimental setup of program analysis preprocessing. We implement an interprocedural, context- and field-sensitive dataflow analysis to perform our backward slicing and API dependence graph construction. The analysis is implemented with the Java program analysis framework Soot [50]. Soot takes Android bytecode as input and transforms it into an intermediate representation (IR) called Jimple. The program analysis (i.e., slicing or API dependence graph construction) is performed on the Jimple IR. We use Soot 2.5.0, Java 8, and Android SDK 26.1.1.
3.2 Token-level Embedding Settings
We perform token-level embedding training to produce vectors for the tokens in an embedding vocabulary, as illustrated in Figure 2(b).
Cryptographic code identification. All the embeddings are produced from the cryptographic code corpus we extract from decompiled Android APKs. We refer to code implemented with cryptographic API calls as cryptographic code. To identify cryptographic code in an Android app, we first search for all cryptographic API callsites within the codebase. All the method signatures in the Java packages java.security and javax.crypto (see Table 2) are included in our search list. Then, we start from these cryptographic API callsites to find other standard API calls that happen before a cryptographic API callsite as its context. However, the accuracy and scope of this context depend on the preprocessing. In bytecode sequences, we can only extract the previous API calls within the same method as a cryptographic callsite as its context. When program analysis techniques are applied, we are able to generate more meaningful API call context based on program dependencies. In our experiments, a cryptographic API callsite and its program-wide dependency code are extracted as an interprocedural (cross-method) program slice. The entire slice is regarded as cryptographic code, and all the previous API calls within this slice are gathered as the context of a cryptographic API call.
Embedding vocabulary. The embedding vocabulary is collected during cryptographic code identification. The vocabulary initially includes the standard Java cryptographic APIs. Then, we scan the app and perform interprocedural backward slicing with the detected cryptographic API callsites as entry points. In this way, the vocabulary expands with all the API calls and constants encountered during backward program slicing. When an API call is encountered, we first check whether it is a self-defined method. If it is, then the analysis jumps into the implementation of this method according to the interprocedural slicing algorithm. Otherwise, the API method is collected as an element of our vocabulary. For the collected API methods, we further filter out those that appear fewer than five times. For constants, we manually identified 104 reserved string constants used as arguments of cryptographic APIs. Other constants that appear more than 100 times in the slices are also kept in the embedding vocabulary. Finally, we have a vocabulary of 4,543 tokens (3,739 APIs and 804 constants). The API methods include standard APIs from the Java and Android platforms, as well as some third-party APIs that cannot be inlined because of recursion or phantom methods (whose bodies are inaccessible during the analysis). Table 2 shows the library distribution of these API methods.
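The frequency-based filtering above can be sketched in a few lines. The token names and counts are toy data; only the thresholds (APIs seen fewer than five times dropped, constants kept if reserved or seen more than 100 times) come from the text.

```python
from collections import Counter

def build_vocab(api_tokens, constant_tokens, reserved_constants):
    """Frequency-filtered vocabulary following the thresholds in the text:
    APIs seen fewer than five times are dropped; constants are kept if they
    are reserved cryptographic strings or appear more than 100 times."""
    apis = {a for a, n in Counter(api_tokens).items() if n >= 5}
    consts = set(reserved_constants)
    consts |= {c for c, n in Counter(constant_tokens).items() if n > 100}
    return apis | consts

# Toy corpus: "Mac.doFinal" appears only twice, so it is filtered out.
vocab = build_vocab(
    ["Cipher.init"] * 6 + ["Mac.doFinal"] * 2,
    ["UTF-8"] * 150 + ["tmp"] * 3,
    reserved_constants={"AES/GCM/NoPadding"},
)
print(sorted(vocab))  # ['AES/GCM/NoPadding', 'Cipher.init', 'UTF-8']
```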
We train the skip-gram embedding model [32] to obtain word2vec-like embeddings. With different program analysis preprocessing, three types of token-level embeddings are produced: byte2vec, slice2vec, and dep2vec.
— byte2vec is the baseline embedding version that applies word2vec [31, 32] directly on the bytecode corpus.
— slice2vec is the embedding with interprocedural backward slicing as the preprocessing method.
— dep2vec applies API dependence graph construction to guide the embedding training.
Experimental setup for token-level embeddings. We follow the conventions of the natural language embedding word2vec to set hyperparameters. The embedding vector length is 300. The sliding window size for neighbors is 5. We also apply subsampling and negative sampling, randomly selecting 100 false labels to update in each batch. Based on our preliminary experiments, we train embeddings with a mini-batch size of 1,024. The embedding training terminates after 10 epochs, because we did not observe significant improvement from longer training or smaller batch sizes. Our embedding model is implemented using TensorFlow 1.15. Training runs on Microsoft AzureML GPU clusters, which support distributed training with multiple workers. We use a cluster with 8 worker nodes; the VM size for each node is the (default) Standard NC6.
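Skip-gram gathers each token's context by pairing it with the neighbors inside the sliding window. A minimal sketch of the pair generation with our window size of 5 (the API sequence is illustrative; the actual model then trains the embedding matrix on such pairs with negative sampling):

```python
def skipgram_pairs(sequence, window=5):
    """(center, context) training pairs: every neighbor within the sliding
    window of size 5 becomes a positive example for the center token."""
    pairs = []
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

seq = ["Cipher.getInstance", "Cipher.init", "Cipher.doFinal"]
print(len(skipgram_pairs(seq)))  # 6 pairs for this 3-token sequence
```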
3.3 Sequence-level Embedding Settings
We obtain sequence-level embeddings by applying the training method of the well-known natural language embedding BERT [17] to program sequences, as shown in Figure 2(c).
byteBERT vs. sliceBERT vs. depBERT. On bytecode, program slices, and API dependence paths, we obtain three versions of BERT-style embeddings for API elements: byteBERT, sliceBERT, and depBERT. These BERT-like API embeddings are produced by pretrained Transformer neural networks. We apply the masked language modeling (MLM) task to pretrain them. MLM is a task that reconstructs language sequences with masked tokens: it predicts the missing tokens of a given sequence with random masks. A masked token in the input sequence is either replaced by the special token [MASK], replaced by an arbitrary random token from the vocabulary, or kept as-is. Following the convention in NLP, we set the probabilities of these three situations to 80%, 10%, and 10%. The masked tokens are randomly selected with a probability of 30%, and one sequence is limited to at most two masked tokens. These settings are similar to those of the MLM used to train BERT [17]. We discard the next sentence prediction (NSP) task of BERT, as there is no corresponding concept of a “next sentence” between two code sequences. The three types of sequence-level embeddings are trained with identical hyperparameters. As with the LSTM training with token-level embeddings, the neural network is trained for 10 epochs with a batch size of 1,024. When training the Transformer model, the input tokens are represented by our token-level embeddings. To apply these sequence-level embeddings to cryptographic API completion, the pretrained neural networks are fine-tuned with the task-specific data.
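The masking scheme just described can be sketched as follows. This is an illustrative stand-in for the corruption step only, not our training code; the function and parameter names are our own.

```python
import random

def mask_for_mlm(tokens, vocab, rng, p_select=0.30, max_masked=2):
    """MLM corruption as described: tokens are selected with 30% probability
    (at most two per sequence); a selected token is replaced by [MASK] (80%),
    by a random vocabulary token (10%), or kept unchanged (10%)."""
    corrupted = list(tokens)
    labels = {}                       # position -> original token to predict
    for i, tok in enumerate(tokens):
        if len(labels) >= max_masked:
            break
        if rng.random() < p_select:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, labels
```

The pretraining loss is then computed only at the positions recorded in `labels`.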
Training parameters. Due to resource constraints, we cannot exhaustively try every possible parameter combination (grid search). Instead, we select the optimal parameters with preliminary experiments. We considered different numbers of epochs (up to 20), batch sizes (512, 1,024), and learning rates (0.1, 0.01, 0.001). Choosing different learning rates changes the accuracy by no more than 0.02. After training for 10 epochs, the accuracy increment is less than 0.001 per extra epoch. Therefore, we chose the final set of parameters (10 epochs, 0.001 learning rate, and 1,024 batch size) to balance computational resources and model performance.
For the pretraining of our byteBERT, sliceBERT, and depBERT models, we choose the masking strategy ratios (80%, 10%, and 10%) following the original BERT paper. We choose the mask ratio of 30% because some sequences are short and we want each sample to have one or two masked tokens. As these parameters were optimized by the authors of BERT, they are also the most commonly used settings in the NLP field.
Differential evolution (DE) is not common practice for choosing hyperparameters of deep learning models; it is more frequently used for tuning kernel parameters of SVMs. Applying DE on top of LSTM models and evaluating its impact on model performance could form an interesting research topic in itself. We leave this extension as a future research direction.
3.4 Dataset Overview
We conduct experiments on Android apps collected from the Google Play store. We choose the Android platform because of its widespread use and popularity among users. We collect apps from various categories to ensure the dataset reflects diverse usage of Java cryptographic APIs in practice. According to the way we split data for training and testing, we have a basic dataset and an advanced cross-app data setup. Table 3 gives an overview.
3.4.1 Basic Data Split Setting.
The basic dataset is composed of 16,048 Android apps from three categories: 5,176 apps from the business category, 4,581 apps from the communication category, and 6,291 apps from the finance category. From these apps, we extracted 707,775 API sequences from bytecode, 926,781 API sequences from program slices, and 566,279 API sequences from API dependence graphs. The number of tokens in the three types of sequences is shown in Table 4. Tokens refer to the APIs or constants in our embedding vocabulary.
For embedding, we use all of the API sequences to produce the token-level embeddings and sequence-level embeddings. For API completion tasks, we randomly split all the sequences for training and testing following the ratio of 4:1.
3.4.2 Advanced Data Split Setting.
We create an advanced dataset to enable cross-app learning and validate our findings on new apps. Under this setup, the collected apps are split separately for embedding and for the API completion tasks. This guarantees that the apps used for the API completion tasks are not seen in the embedding training phase. Our embedding experiments are conducted on 64,478 apps (app set 2), far more than app sets 3, 4, 5, and 6, which we use for the API completion tasks. This mirrors the real world, where embeddings are often pretrained on huge data volumes and released for fine-tuning with smaller task-specific datasets. Then, the apps for the API completion tasks are split into training and testing sets, guaranteeing that the apps used for testing are never seen during training. Compared with the basic dataset, the cross-app setting is more practical and challenging: it evaluates whether a model trained on one set of apps can be applied to new apps.
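The difference between the two splits can be sketched as follows. App names and sequences are toy placeholders; the point is that the cross-app split holds out entire apps, while the basic split shuffles sequences regardless of their origin.

```python
import random

def within_app_split(sequences, test_ratio=0.2, seed=0):
    """Basic setting: sequences from all apps are shuffled together and
    split 4:1 for training and testing."""
    seqs = list(sequences)
    random.Random(seed).shuffle(seqs)
    cut = int(len(seqs) * test_ratio)
    return seqs[cut:], seqs[:cut]                  # train, test

def cross_app_split(sequences_by_app, test_apps):
    """Cross-app setting: whole apps are held out, so every test sequence
    comes from an app never seen during training."""
    train, test = [], []
    for app, seqs in sequences_by_app.items():
        (test if app in test_apps else train).extend(seqs)
    return train, test

data = {"app1": ["s1", "s2"], "app2": ["s3"], "app3": ["s4", "s5"]}
train, test = cross_app_split(data, test_apps={"app3"})
print(train, test)  # ['s1', 's2', 's3'] ['s4', 's5']
```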
In addition, to observe the impact of task data volume, we perform API completion training and testing on four app sets (app sets 3, 4, 5, and 6) of varying sizes. The largest one, app set 3, is a diverse set of 11,997 apps from 12 app categories. The three smaller sets (app sets 4, 5, and 6) consist of 1,819 apps from the personalization category, 1,055 apps from the social category, and 538 apps from the weather category, respectively.
Data duplication. For both the basic and advanced datasets, we deduplicate the data at the class-file level to guarantee that reused class files (e.g., libraries) appear only once when extracting the bytecode sequences. However, we did not deduplicate the program slices and API dependence paths extracted by program analysis. The presence of duplicate slices or paths in the training set suggests common coding patterns, and the frequency of API occurrences helps the embedding model learn their relationships. Different source code can follow similar cryptographic usage patterns in some cases, as many security principles do not change across scenarios. Letting the model directly learn the processed, highly frequent sequences significantly reduces the expensive data size and training resource requirements. Moreover, since the apps we collected are all real-world apps, this duplication should also hold for apps in the wild and will not affect performance after deployment. In the cross-app experiments, our goal is to show that the model does not make predictions merely because of duplicated code sequences from the same app. If a usage pattern is general across multiple apps, then it is reasonable to keep its duplicated occurrences in our dataset [2].
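Class-file-level deduplication can be sketched as content hashing over class bodies. The file names and bytes are hypothetical; any content-identity check would do.

```python
import hashlib

def dedup_class_files(class_files):
    """Class-file-level deduplication: a reused class body (e.g., a bundled
    library) contributes bytecode sequences only once. `class_files` is a
    list of (name, bytecode_bytes) pairs."""
    seen, unique = set(), []
    for name, body in class_files:
        digest = hashlib.sha256(body).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, body))
    return unique

files = [("a/Util.class", b"\xca\xfe\x01"),
         ("b/Util.class", b"\xca\xfe\x01"),   # same library class, repackaged
         ("a/Main.class", b"\xca\xfe\x02")]
print([n for n, _ in dedup_class_files(files)])  # ['a/Util.class', 'a/Main.class']
```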
4 Evaluation Results
In this section, we report the accuracy of cryptographic API completion to compare the impacts of different embedding choices and answer our research questions (RQs). In the evaluation, we calculate top-1 accuracy, which only considers the correctness of the model's top-1 prediction. It is calculated as the number of correct top-1 predictions over the total number of predictions. A top-1 prediction is considered correct if it matches the ground truth from the sequence itself. For API dependence paths from a graph, there might be multiple correct answers due to branches in the graph.
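The metric can be sketched directly from its definition. Modeling each ground truth as a set of acceptable answers accommodates the branch case on dependence paths; the API names are illustrative.

```python
def top1_accuracy(predictions, acceptable):
    """Correct top-1 predictions over total predictions. For API dependence
    paths, graph branches can make several next-APIs correct, so each ground
    truth is modeled as a set of acceptable answers."""
    correct = sum(1 for pred, truth in zip(predictions, acceptable)
                  if pred in truth)
    return correct / len(predictions)

acc = top1_accuracy(
    ["Cipher.init", "Cipher.doFinal"],
    [{"Cipher.init"}, {"Mac.doFinal", "Mac.update"}],
)
print(acc)  # 0.5
```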
4.1 Performance Improvement from Token-level Embedding (RQ1)
The impact of applying token-level embedding (RQ1) is measured by comparing it with one-hot encoding on bytecode, slices, and dependence paths, respectively. The accuracies of the API completion tasks are shown in Tables 5 and 6, and a comparison is shown in Figure 3. We visualize the accuracy differences brought by the various design choices, namely, applying token-level embedding, applying program analysis preprocessing (i.e., program slicing and API dependence graph construction), and increasing the model size, in Figure 8 in the Appendix.
Experimental setup for cryptographic API completion tasks. We train LSTM-based models for the task. For the next API completion task, we train an LSTM-based sequence model to accept a sequence of API methods or constants (\(t_1, t_2, \dots , t_{n-1}\)) and output the next API \(t_n\). For the next API sequence completion task, we train an LSTM-based seq2seq (encoder-decoder) model to accept the first half of an API sequence (\(t_1, t_2, \dots , t_{n}\)) and predict the second half (\(t_{n+1},t_{n+2}, \dots , t_{2n}\)).
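Turning an API sequence into training examples for the next API completion task can be sketched as follows. The sequence is illustrative, and the 10-step cap mirrors the LSTM step limit stated below; the model architecture itself is omitted.

```python
def next_api_examples(sequence, max_steps=10):
    """Training examples for next API completion: the model receives the
    prefix (t1 .. t_{n-1}) and must output t_n. Prefixes are truncated to
    the 10-step limit used in our LSTM setup."""
    examples = []
    for n in range(2, len(sequence) + 1):
        prefix = sequence[max(0, n - 1 - max_steps):n - 1]
        examples.append((prefix, sequence[n - 1]))
    return examples

seq = ["SecureRandom.<init>", "SecureRandom.nextBytes",
       "SecretKeySpec.<init>", "Cipher.getInstance", "Cipher.init"]
print(next_api_examples(seq)[0])
# (['SecureRandom.<init>'], 'SecureRandom.nextBytes')
```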
We filter our code dataset using CryptoGuard [42], a static cryptographic API misuse detection tool. We exclude insecure cryptographic API usage to prevent the embeddings and models from learning these vulnerable patterns [44]. This also eliminates situations in which a model predicts a secure API but the ground truth itself (from the original data) is insecure, so that such predictions would be counted as incorrect answers. This step contributes to a more accurate and meaningful evaluation of model performance. We limit the maximum number of LSTM steps to 10. We use a batch size of 1,024 and a learning rate of 0.001. The highest accuracy achieved within 10 epochs is recorded. These hyperparameters are selected because no obvious accuracy improvement was observed with longer training, smaller batch sizes, or smaller learning rates. We use a stacked LSTM architecture with vanilla LSTM cells for the LSTM-based models.
4.1.1 Bytecode vs. Program Slices vs. API Dependence Paths.
Tables 5 and 6 show the accuracy results of the next API completion and the next API sequence completion tasks, respectively. To uncover the impact of program analysis preprocessing, both the token-level embeddings (i.e., byte2vec, slice2vec, dep2vec) and the one-hot encoding baseline are used to train LSTM models on bytecode, slices, and dependence paths.
We observe that program analysis preprocessing shows significant benefits. Table 5 shows that the accuracy based on dependence paths is 92%, which is 9% and 36% higher than using slice-based and bytecode-based token-level embeddings, respectively. The API completion accuracy with one-hot encoding is also substantially improved by program analysis: it increases from 56% on bytecode to 72% on slices, and further to 86% on dependence paths. The results of the next API sequence completion (Table 6) are consistent with this conclusion: the accuracy achieved with byte2vec improves by 40.39% with slice2vec and by 44.60% with dep2vec.
4.1.2 Token-level Embedding vs. One-hot Vectors.
On each program analysis preprocessing representation, we compare the token-level embedding with the one-hot encoding baseline. We observe significant improvements from applying token-level embeddings on slices and dependence paths; however, the improvement on bytecode is limited. Table 5 shows that slice2vec improves the accuracy by 11% over its one-hot baseline, and dep2vec improves the accuracy by 6% over its one-hot baseline. These improvements suggest that slice2vec and dep2vec capture useful information. The same conclusion holds for the next API sequence recommendation task: slice2vec and dep2vec improve the accuracy over their baselines by around 21% and 6%, respectively. In contrast, byte2vec does not show any significant improvement over its one-hot baseline.
We also observe higher accuracy with larger LSTM hidden sizes, as expected. However, the accuracy gains from increasing the model size from LSTM-64 to LSTM-128, LSTM-128 to LSTM-256, and LSTM-256 to LSTM-512 become progressively smaller.
Overall, the best accuracy is achieved by dep2vec in both tasks: 92.04% in the next API completion task and 89.23% in the next API sequence completion task. Compared with the basic one-hot encoding on bytecode (no program analysis preprocessing), these represent substantial accuracy improvements (36% and 46%, respectively). Although all the measures, including token-level embedding, program analysis preprocessing, and increasing the model size, improve the accuracy, the two program analysis preprocessing strategies, program slicing and API dependence graph construction, are the most effective, resulting in accuracy differences of 22.03% and 12.10%, on average, respectively.
4.2 Performance Improvement from Sequence-level Embedding (RQ2)
Next, we evaluate the effectiveness of sequence-level embedding (RQ2) by comparing it with token-level embedding on bytecode, slices, and dependence paths, respectively. We fine-tune each sequence-level embedding (i.e., byteBERT, sliceBERT, or depBERT) with task-specific training before applying it to the API completion task. The models are then compared with unpretrained Transformer networks using token-level embeddings. We use two Transformer neural networks of different sizes, Transformer-base and Transformer-small. The Transformer-base model has 12 hidden layers of size 768 and 12 attention heads; the Transformer-small model has 4 hidden layers of size 512 and 4 attention heads. Results are shown in Table 7. We also show a comparison in Figure 4 and visualize the accuracy differences in Figure 9 in the Appendix.
4.2.1 Bytecode vs. Program Slices vs. API Dependence Paths.
Table
7 shows that program analysis preprocessing is still necessary even with sequence-level embeddings. The accuracy of using bytecode sequences is low (45.21% and 57.59%) compared with program slices and API dependence paths. With program analysis, the small and base Transformer neural networks with
depBERT achieve accuracies of 91.07% and 93.53%, respectively. This conclusion still holds when only token-level embedding is used. The small and base Transformer neural networks with
dep2vec achieve accuracies of 90.96% and 92.80%, respectively, which are 46.58% and 36.04% higher than those of
byte2vec on bytecode sequences.
According to Figure
4, we observe that program analysis remains the most significant factor in improving accuracy. The average accuracy differences achieved by program slicing and API dependence path construction are 33.55% and 7.80%, respectively, which are much larger than those brought by sequence-level embedding or a larger Transformer neural network. In Table
7, the small
depBERT that has program analysis preprocessing achieves an accuracy of 91.07%, which is 34.31% higher than the larger model without program analysis, namely, the base
byteBERT.
By comparing the Transformer with token-level embeddings in Table
7 and the LSTM with token-level embeddings in Table
5, we find that the LSTM-512 achieves slightly higher accuracy than the Transformer-small of comparable size (hidden size 512).
4.2.2 Sequence-level Embedding vs. Token-level Embedding.
Sequence-level embeddings only show slight advantages over token-level embeddings. As shown in Table
7, the accuracy trained with the sequence-level embeddings is only slightly higher (0.55%, on average) than that of the Transformer neural network with the corresponding token-level baselines. One possible reason for this slight improvement is the strong learning ability of the Transformer model. Through its attention mechanism, the model can effectively learn contextual information, even with limited embedded information in the input (token-level embedding) and no pretraining. Another possible reason is the simplicity of programming languages compared to natural languages. Sequence embedding helps capture the different meanings of the same word in various positions or contexts, e.g., the different interpretations of the word “like” in the sentence “I like the way you look like.” However, such conditions are less likely to occur in a programming language, leading to a smaller improvement for programming languages than for natural languages. Therefore, considering the cost, sequence-level embedding is not recommended in this case.
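The contrast between token-level and sequence-level embeddings can be illustrated with a minimal sketch; the vocabulary and vectors below are made up for illustration, not taken from our trained models:

```python
# Token-level (static) embedding: one vector per vocabulary token,
# so every occurrence of a token receives the same vector regardless
# of its position or context. The toy 2-D vectors below are made up.
table = {
    "I": [0.1, 0.3], "like": [0.9, 0.2], "the": [0.0, 0.1],
    "way": [0.4, 0.4], "you": [0.2, 0.8], "look": [0.7, 0.1],
}

sentence = "I like the way you look like".split()
vectors = [table[w] for w in sentence]

# Positions 1 and 6 are both "like": a static table cannot distinguish
# them, whereas a sequence-level model (e.g., BERT) would produce
# different, context-dependent vectors for the two positions.
assert vectors[1] == vectors[6]
```

Because program tokens rarely change meaning with position to the same degree, this limitation of static vectors costs less for code than for natural language.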
Besides, the impact of the neural network size is also more obvious than the impact of applying sequence-level embedding. As shown in Table
7, the base Transformer improves the accuracy by 12.38%, 1.43%, and 1.84%, on bytecode, slices, and dependence paths, respectively, compared with the small Transformer.
4.3 Cross-app Evaluation (RQ3)
Cross-app learning is a practical scenario in which a pretrained model is expected to apply to projects unseen during training. Therefore, we conduct experiments to verify whether our conclusions still hold for new apps that never appear in the training data.
Tables
8 and
9 show the API completion experiments on our advanced dataset (see Section
3.4), which follows the cross-app learning scenario. App sets 3, 4, 5, and 6 include apps that generate task-specific data. For every app category, we randomly select 80% of the apps to generate training data and the remaining 20% to generate testing data. In other words, our training and testing data are cross-app but within the same category.
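The per-category split described above can be sketched as follows; the app names, categories, and the helper itself are hypothetical and only illustrate the cross-app protocol, not the actual dataset:

```python
import random

def cross_app_split(apps_by_category, train_ratio=0.8, seed=42):
    """Split apps within each category so that training and testing
    data come from disjoint apps (cross-app) but the same category."""
    rng = random.Random(seed)
    train, test = [], []
    for category, apps in apps_by_category.items():
        apps = sorted(apps)          # deterministic order before shuffling
        rng.shuffle(apps)
        cut = int(len(apps) * train_ratio)
        train.extend(apps[:cut])
        test.extend(apps[cut:])
    return train, test

# Hypothetical example: 5 apps in each of two categories.
apps = {
    "finance": [f"fin_app_{i}" for i in range(5)],
    "social":  [f"soc_app_{i}" for i in range(5)],
}
train, test = cross_app_split(apps)
assert set(train).isdisjoint(test)   # no app contributes to both sides
```

Splitting at the app level, rather than the sample level, is what prevents code from one app from leaking into both training and testing data.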
Tables
8 and
9 compare sequence-level embeddings (i.e.,
depBERT and
byteBERT) with the corresponding token-level embeddings (i.e.,
dep2vec and
byte2vec).
DepBERT and
byteBERT are Transformer neural networks pretrained on app set 2 (see Table
3) with
the Masked Language Model (MLM) objective. We use the small Transformer neural network for all the experiments. Figure
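The MLM objective masks a fraction of input tokens and trains the model to recover them from context. Below is a minimal sketch of the masking step; the 15% rate follows the standard BERT recipe and is an assumption here, and the API path is a hypothetical token sequence:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Return the masked sequence plus the positions and original
    tokens the model must predict, as in MLM pretraining."""
    rng = random.Random(seed)
    masked, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok          # original token is the training label
            masked[i] = MASK
    return masked, labels

# A hypothetical API dependence path as a token sequence.
path = ["KeyGenerator.getInstance", "KeyGenerator.init",
        "KeyGenerator.generateKey", "Cipher.getInstance", "Cipher.init"]
masked, labels = mask_tokens(path)
```

The pretraining loss is then computed only at the masked positions, which is what lets the model learn from unlabeled code sequences.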
10 in the Appendix shows the accuracy differences achieved by program analysis and sequence-level embedding on App sets 3, 4, 5, and 6, respectively.
We observe conclusions about program analysis similar to those on the basic dataset. The experiments on API dependence paths (Table
8) again show significant advantages compared with bytecode sequences (Table
9). Program analysis preprocessing makes significant accuracy differences (16.95%, on average) in all situations.
A minor difference we observe is that sequence-level embedding brings a more obvious improvement than on the basic dataset. As shown in Table
8, the average improvement of applying the sequence-level embedding is 2.17%. This indicates that sequence-level embedding matters more when models are trained in the cross-app scenario. We also observe that sequence-level embedding substantially improves accuracy for small data sizes, achieving an accuracy 5.10% higher than the Transformer with
dep2vec.
From Table
9, we also observe an improvement from applying the sequence-level embedding
byteBERT on bytecode sequences. However, without program analysis, the improvement (0.86%, on average) is quite small.
4.4 Comparison with State-of-the-art (RQ4)
Besides the design choices we covered, we further experiment on two state-of-the-art sequence-level embeddings, GraphCodeBERT [
22] and CodeBERT [
20]. GraphCodeBERT and CodeBERT are general-purpose code embedding models pretrained by Microsoft. They adopt the Transformer-based neural architecture and are pretrained on the CodeSearchNet dataset, which includes 2.3 million functions across six programming languages paired with natural language descriptions. The differences between them lie in their code preprocessing and sequence-level embedding tasks. CodeBERT treats code as a sequence of tokens and is pretrained with MLM. GraphCodeBERT uses program analysis to extract dataflow information as input and is pretrained with two extra structure-aware tasks introduced by its authors.
Table
10 shows the next API completion experiments on our app sets 4, 5, and 6. We decompiled .apk files into source code for the neural network inputs, although some information might be lost due to obfuscation. However, the amount of information loss caused by obfuscation is the same for our three methods (i.e., bytecode sequences, slices, and dependence paths), CodeBERT, and GraphCodeBERT. Therefore, we consider it a fair comparison. For each cryptographic API call, we extract two types of source code context: the method-level context and the class-level context. The former consists of the preceding code within the enclosing method of the target call, while the latter collects the preceding code lines found in the same class as the target call. We fine-tune the two models with our data for 10 epochs with batch size 16. We use this setting because no substantial improvement is observed with longer training or smaller batch sizes.
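The two context types can be illustrated with a simplified sketch over decompiled source lines; the helper functions and the snippet are hypothetical, and the real extraction involves parsing the decompiled code rather than fixed line indices:

```python
def method_level_context(lines, call_idx, method_start):
    """Code preceding the target call within its enclosing method."""
    return lines[method_start:call_idx]

def class_level_context(lines, call_idx, class_start):
    """Code preceding the target call within its enclosing class."""
    return lines[class_start:call_idx]

# Hypothetical decompiled snippet; the target call is on line 5.
src = [
    "class CryptoHelper {",                          # 0: class starts
    "  byte[] key;",                                 # 1
    "  void setup() { key = loadKey(); }",           # 2
    "  byte[] encrypt(byte[] data) {",               # 3: method starts
    '    Cipher c = Cipher.getInstance("AES");',     # 4
    "    c.init(Cipher.ENCRYPT_MODE, spec);",        # 5: target call
]
assert method_level_context(src, 5, 3) == src[3:5]
assert class_level_context(src, 5, 0) == src[0:5]
```

In this sketch the class-level context includes the unrelated `setup()` method, illustrating why it can introduce irrelevant information that the method-level context avoids.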
We have three observations from Table
10. First, the best accuracy is achieved by GraphCodeBERT with the method-level context. However, the accuracy is still low, an average of 59.94%. Second, GraphCodeBERT substantially outperforms CodeBERT under identical data and context settings. When using method-level context, GraphCodeBERT achieves 20.07% higher accuracy than CodeBERT, on average. When using class-level context, GraphCodeBERT achieves 6.34% higher accuracy, on average. This confirms our findings 1 and 3 that program analysis contributes a substantial improvement to the embeddings. Third, method-level context is much better than class-level context. With GraphCodeBERT, the method-level context outperforms the class-level context by 22.29% accuracy, on average. With CodeBERT, the method-level context results in an 8.80% higher accuracy, on average. The reason might be that the class-level context includes much more irrelevant information, which worsens the prediction.
Furthermore, with the rapid development of large language models, their application in code completion and code repair has been discussed widely. Recent work [
45] evaluates ChatGPT, a conversational language model, on bug fixing and code repair in Python. The results show that ChatGPT is able to fix 19 out of 40 simple bugs, comparable to other state-of-the-art solutions. However, while the results look promising, the queries are simple code snippets of only a few lines. It remains unclear how well ChatGPT can parse complex code contexts in large programs. Its performance in identifying vulnerable code (beyond simple syntactic and logic bugs) and providing secure code suggestions by itself could form an interesting research topic. Other code completion tools powered by large language models, such as Copilot, have also been released in recent years. These models are trained on a huge amount of source code without any program analysis preprocessing. Due to limited resources, we are unable to train comparable models from scratch on program-analysis-processed data. We leave those comparisons for future work.
We summarize our major findings from experiments.
—
Program analysis preprocessing is very important even with advanced embedding options. With all the embedding options (sequence-level embedding, token-level embedding, or one-hot encoding), program analysis brings large improvements in API completion accuracy. Without program analysis, the best accuracy on bytecode with the most advanced byteBERT is only 57.59%. With API dependence graph construction, depBERT on dependence paths achieves the highest accuracy of 93.52% on the basic dataset.
—
Applying token-level embedding in API completion task training makes a substantial improvement on program-analysis-processed code corpora. On slices and dependence paths, the LSTM models trained with the token-level embeddings slice2vec and dep2vec show significant accuracy improvements of 12% and 5%, respectively, compared with the one-hot vectors.
—
The accuracy improvement from sequence-level embedding (0.55%, on average) is not obvious under the basic setting. Hence, we do not recommend sequence-level embedding in that case. Meanwhile, we observe more significant improvements (5.10%) from sequence-level embedding under the cross-app scenario when the task-specific data size is small (App set 6). Thus, we recommend it for cross-app scenarios with small task-specific data.
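As a side note on the one-hot baseline referenced in the findings, it can be sketched in a few lines; the mini-vocabulary is hypothetical. Unlike trained token-level embeddings, one-hot vectors are sparse and mutually orthogonal, so they encode no similarity between tokens:

```python
def one_hot(token, vocab):
    """One-hot encoding: a vector with a single 1 at the token's
    vocabulary index; it carries no learned similarity information."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(token)] = 1.0
    return vec

# Hypothetical mini-vocabulary of API tokens.
vocab = ["Cipher.getInstance", "Cipher.init", "Cipher.doFinal"]
v = one_hot("Cipher.init", vocab)
assert v == [0.0, 1.0, 0.0]

# Any two distinct one-hot vectors are orthogonal (dot product 0),
# which is why one-hot baselines cannot express token similarity.
assert sum(a * b for a, b in zip(v, one_hot("Cipher.doFinal", vocab))) == 0.0
```

This is the property that trained token-level embeddings such as dep2vec improve upon: related APIs end up with nearby dense vectors instead of orthogonal sparse ones.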
4.5 Analogy Tests of Token-level Embedding
We perform analogy tests to intuitively show the quality of token-level embeddings. Besides the impact on downstream tasks, good embedding vectors should also reflect the semantics of a token and its relationships with other tokens. In natural language processing, the quality of embedding is usually evaluated through analogous pairs (e.g.,
\(men - women \approx king - queen\) ) [
31,
32,
33]. Therefore, following the practice in the natural language field, we design a few analogy tests to help understand the quality of API embeddings based on different program analysis methods. In our work, we define analogous pairs as two pairs of APIs or constants, (
a and
\(a^{\prime }\) ) with (
b and
\(b^{\prime }\) ), having a high degree of relational similarity (i.e., analogous) in terms of some programming property. For Java cryptographic code, we identify four categories of analogous pairs as follows. We show examples in Table
11.
Direct Dependency. Two APIs, where one always accepts the other’s output, form a pair with a direct dependency. For example, after a KeyGenerator instance is created by KeyGenerator.getInstance(.), it always needs to be initialized through KeyGenerator.init(.). The analogous relation can also be found between KeyStore.getInstance(.) and KeyStore.load(.), where the latter loads the required information into the KeyStore instance created by the former. We view these two pairs as analogous pairs under this category.
Semantic Symmetry. Of the two classes KeyGenerator and KeyPairGenerator, the former generates secret keys for symmetric cryptography while the latter generates key pairs for asymmetric cryptography. There is a symmetry relationship between their APIs. For example, both have the API getInstance(String) to create instances and APIs to generate keys.
Argument Symmetry. There is an analogous relation between API-constant pairs. For example, the symmetric cipher "AES" can be passed to javax.crypto.KeyGenerator: javax.crypto.KeyGenerator getInstance(java.lang.String) as an argument. For asymmetric ciphers, "RSA" and the API java.security.KeyPairGenerator: java.security.KeyPairGenerator getInstance(java.lang.String) have a similar relation.
Syntactic Variants. Some APIs share the same name but differ in their full signatures. These APIs are functionally equivalent but have different types of arguments or return values. We name them syntactic variants. For example, there are several APIs with the same name doFinal(.) in the Java classes Cipher and MAC.
Based on the analogous pairs, we define 14 tests. We calculate the vector of the embedded object
\(b^{\prime }\) based on the other three vectors of
a,
\(a^{\prime }\) , and
b. If the actual embedding vector of
\(b^{\prime }\) appears in the top
k nearest list of the calculated one (ideal value of
\(b^{\prime }\) ), then we say this analogy achieves rank
k. Examples of how to calculate rank
k are shown in Figure
5. The results of the 14 tests for
dep2vec,
slice2vec, and
byte2vec are listed in Table
12.
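The rank-k computation can be sketched with toy vectors; the 2-D vectors and the mini-vocabulary below are made up for illustration, as real embeddings are higher-dimensional and the vocabulary far larger:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def analogy_rank(emb, a, a_prime, b, b_prime):
    """Compute the ideal vector b' ≈ a' - a + b and return the rank of
    the true b' among candidates ordered by cosine similarity."""
    ideal = [ap - av + bv for av, ap, bv in zip(emb[a], emb[a_prime], emb[b])]
    scored = sorted(((cosine(ideal, v), tok) for tok, v in emb.items()
                     if tok not in (a, a_prime, b)),
                    key=lambda s: -s[0])
    return 1 + [tok for _, tok in scored].index(b_prime)

# Toy embedding table (hypothetical 2-D vectors).
emb = {
    "KeyGenerator.getInstance": [1.0, 0.0],
    "KeyGenerator.init":        [1.0, 1.0],
    "KeyStore.getInstance":     [0.0, 0.2],
    "KeyStore.load":            [0.0, 1.2],
    "Cipher.doFinal":           [-1.0, -1.0],
}
# Direct-dependency analogy: getInstance -> init vs. getInstance -> load.
rank = analogy_rank(emb, "KeyGenerator.getInstance", "KeyGenerator.init",
                    "KeyStore.getInstance", "KeyStore.load")
```

An analogy achieves rank k when the true vector of b' falls within the k nearest neighbors of the computed ideal vector, so lower ranks indicate higher-quality embeddings.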
In this small-scale evaluation of analogous pairs, dep2vec performs the best, achieving the best rank in 12 of the 14 test cases. slice2vec does well in some cases but performs poorly in the syntactic variants category. This is likely because syntactic variant APIs usually appear in different contexts in slices, making slice2vec fail to recognize their similarity. For other, more complicated relationships such as semantic symmetry or argument symmetry, the APIs and constants belonging to a pair often appear far from each other in the code, increasing the difficulty of the test.
6 Related Work
There are two main branches of code embedding solutions.
Embedding without program analysis. First, a line of research develops pure data-driven solutions on general source code tokens without program analysis [
3,
12,
14,
20,
26,
27,
46,
49]. They train neural networks that take programs as input, treated as sequences of source code tokens. In Reference [
12], Buratti et al. claimed that the language model built on top of raw source code is able to discover
abstract syntax tree (AST) features automatically.
Embedding with program analysis. Second, some studies (e.g., References [
4,
9,
23,
58]) leverage the program structural information through program analysis. For example, the authors of Reference [
58] learned code embedding after constructing graph representations (e.g., control flow graphs, data flow graphs) of code. Hellendoorn et al. [
23] advocated a hybrid embedding method that considers both the graph structure and the raw sequences to overcome the size limit of graphs. To remove noise in code, Henkel et al. first performed intra-procedural symbolic execution and then trained embedding vectors of symbolic abstractions from symbolic traces [
24]. However, there have not been systematic studies on how various hybrid approaches compare with a pure data-driven approach or with each other, in terms of downstream task performance.
Since there are various program
intermediate representations (IRs) under program analysis, the embedding objects also vary from approach to approach. For example, Henkel et al. obtained embeddings for self-defined symbolic abstractions. Ding et al. [
18] obtained embedding vectors
asm2vec for assembly code instructions. Ben-Nun et al. [
7] embedded LLVM IR instructions of code. Although these approaches share the idea of leveraging program structural information in embeddings, their embeddings for low-level instructions or LLVM IRs cannot be directly compared with embeddings for API elements. Our
dep2vec and
depBERT can be viewed as graph-based embedding approaches applied to API elements.
A line of work focuses on API embeddings and related tasks [
6,
11,
13,
19,
21,
35,
36,
56]. Our work also lies in this category. Nguyen et al. [
35,
36] used API sequences in source code to produce embeddings for Java APIs and C# APIs. Using these vectors, they successfully mapped semantically similar Java APIs to C# APIs. Our
byte2vec can be viewed as similar to their approach, as our API call sequences from bytecode follow an order similar to that of their source code sequences. Chen et al. [
13] trained the API embedding based on the API description (name and documents) and usage semantics. The obtained API embeddings are used to infer the likely analogical APIs between third-party libraries. However, these solutions employ embeddings to help map analogical APIs, which is different from our task, API completion. In API completion work [
34,
37,
38,
43,
47], there is no discussion of the impacts of different embedding options.