We argue that the dependency among the generated code tokens is not as strong as in NMT (where NAR has been successfully applied), and that the left-to-right generation order is not always optimal for code completion. To verify these assumptions, in this section we conduct an empirical study to analyze the dependency among target tokens in code completion and to answer the following questions: (1) Is the left-to-right generation order always optimal for line-level code completion? (2) How strong is the dependency among the target code tokens compared with that among natural language tokens in NMT?
2.1 Model Architecture Design Choice
There are two alternative architectures for performing full-line code completion: the encoder-decoder architecture and the decoder-only architecture. These two architectures differ in the following ways:
(1) Training process: The decoder-only model is trained in the same way as a language model, i.e., it learns to predict the probability distribution of the next token given the previous context, so every token in the training programs is used to supervise the model. In contrast, the encoder-decoder model is trained to predict the target code sequence given the source sequence, and the cross-entropy loss is computed only over the target tokens. In other words, the decoder-only model receives a denser supervision signal (see the sketch after this list).
(2) Cross-attention: In addition to self-attention, the encoder-decoder model also contains cross-attention, which introduces information from the input sequence into the decoding layers.
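To make the difference in supervision concrete, the following sketch contrasts how the two losses are typically computed. The tensor shapes and the random stand-ins for model outputs are illustrative assumptions, not the exact setup used in this article.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
ctx_len, tgt_len = 20, 8   # context tokens vs. target (next-line) tokens

# Decoder-only: one token stream; every position predicts its successor,
# so ctx_len + tgt_len - 1 positions contribute to the loss.
tokens = torch.randint(0, vocab_size, (1, ctx_len + tgt_len))
logits = torch.randn(1, ctx_len + tgt_len, vocab_size)        # stand-in for model output
lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                          tokens[:, 1:].reshape(-1))

# Encoder-decoder: the encoder reads the context, the decoder emits the target
# line, and the cross-entropy is computed over the tgt_len target tokens only.
target = torch.randint(0, vocab_size, (1, tgt_len))
dec_logits = torch.randn(1, tgt_len, vocab_size)               # stand-in for decoder output
seq2seq_loss = F.cross_entropy(dec_logits.reshape(-1, vocab_size),
                               target.reshape(-1))
```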
In this section, we conduct a preliminary experiment to investigate the suitability of the two architectures for line-level code completion and their performance with varying target lengths. The results are shown in Table 1, where ES and EM represent the Edit Similarity score and the Exact Match accuracy, respectively; a comprehensive explanation of these metrics can be found in Section 4.2. The results indicate that the decoder-only architecture performs better on all metrics when completing shorter code sequences (maximum length of 10). However, as the target length increases, the encoder-decoder model outperforms the decoder-only model significantly. Moreover, the EM scores of both architectures drop substantially as the sequence length increases, clearly indicating the difficulty of accurately completing longer sequences. Completing longer code sequences is more challenging because it requires a full understanding of the semantic information in the contextual code to accurately predict the target tokens. With an explicit cross-attention mechanism, the encoder-decoder model can effectively model the dependency between the contextual tokens and the target tokens. On the other hand, completing shorter code sequences, which is comparatively easier, benefits from the decoder-only model, as it is trained more extensively on a larger number of training data points (every token in the training programs is used).
Based on these results and observations, we ultimately decided to employ the encoder-decoder architecture for both the empirical study and SANAR. We expect this choice to not only improve efficiency but also enhance the quality of code completion, particularly for longer code sequences.
2.2 Reversing the Order of Code Generation
To answer the first question, we conduct an experiment to analyze the impact of different generation orders. Following existing work [43, 49], we employ the autoregressive Transformer architecture to perform line-level code completion in two opposite orders: left-to-right (Transformer-L2R) and right-to-left (Transformer-R2L). The model architecture and configurations remain the same as Transformer-base [47] and our approach (Section 4). Specifically, we utilize a 6-layer encoder and a 6-layer decoder with a model size of 512 and an intermediate size of 2,048, along with 64 attention heads per block. The models are trained from scratch on our data.
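For reference, the following is a minimal sketch of how this configuration could map onto an off-the-shelf Transformer, assuming torch.nn.Transformer and default choices for everything not stated (embeddings, positional encodings, output projection, dropout); the actual implementation may differ in these details.

```python
import torch.nn as nn

# Assumed mapping of the stated configuration onto torch.nn.Transformer;
# embedding layers, positional encodings, and the output head are omitted.
backbone = nn.Transformer(
    d_model=512,            # model size
    nhead=64,               # attention heads per block (head dim = 512 / 64 = 8)
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # intermediate size
    batch_first=True,
)
```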
As shown in Figure 1, these two models have the same architecture; the only difference is the generation order of the target code sequence: Transformer-L2R performs completion in a left-to-right manner, whereas Transformer-R2L performs completion in a right-to-left manner. Transformer-L2R predicts code sequences from left to right, capturing the left-to-right dependency in the target code snippets; conversely, Transformer-R2L captures the right-to-left dependency in the target. From an empirical standpoint, we argue that the left-to-right order is not always the optimal coding order for programmers, and therefore may not be the optimal decoding order for the decoder. Different generation orders result in distinct conditional contextual information, thereby affecting the difficulty of correctly predicting subsequent tokens. With this in mind, this experiment aims to determine whether there are differences in difficulty when learning the dependency in different directions, by comparing the performance of the two generation orders.
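A common way to obtain the R2L variant, and the one assumed in the sketch below, is to leave the architecture untouched and simply reverse the target token sequence before training, then reverse the predicted sequence back at inference time; the helper functions are hypothetical illustrations of that preprocessing.

```python
# Hypothetical preprocessing for the two generation orders: the model itself is
# unchanged; only the order of the target tokens differs.
def make_example(context_tokens, target_tokens, order="l2r"):
    """Build one (source, target) training pair for Transformer-L2R / -R2L."""
    if order == "r2l":
        target_tokens = list(reversed(target_tokens))   # train to emit the line right-to-left
    return context_tokens, target_tokens

def postprocess(prediction, order="l2r"):
    """Undo the reversal so both models are evaluated on the same surface form."""
    return list(reversed(prediction)) if order == "r2l" else prediction

# Usage: the same target line, seen in the two opposite orders.
ctx = ["def", "add", "(", "a", ",", "b", ")", ":"]
tgt = ["return", "a", "+", "b"]
print(make_example(ctx, tgt, "l2r")[1])   # ['return', 'a', '+', 'b']
print(make_example(ctx, tgt, "r2l")[1])   # ['b', '+', 'a', 'return']
```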
We use the Python [38] and Java [2] benchmark datasets to evaluate their performance, where each model is employed to predict the next line of code given the previous (up to) 10 lines. We adopt the BLEU-4 score, Exact Match accuracy (EM), and Edit Similarity score (ES) as metrics to evaluate the quality of the generated code. Table 2 shows the completion results of each model. As seen from the results, the overall performance of completing code in the two generation orders is comparable. Surprisingly, the right-to-left order even outperforms the left-to-right order in terms of BLEU-4 score. These results suggest that the standard left-to-right generation process is not always optimal for code completion; other generation orders or manners are also feasible.
Furthermore, we present in Table 3 the percentage of target code lines that can only be correctly predicted by each model. For Python, 2.65% and 2.47% of the target lines can only be correctly generated by L2R and R2L, respectively, and 6.57% and 7.21% can only be approximately generated (edit similarity > 50%) by L2R and R2L, respectively. In addition, more than 60% of the target code lines can be approximately generated by both L2R and R2L. The results on the Java dataset are similar. These results further confirm that the standard left-to-right generation order is not always optimal for completing correct code lines. In this article, we take the first step toward completing code in parallel, i.e., generating code tokens non-autoregressively.
2.3 Quantitative Dependency among Code Tokens
To answer the second question, following Ren et al. [39], we build a Dependency Analysis Model (DAM) to measure the quantitative dependency among the generated tokens via the attention density ratio, and we compare the measured dependency among code tokens against that among natural language tokens. Considering that NAR models have been successfully applied to NMT tasks, we select NMT for comparison.
To measure the dependency among target tokens and compare it with the dependency on source tokens, we have the following considerations in the design of DAM: (1) predict the current masked target token based on the bi-directional target context and the source tokens; (2) ensure that the dependency on source and target tokens is comparable. However, neither the encoder nor the decoder of the encoder-decoder architecture can fulfill these requirements:
— Encoder: cannot attend to the target tokens.
— Decoder: although both cross-attention and self-attention are available (attending to the source tokens and the previous target tokens), the decoder cannot attend to the right context of the target sequence, and it is also hard to ensure that the dependency on source and target tokens is comparable.
For this reason, we build a variant of the Transformer encoder following Ren et al. [39], which applies mix-attention to calculate the attention weights on both source and target tokens in a single softmax function. Specifically, we build a Transformer encoder model that takes the whole source sequence and the partially masked target sequence as its input and learns to predict the masked target tokens. DAM utilizes a mix-attention [18] mechanism, where source tokens can only attend to source tokens, while target tokens can attend to all source and target tokens. By learning to predict masked target tokens based on the source context and the target context, DAM learns to allocate different ratios of attention weights to source tokens and target tokens in the mix-attention. After convergence, we measure the dependency among target tokens using the trained DAM by calculating the ratio of the attention density \(\alpha _i\) in the target context to that in the full context when predicting a specific target token \(y_i\). It is defined as follows:
\[
\alpha _i = \frac{\sum _{j=1}^{N} A_{i,j}}{\sum _{j=1}^{N+M} A_{i,j}},
\]
where \(A_{i,j}\) denotes the attention weight from token \(i\) to token \(j\) in the mix-attention, \(j \in [1,N]\) indexes the target tokens, and \(j \in [N+1,N+M]\) indexes the source tokens. \(M\) and \(N\) are the lengths of the source and target input, respectively, and \(\sum _{j=1}^{N+M} A_{i,j} = 1\). It is worth noting that the attention scores of the multiple heads are aggregated through averaging. According to existing studies [8, 50, 54], the Transformer has been loosely shown to encode local syntax (e.g., attending more to adjacent tokens) in the lower layers and more semantic and task-specific knowledge in the higher layers. In this experiment, we aim to analyze the token dependency from a semantic perspective rather than focusing on shallow information such as lexical relationships; thus, we compute the attention from the last layer.
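As a concrete illustration, the snippet below sketches how the mix-attention mask can be constructed and how \(\alpha _i\) can be read off a last-layer attention matrix that has already been averaged over heads. The layout (target tokens at positions 1..N followed by source tokens at positions N+1..N+M) follows the notation above; the function names and the random toy input are our own assumptions.

```python
import numpy as np

def mix_attention_mask(n_target, m_source):
    """Boolean mask (True = may attend): source tokens attend only to source
    tokens, while target tokens attend to all source and target tokens."""
    total = n_target + m_source
    mask = np.zeros((total, total), dtype=bool)
    mask[:n_target, :] = True                 # target rows: attend everywhere
    mask[n_target:, n_target:] = True         # source rows: attend to source only
    return mask

def attention_density_ratio(attn, n_target):
    """alpha_i for every target position i, given a last-layer attention
    matrix `attn` (already averaged over heads, each row sums to 1)."""
    target_mass = attn[:n_target, :n_target].sum(axis=1)   # attention paid to the target context
    full_mass = attn[:n_target, :].sum(axis=1)             # = 1 by construction
    return target_mass / full_mass

# Toy usage with random attention weights (rows normalized to 1).
N, M = 4, 6
attn = np.random.rand(N + M, N + M) * mix_attention_mask(N, M)
attn /= attn.sum(axis=1, keepdims=True)
print(attention_density_ratio(attn, N))       # one alpha_i per target token
```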
Following Ren et al. [39], we modify the attention mask in the Transformer model to change the attention scope and apply the mask to all the attention layers and heads. DAM is trained to predict the masked tokens, and \(\alpha _i\) is computed for those masked tokens. Different masking ratios \(p\) might give different statistical results. For a given masking probability \(p\), the final attention density ratio \(R(p)\) is calculated by averaging \(\alpha _i\) over all test data:
\[
R(p) = \frac{1}{|\mathcal {M}^p|} \sum _{i \in \mathcal {M}^p} \alpha _i,
\]
where \(\mathcal {M}^p\) denotes the masked token set under masking probability \(p\). Since \(\alpha _i\) is the attention density ratio on the target context when predicting target token \(i\), a higher value of \(\alpha _i\) indicates that the prediction of token \(i\) relies more on the target context, implying a stronger dependency among the target tokens. \(R(p)\) averages \(\alpha _i\) over all predicted tokens in the test set under masking probability \(p\); thus, a larger attention density ratio \(R(p)\) indicates a greater dependency among target tokens across the entire test set. In the extreme case of \(p=1\), DAM reads an all-masked target sequence as input and makes predictions based only on the source sequence, resembling the Fully-NAR model.
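Aggregating into \(R(p)\) is then a plain average of \(\alpha _i\) over the tokens that were masked at ratio \(p\); the sketch below follows the layout of the previous snippet and is likewise only illustrative (the argument names are hypothetical).

```python
import numpy as np

def attention_density_ratio_at_p(attn_matrices, target_lens, masked_positions):
    """R(p): average alpha_i over all masked target tokens in the test set.

    attn_matrices    -- list of last-layer, head-averaged attention matrices
    target_lens      -- list of target lengths N, one per example
    masked_positions -- list of index arrays, the target tokens masked at ratio p
    """
    alphas = []
    for attn, n, masked in zip(attn_matrices, target_lens, masked_positions):
        alpha = attn[:n, :n].sum(axis=1) / attn[:n, :].sum(axis=1)  # alpha_i per target token
        alphas.extend(alpha[masked])                                # keep only masked tokens
    return float(np.mean(alphas))
```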
We employ DAM to conduct experiments on the Python [38] and Java [2] datasets for the line-level code completion task. Additionally, we apply DAM to the IWSLT 2014 German-English (De-En) translation dataset² for the NMT task. Since a larger \(p\) naturally gives a lower \(R(p)\) because of the limited target context, which further diminishes the difference in \(R(p)\) between tasks, we compute statistics at relatively low masking probabilities {0.15, 0.3, 0.5}.
The results are shown in Figure 2. The subfigures on the left and right correspond to the results obtained from the last and first layers, respectively. For the last layer, we find that the attention density ratio for NMT is larger than that for code completion at every masking probability \(p\), which demonstrates that the dependency among the target tokens in code completion is weaker than in NMT. Within the code completion task, we observe a notable difference in the attention density ratio between Java and Python: the ratio for Java is higher than for Python, implying that the dependency among the target tokens in Python is comparatively weaker than that in Java. We suspect that this disparity can be attributed to Python being a dynamic language, where the overall relevance among tokens tends to be lower than in Java; as a result, the interdependence between tokens may be less pronounced in Python, leading to a lower attention density ratio in the code completion task. Also, \(R(p)\) for code completion is closer to 0.5, which means that DAM pays more balanced attention to the source and target sides compared with NMT.
Regarding the results of the first layer, we observe that the attention density ratio scores computed from the first layer are consistently higher than those from the last layer across all three tasks. This suggests that the inter-dependency among target tokens is greater in the lower layers, as the attention weights tend to give more importance to adjacent tokens. Furthermore, we observe a consistent ordering of the scores for the three tasks in both the first and last layers. These findings reinforce our previous results and provide additional evidence about the distribution of attention throughout the model.
Based on the above results, we argue that it is possible to predict code tokens in parallel. To this end, we propose SANAR, a Non-autoregressive model for statement-level code completion.