
Information and Software Technology 143 (2022) 106765


Review4Repair: Code review aided automatic program repairing


Faria Huq a , Masum Hasan a , Md Mahim Anjum Haque a , Sazan Mahbub a , Anindya Iqbal a ,∗,
Toufique Ahmed b
a Bangladesh University of Engineering and Technology, Bangladesh
b University of California, Davis, CA, USA

ARTICLE INFO

Keywords: Automatic program repair; Deep learning; Code review

ABSTRACT

Context: Learning-based automatic program repair techniques are showing promise in providing quality fix suggestions for bugs detected in software source code. These tools mostly exploit historical data of buggy and fixed code changes and are heavily dependent on bug localizers when applied to a new piece of code. With the increasing popularity of code review, the dependency on bug localizers can be reduced. Besides, code review-based bug localization is more trustworthy since reviewers' expertise and experience are reflected in these suggestions.
Objective: The natural language instructions written in review comments are a rich source of information about a bug's nature and its expected solution. However, to the best of our knowledge, none of the learning-based tools has utilized review comments to fix programming bugs. In this study, we investigate the performance improvement of repair techniques achievable by using code review comments.
Method: We train a sequence-to-sequence model on 55,060 code reviews and associated code changes. We also introduce new tokenization and preprocessing approaches that help to achieve significant improvement over state-of-the-art learning-based repair techniques.
Results: We boost the top-1 accuracy by 20.33% and the top-10 accuracy by 34.82%. We can also provide suggestions for stylistic and non-code errors unaddressed by prior techniques.
Conclusion: We believe that the automatic fix suggestions generated by our approach alongside code review comments would help developers address review comments quickly and correctly, and thus save their time and effort.

1. Introduction

Code Review has been prevalent in the Software Engineering community for a long time now. In 1976, Fagan introduced a highly structured process for reviewing code [1], which involved intensive line-by-line code inspection. Therefore, it required a significant amount of development time. Over the last few decades, the nature of code review has changed a lot, becoming more informal and tool-based. Big companies such as Microsoft [2,3], Google [4], Facebook [5] and also the Open Source Software (OSS) projects [3,6] have adopted lightweight review practices that accelerate the review process. Moreover, we now have several capable review tools (e.g., Gerrit, ReviewBoard, Github pull-based reviews, and Phabricator). These tools enable the reviewer to make inline comments highlighting specific issues that the developer needs to address.

Fixing defects in a program is inherently tedious and expensive, accounting for nearly 50% of the total cost in software development [7]. Researchers are trying to provide better solutions by automating these processes [8,9]. Classic automatic program repair techniques attempt to modify a program with the help of a specification of the intended program behavior, such as a test suite [10-19]. Practically, a well-specified test suite is challenging to create, and the generated solutions overfit to a weakly specified test suite [8,20,21]. Recent improvements in advanced machine learning techniques, especially deep learning, and the availability of many patches are encouraging learning-based repair. Instead of relying on a test suite, these techniques rely on previous fixes of similar code defects. However, they are yet to achieve acceptable quality in most cases. One attractive attribute of code review is the symbiosis of informal natural language comments by the reviewer and the more formal, well-defined structured code authored by the developer. Can we reduce the developers' effort by partially automating the bug-fix or refactoring using the review given by the reviewer? Most of the learning-based repair tools have to depend on bug-localizers to apply fixes to a new code.

∗ Corresponding author.
E-mail addresses: 1505052.fh@ugrad.cse.buet.ac.bd (F. Huq), masum@ra.cse.buet.ac.bd (M. Hasan), mahim@vt.edu (M.M.A. Haque),
1505020.sm@ugrad.cse.buet.ac.bd (S. Mahbub), anindya@cse.buet.ac.bd (A. Iqbal), tfahmed@ucdavis.edu (T. Ahmed).

https://doi.org/10.1016/j.infsof.2021.106765
Received 30 September 2020; Received in revised form 4 February 2021; Accepted 29 October 2021
Available online 24 November 2021
0950-5849/© 2021 Elsevier B.V. All rights reserved.

Fig. 1. Impact and importance of code review comment in generating correct code change.

Code review comments can be utilized as trustworthy bug-localizers because some experienced developers verify the bugs' location. We envision that code review comments can play an essential role in improving the quality of fix suggestions by providing insight from the reviewer's experience and expertise, localizing the bugs effectively. We illustrate two real-world examples in Fig. 1 (taken from Eclipse [22]) where the source code snippets were similar, but the reviewer's comments were different, which led to two very different solutions. This paper aims to utilize such communication between the reviewer and the developer, improving on state-of-the-art deep learning-based automatic program repair approaches. Note that our designed models can work both with and without code reviews. Whereas the fix suggestions made by the model following review comments might be termed "translation" or "improvement of a program based on a natural language specification" instead of "repair", the other model that works only by analyzing the code falls under the category of program repair techniques. Therefore, we would like to term our approach a program repair technique in the broader sense that covers both translation and repair of source code.

To investigate the impact of code review, we have designed a Neural Machine Translation (NMT) model based on the pointer generator network [23] that learns jointly from code review comments written in natural language and corresponding code changes. When the reviewer submits a review, the model generates candidate fix suggestions for the intended change. These are visible to the program author, who can select the best one from the suggestions. Thus, the time and effort needed for program repair can be reduced, especially for inexperienced developers who might not be readily aware of the solution. We conduct several data preprocessing steps, including new tokenization techniques termed hard and soft tokenization. The entire workflow of our system is shown in Fig. 2. It also presents the usage scenario of the trained model, where multiple fix suggestions are generated along with a code review, from which the developer/code author can easily select the appropriate one.

Our approach covers a wide range of commonly reported issues in code reviews and addresses more types of defects than other works. Specifically, we can suggest fixes for stylistic changes (e.g., indentation and formatting) that increase the readability of the code and for non-code issues (e.g., comments, annotations, logs, copyright issues, etc.).

We have also systematically developed a taxonomy of fixes generated by the tool by studying 501 random samples. We have identified four categories (bug fix, refactoring, stylistic change, and non-code change) and 47 sub-categories.

Our contributions are as follows:

1. We develop and publicly release 55,060 training data and 2,961 test data of code changes and their associated code reviews.¹
2. We develop sequence-to-sequence learning models based on one of the best performing summarization networks, followed by extensive preprocessing, new tokenization, and vocabulary creation. We show that utilizing code review and source code improves the repair accuracy by 20.33% in Top-1 prediction and 34.82% in Top-10 prediction. Our tool significantly outperforms state-of-the-art learning-based program repair techniques [24,25]. The source code of the developed tool has also been released to encourage reproduction.²
3. We provide fixes utilizing code reviews for stylistic and non-code issues along with bug fixes and refactoring, whereas prior works are limited to addressing only the last two types. We conduct a systematic analysis of 501 randomly selected samples to develop a taxonomy of fixes. We found 47 subcategories of generated fixes, depicting our model's ability to learn a wide variety of solutions.

2. Motivating example

In Fig. 3, we illustrate an example demonstrating the utility of our approach. In this scenario, the reviewer commented that the if condition should be changed to prevent it from always evaluating to true. Given this comment, the developer is supposed to omit the if condition and leave other portions unchanged. We aim to mimic this activity in our approach, i.e., given the source code and the reviewer's comment, our approach should generate the fix that addresses the issues mentioned in the comment.

This research explores whether the quality of fix suggestions can be improved by utilizing the suggestion given in code review comments. First, we built a sequence-to-sequence (seq2seq) network [23] (see Section 5.1) using only the code changes, and the model achieves performance comparable with state-of-the-art techniques [24]. We analyzed the examples that could not be properly fixed and considered that there may be improvements in some cases if the review is also fed to our model. Accordingly, we designed another model that takes review comments as input in addition to code, and we observed improvements in some cases. Fig. 3 shows one example of an exact fix generated when we pass both the source code and the reviewer's comment.

¹ Zenodo: https://doi.org/10.5281/zenodo.4445747.
² Github: https://github.com/Review4Repair/Review4Repair.


Fig. 2. Complete workflow of our system.

Fig. 3. Example of how code review comment may help to generate better fix suggestion.

Note that although the generated result without the addition of the review comment is correct in logic and syntax, it is not the intended correct solution for the issue mentioned in this particular scenario. The review comment thoroughly guides this change, and any model trained without review comments is not likely to be able to address this issue. We show some more examples successfully generated by our model in Table 1.

3. Overview

In this section, we present an overview of our application scenario and the problem formulation.

3.1. System overview

The objective of this study is two-fold.

1. Suggesting fixes for a broader range of changes (defects) raised in peer code reviews.
2. Exploiting code review comments written in natural language to improve the quality of fix suggestions.

In a code review platform such as Gerrit, when a developer submits a code patch, a reviewer is assigned and notified to review the code. The reviewer inspects the code, and if the reviewer identifies a defect, he or she highlights one or multiple lines in the code and submits a code review. By defect, in this paper, we mean any issue discussed in the review comment that can be related to program functionality, a naming convention, coding style, or even spelling mistakes. The developer addresses the comment and submits a follow-up patch. Finally, when there are no more issues in the code, the reviewer approves the code, and it is merged with the main codebase.

By analyzing these code patches, we can identify the code fragment that was changed due to the code review. In this study, our goal is to create a learning-based system that can predict the changed code automatically by observing a large number of code changes and code review comments in historical data. Once deployed in a production environment, when a reviewer highlights a defect in a code file and writes a review comment, our model will produce multiple fix suggestions for the defect. The developer can choose one of the model's suggestions or write his/her own fix. Fig. 4 shows the different steps of the training phase of our model.


Table 1
Some examples successfully generated by our model when we pass both source code and review comment to it.
Deleted and inserted tokens are marked by red and green color respectively.

Fig. 4. Steps showing the training of automatic program repair with code review. The model learns to predict the code change (f) from the code before change (C_d), the defect location, and the code review comment (R). Replacing the defect d (red round box) with the code change (f) creates the fixed code (C_f).

3.2. Problem definition

We design the task of program repair as a sequence-to-sequence problem. We create two sequence learning models that attempt to repair a defect. Our first model model_cc is given the code before change C_d and the defect location l, along with the code review R, as input, and is tasked to predict the code change f for the defect d (Fig. 4).

Model I:
At training time, the location l is identified with our 'change localization' method (Section 4.2.1). After deployment, we assume that the review comment will localize the bugs, and our deep learning model will fix the error. Hence, the prediction by model_cc can be defined as

\hat{f} = \arg\max_{f} P(f \mid C_d, l, R)

Model II:
Our second model model_c is given the code before change C_d and the location l of the defect d as input, and is tasked to predict the code change f.


Table 2
Project-wise data distribution.
Project name        #CR      #Java CR   Train   Test
Acumos [29]         6773     1387       881     47
Android [30]        246253   23683      12512   689
Asterix [31]        68033    23058      8509    453
Cloudera [32]       151010   8623       3538    197
Couchbase [33]      68864    1347       808     45
Eclipse [22]        51919    16903      11612   621
Fd IO [34]          26281    866        612     34
Gerrithub [35]      116464   2102       1334    66
Googlereview [36]   141410   23857      13849   734
Iotivity [37]       61462    1286       847     48
Others [38-42]      10201    878        558     27
Total               948670   103990     55060   2961

Table 3
List of irrelevant review comments.
same as above, same as the above, same here, see comment above, same question here, perhaps this as well, see comment above, as discussed, new comment as above, same, see above, similar to above, same concern as above, same comment as above, and here, here too, same comments as above, same thing, same complaint here, same as below, nit, ditto, thanks, fixed with the next upload, uh no, nice, nice thanks, love it, 'ok, fixed with next update', 'yes, you are right', done, likewise, i see, and again

Hence, the prediction by model_c can be defined as

\hat{f} = \arg\max_{f} P(f \mid C_d, l)

By replacing the defect d in the defective code C_d with the fix suggestion f, we get the fixed code C_f:

C_f = C_d - d + f

Using beam search decoding [26], the developer is offered the top N fixed-code suggestions {C_{f_1}, ..., C_{f_N}} to choose from, where N ∈ ℕ.
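Read operationally, the composition above splices each beam candidate into the original token stream in place of the focus. The following minimal Python sketch illustrates this; the token-list representation and the helper names are illustrative assumptions, not the released implementation.

from typing import List

def apply_fix(code_before: List[str], focus_start: int, focus_end: int,
              fix_tokens: List[str]) -> List[str]:
    # Replace the focus span d inside C_d with a predicted fix f, yielding C_f.
    return code_before[:focus_start] + fix_tokens + code_before[focus_end:]

def top_n_fixed_codes(code_before: List[str], focus_start: int, focus_end: int,
                      beam_candidates: List[List[str]]) -> List[List[str]]:
    # Build the N candidate fixed codes {C_f1, ..., C_fN} offered to the author.
    return [apply_fix(code_before, focus_start, focus_end, f) for f in beam_candidates]

# Toy example: the focus spans token indices 2..7 (end-exclusive) of the buggy condition.
C_d = ["if", "(", "x", "!=", "null", "||", "true", ")", "{", "}"]
beam = [["x", "!=", "null"], ["x", "==", "null"]]
print(top_n_fixed_codes(C_d, 2, 7, beam)[0])   # ['if', '(', 'x', '!=', 'null', ')', '{', '}']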

4. Data preparation

In this section, we briefly describe the data collection, data preprocessing, and training and test set preparation for this study.

4.1. Data collection and cleaning

A learning-based automated code repair approach based on code review requires a large pool of review comments and the associated source code before and after the fix in the training dataset. We chose Gerrit [27] for collecting the data as it is a standard and widely used code review tool. We created a GerritMiner in Java using the Gerrit REST API [28]. We mined 28 publicly available projects that had a Gerrit repository for code review. Out of these, 15 projects (Table 2) were finally selected as they contained at least one Java file. We mined code review comments and associated code files submitted roughly from December 2008 to November 2019. The mining process took approximately 2.5 months on an Intel® Core™ i7-7700 processor.

We mined 1,068,536 code reviews in total from the 15 projects. To ensure that our model learns only meaningful changes, we carefully discard all code reviews that did not trigger any change within 5 lines before or after the line of code where the review comment was made. We also discard all follow-up conversations to a previous review, as identifying useful follow-up comments is beyond the scope of this paper. Following previous studies in the literature [24,25,43-45], we intend to work on program repair for Java code. Hence, for our experiments, we selected reviews corresponding to .java files only.

After these filtering steps, we manually examined around 3000 selected comments to check the quality of the dataset. We found that 1.32% of the inline comments in the filtered dataset were not relevant to any change, as shown in Table 3. We consider that such a small amount of noise will not cause significant problems for training our model.

4.2. Input representation

In this section, we discuss how raw source code files were formatted for the learning model.

Fig. 5. Line distance distribution between the corresponding line of a code review and the nearest code change.

4.2.1. Change localization:
To identify the exact location of changes in the code of the training data, we built a "Change Calculation" tool using Java DiffUtils [46]. The tool takes the code file before and after the change and calculates the differences between the two files. We consider that the code change closest to the code review location is a result of the code review. Bosu et al. [2] demonstrated that useful code review comments trigger a change close to the line where the comment was submitted. We refer to this line as review_line. To investigate whether this holds in our dataset, we observe the line difference between review_line and the location of the nearest code change. This is shown in Fig. 5. The distribution shows that 91.27% of the nearest changes are within a 5-line difference from the review_line. Hence, we consider each change starting within the window of 5 lines to develop training data and discard the samples corresponding to changes starting outside this window.
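To illustrate this filtering step, the sketch below finds the change nearest to a review_line and applies the 5-line window. The paper's tool is built on Java DiffUtils; this analogous sketch uses Python's difflib, and the function and variable names are hypothetical.

import difflib

def nearest_change_line(before: str, after: str, review_line: int, window: int = 5):
    # Starting line (1-based) of the change closest to review_line, or None if no
    # change starts within +/- window lines (such samples are discarded).
    matcher = difflib.SequenceMatcher(None, before.splitlines(), after.splitlines())
    change_starts = [i1 + 1
                     for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                     if tag != "equal"]
    in_window = [line for line in change_starts if abs(line - review_line) <= window]
    return min(in_window, key=lambda line: abs(line - review_line)) if in_window else None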
After deciding on the relevancy of a code change and review comment, we explicitly concentrate on the code change. We term the buggy source code the code before change, and the lines changed there the focus. We mark the focus with two special tokens, i.e., <|startfocus|> and <|endfocus|>. We call the fixed code the code after change and the changed portion within the focus the target. We elaborate on these terms with an example presented in Fig. 6. Observing the data, the model learns to change the content of the focus into the target, using the code review comment and the surrounding code as context.

As sequence-to-sequence networks are designed to transform one sequence into another, we formulated all three operations (i.e., insert, update, and delete) as an 'update' operation. Thus our model can learn across all types of changes uniformly and utilize the knowledge gained from one type of change for another. To be formulated as an update operation, a defective portion of the input stream is marked as the focus, and the modified portion is termed the target. We take different measures for identifying the focus and the target depending on whether the code change is an insert, delete, or update operation. Fig. 6 demonstrates these three measures.


We design our system so that in a production environment, a reviewer can select one or multiple lines of the code and submit a comment. The model will consider the selected lines as the focus and try to predict solutions for it. Replacing the focus with one of the predicted solutions is expected to generate syntactically, semantically, and stylistically correct code. The author will select a suitable one from the predicted solutions. Now, we discuss how we deal with the three types of edits; a small sketch of this mapping follows the list.

1. Insert: The challenge of posing an insert operation as an update operation is that in the input source code file there is no explicit focus to select, as completely new lines are being inserted between other lines. Therefore, for an insert operation, the line where the reviewer submits the code review is taken as the focus, and the inserted line(s) accompanied by the focus are considered the target. In this manner, by replacing the focus with the target at production time, we still get the corrected code. One limitation of this approach is that the model cannot make an insert operation in places non-adjacent to the line of the comment. Hence, reviewers are recommended to submit a review adjacent to the intended insertion. In our entire mined dataset, 86% of the insert operations already satisfied this requirement. Fig. 6 shows an example of an insert operation where the reviewer submits a comment on Line 5, suggesting the author add an else block. Accordingly, the author inserts an else block after line 5. In this case, we consider line 5 as the focus, and the inserted code, along with the selected line, is considered the target.
2. Delete: In a delete operation, the code inside the focus is no longer present in the changed commit. Hence, to indicate deletion, our model produces a special token <|del|> as the target. In Fig. 6, the reviewer selects Lines 6, 7, 8 and requests the author to delete the else block. Here Lines 6, 7, 8 are considered the focus, and the special token <|del|> is considered the target.
3. Update: The update operation is more straightforward, i.e., the lines in the original code that require change are considered the focus, and the corresponding changed lines are considered the target.
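The sketch below summarizes how the three edit types collapse into a single (focus, target) update pair, as just described; the data class and field names are assumptions made for illustration.

from dataclasses import dataclass
from typing import List

DEL_TOKEN = "<|del|>"

@dataclass
class UpdateExample:
    focus: List[str]    # lines wrapped by <|startfocus|> ... <|endfocus|> in the input
    target: List[str]   # lines the model is trained to produce

def to_update_pair(op: str, review_line: str, old_lines: List[str],
                   new_lines: List[str]) -> UpdateExample:
    if op == "insert":
        # The commented line becomes the focus; it is kept and the new lines follow it.
        return UpdateExample(focus=[review_line], target=[review_line] + new_lines)
    if op == "delete":
        # Deleted lines become the focus; the target is the special deletion token.
        return UpdateExample(focus=old_lines, target=[DEL_TOKEN])
    # "update": changed lines are the focus, their replacements are the target.
    return UpdateExample(focus=old_lines, target=new_lines)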
4.2.2. Code review aware tokenization:
The commonly used tokenization method (referred to as soft tokenization in this paper) applied in [24,25,43,44] does not include whitespace tokenization or identifier splitting. In a shared programming environment, maintaining a consistent style is an essential task for the programmers. Therefore, we propose a tokenization method (named hard tokenization). The hard tokenization method has two unique features, as discussed below; a small sketch of both follows the list.

1. Whitespace Tokenization: Programmers frequently use consecutive whitespaces (tabs and spaces) to indent their code. As we want to preserve these coding styles, we need to consider the whitespaces in our model. However, considering each whitespace character as an individual token would significantly increase the input stream and affect the model's learning process. Therefore, we replace consecutive whitespaces with predefined tokens. For example, <s>, <2s>, <4s> and <16s> indicate 1, 2, 4 and 16 spaces, respectively. Similarly, <t>, <2t>, and <4t> indicate 1, 2 and 4 tabs, respectively. <n> indicates a newline. Any number of consecutive whitespaces can be minimally represented with a combination of our defined tokens; for example, 3 consecutive spaces are represented as <2s><s> in our model prediction. This method helps us preserve all whitespace information while significantly reducing the total number of tokens.
2. Splitting camelCase and snake_case Identifiers: Identifier names such as variable, function, or class names contain human language components that carry meaning about the identifier's functionality. Reviewers often make comments about the atomic components instead of the full identifier name. Hence, we split all camelCase and snake_case identifiers so our model can identify those atomic components in code and execute the instruction given in the review comment. An example of the splitting process is presented in Fig. 7. It also reduces the size of the vocabulary [47,48]. Our dataset's total number of unique tokens was reduced from 199,361 to 43,753 (a 78.05% reduction) due to identifier splitting.

We implement this using multiple tokenizers used in NLP (TweetTokenizer, WordPunctTokenizer, and MWETokenizer in the NLTK Library [49]), which allow us to tokenize both code and code review in the same format.
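A compact sketch of both hard-tokenization features is given below. It uses plain regular expressions instead of the NLTK tokenizers that the paper combines, and the function names are hypothetical.

import re

SPACE_TOKENS = [(16, "<16s>"), (4, "<4s>"), (2, "<2s>"), (1, "<s>")]   # tabs use <t>, <2t>, <4t>

def encode_spaces(run_length: int) -> str:
    # Greedily encode a run of consecutive spaces with the predefined whitespace tokens.
    out = []
    for size, token in SPACE_TOKENS:
        while run_length >= size:
            out.append(token)
            run_length -= size
    return "".join(out)

def split_identifier(name: str) -> list:
    # Split camelCase and snake_case identifiers into their atomic components.
    atoms = []
    for part in name.split("_"):
        atoms.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return atoms

print(encode_spaces(3))                          # -> <2s><s>
print(split_identifier("getHTTPResponseCode"))   # -> ['get', 'HTTP', 'Response', 'Code']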
4.2.3. Input sequence:
In our proposed design, an essential task is to provide the code change's surrounding context to the learning model. The ability to provide the right context helps the model understand the defect better, copy tokens from the surrounding code, reduce overfitting, and improve generalization. It also needs to consider the following goals:

1. Reduce code and code review into a reasonably concise sequence of tokens, as a sequence-to-sequence neural network suffers from a long input size.
2. Subsume as much useful information as possible to allow the model to capture the context better.

We feed a context of window size W to the model. The context consists of the focus and its surrounding tokens. We apply the following rules to generate the context for a review comment (a small sketch of this windowing is given at the end of this subsection).

1. If the focus is written within a function scope and the function is smaller than W tokens, we consider the entire function as input.
2. If the focus is inside a function scope and the function is larger than W tokens, we keep up to W tokens (up to W/2 tokens from the preceding part of the focus and W/2 tokens from the focus and the subsequent part) within that function scope as input.
3. If the focus is in the global scope, we follow a similar strategy, i.e., taking up to W/2 tokens from the preceding part of the focus and W/2 tokens from the focus and the subsequent part as input.

The seq2seq network structure we adopt (Section 5.1) commonly uses 400 to 800 tokens in the input sequence and 100 tokens as output for similar applications such as code summarization [50,51]. We limit the context window W of code to 400. The other element of the input sequence, i.e., the code review comment, is within 200 tokens in 98.725% of the cases in our dataset (Fig. 8(b)). Hence, we limit the comment size to 200 tokens and consider only the first 200 tokens of a comment if it exceeds this length. Thus, the input sequence reaches up to 600 tokens when comments are added to code. We empirically observed that longer sequences result in deteriorating performance. Fig. 8 also presents the distribution of focus length and target length.
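A minimal sketch of the windowing rules follows, assuming the file is already tokenized and the focus and enclosing-scope boundaries are known token indices; W and the helper name are illustrative.

from typing import List

def build_context(tokens: List[str], focus_start: int,
                  scope_start: int, scope_end: int, W: int = 400) -> List[str]:
    # Rule 1: if the enclosing scope fits within W tokens, keep it whole.
    if scope_end - scope_start <= W:
        return tokens[scope_start:scope_end]
    # Rules 2 and 3: up to W/2 tokens before the focus, and W/2 from the focus onward.
    start = max(scope_start, focus_start - W // 2)
    end = min(scope_end, start + W)
    return tokens[start:end]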
4.3. Test set generation

We created a standard test dataset to evaluate different models with different parameters and settings on the same ground. The dataset is suitable for testing the models with code review, without code review, and with hard and soft tokenization. We removed all duplicate data points from the dataset and sorted our data in reverse chronological order of the comment submission time. Then we selected the most recent 5% of the data from each project.


Fig. 6. Change localization for insert, delete, and update operation.

This ensures that our reported performance represents our model's ability to predict future code changes by learning from past code change patterns. Also, by taking 5% of the data from each project, we ensure that the test set represents all projects in the dataset. The size of the test data collected from each project is shown in Table 2.
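The split can be sketched as follows, assuming each sample records its project and comment submission time; the field and function names are illustrative, not the released scripts.

from typing import Dict, List, Tuple

def chronological_test_split(samples: List[dict], test_fraction: float = 0.05) -> Tuple[list, list]:
    # Deduplicate, sort each project newest-first, and hold out the most recent 5% as test data.
    unique = {(s["code_before"], s["review"], s["code_after"]): s for s in samples}.values()
    by_project: Dict[str, List[dict]] = {}
    for s in unique:
        by_project.setdefault(s["project"], []).append(s)
    train, test = [], []
    for project_samples in by_project.values():
        project_samples.sort(key=lambda s: s["submitted_at"], reverse=True)
        cut = max(1, int(len(project_samples) * test_fraction))
        test.extend(project_samples[:cut])
        train.extend(project_samples[cut:])
    return train, test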
5. Proposed neural network architecture

In this section, we discuss the Neural Machine Translation (NMT) model that learns code transformations both with and without peer code review. We also describe the changes to the NMT model needed to incorporate both programming language and natural language in the same input stream.

5.1. Pointer generator architecture

We train a pointer generator network as implemented by Gehrmann et al. [23], available in the OpenNMT PyTorch distribution [52], which is a state-of-the-art sequence-to-sequence (seq2seq) architecture for text summarization. Seq2seq networks used for summarization can generate the desired output text by deriving and summarizing information from the most relevant parts of the input (i.e., code and review comment). This architecture's ability to generate desired text/code is already established in the literature [24,25,44]. However, those works applied the seq2seq model only to source code, whereas we have adopted both the source code and natural language code reviews. For this purpose, we incorporate a custom vocabulary as detailed later.

We apply an LSTM [53] as the base RNN with an attention mechanism [54] and a copy mechanism [55] for handling out-of-vocabulary (OOV) tokens. In the original implementation of the pointer generator network, a coverage mechanism [56] is used to limit the repetition of tokens in the network output. However, programming language keywords repeat frequently. Hence, the use of a coverage mechanism hurts program repair (Table 15). Therefore, we exclude it from our recommended model presented in the next subsections.

5.2. Open vocabulary problem

Vocabulary size is one of the primary factors that prevents us from applying natural language models to programming languages directly, due to the unbounded open vocabulary (identifiers and strings) and arbitrary identifier names in programming languages [48,57]. One solution to this problem is to increase the vocabulary size and train a model with a large softmax layer. However, some open-vocabulary tokens are rarely used or show multiple and conflicting usages (any open-vocabulary token can replace another if refactored correctly). The model faces difficulties generalizing the task well using such tokens. In those cases, an abstract version (i.e., <unk>) may be more useful in the learning process. However, we need to recover the exact tokens to produce the correct patch for providing fix suggestions. The copy mechanism [55] is one of the solutions proposed by prior researchers [24] to address this problem. It utilizes the given context and recovers the unknown tokens. Note that most of the unknown tokens have to be identifiers or parts of strings (all common programming languages, e.g., C, C++, Java, and Python, have fewer than 100 closed-vocabulary tokens: keywords, operators, and delimiters). Open tokens are replaceable by other open tokens. The copy mechanism can track down the expected open token from the context and discard other probable tokens for identifiers and strings, thereby effectively solving the open vocabulary problem.

5.3. Augmented vocabulary

Our model uses separate vocabularies for source and target because it has to encode information from both code and natural language code review comments but generate only code tokens.


Fig. 7. Demonstration of the hard and soft tokenization methods. Hard tokenization splits tokens into atomic natural language units and treats whitespace groups as special tokens.

Fig. 8. Token size distribution in our dataset. Our token size limits cover the majority of our dataset.

Moreover, the code review has a less dominant presence in our input sequence: the average lengths of a code review and of source code in our dataset are 36.80 and 320.38 tokens, respectively. If we apply the standard procedure for creating the vocabulary followed in the literature [24,25,43,44,50,54,58-61], most code review tokens will be considered out-of-vocabulary (OOV) tokens. Hence, our model will fail to gain a contextual understanding of the code review instructions, even with a copy mechanism [55]. To combat this issue, we propose a larger vocabulary for comments compared to code segments. We consider various combinations of vocabulary size, as discussed in Section 9. Similar to SequenceR [24], we find that a large code vocabulary hurts model performance. We experimented with different combinations of vocabulary sizes (see Section 9) and found that the best configuration contains the most frequent 2,000 tokens from the code and 8,000 tokens from the comments. They cover 93.56% and 98.86% of the total source code and code review comment tokens, respectively.
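As a rough illustration of how such a vocabulary can be assembled, the sketch below counts code and comment tokens separately and keeps the most frequent 2,000 and 8,000 respectively; the function name and input format are assumptions for illustration.

from collections import Counter
from typing import Iterable, List, Tuple

def build_vocab(code_token_streams: Iterable[List[str]],
                comment_token_streams: Iterable[List[str]],
                code_size: int = 2000, comment_size: int = 8000) -> Tuple[list, list]:
    # Most frequent code tokens form the target vocab; the source vocab adds comment tokens.
    code_counts = Counter(t for stream in code_token_streams for t in stream)
    comment_counts = Counter(t for stream in comment_token_streams for t in stream)
    code_vocab = [t for t, _ in code_counts.most_common(code_size)]
    comment_vocab = [t for t, _ in comment_counts.most_common(comment_size)]
    code_set = set(code_vocab)
    source_vocab = code_vocab + [t for t in comment_vocab if t not in code_set]
    target_vocab = code_vocab          # the decoder only generates code tokens
    return source_vocab, target_vocab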
as an input. This model is termed as 𝑚𝑜𝑑𝑒𝑙_𝑐𝑐. To separate review
To combat this issue, we propose a larger vocabulary for comments
comment and code from each other, we wrap the review comments
compared to code segments. We consider various combinations of
with two special tokens <|startcomment|>| and <|endcomment|>| and
vocabulary size as discussed in Section 9. Similar to SequenceR [24], the code with special tokens <|startcode|>| and <|endcode|>|. Finally,
we find that large code vocabulary affects model performance. We have we concatenate them to produce a single input stream. As discussed
experimented with different combinations of vocab sizes (see Section 9) earlier in Section 4.2.3, the code review and code are limited to 200
and found the best configuration containing the most frequent 2,000 tokens and 400 tokens, respectively. Thus the network has an input
tokens from the codes and 8,000 tokens from the comments. They cover size of 600. The input vocabulary of baseline 𝑚𝑜𝑑𝑒𝑙_𝑐𝑐 contains 10,000


The input vocabulary of the baseline model_cc contains 10,000 tokens; 2,000 are from code and 8,000 are from review comments, as described in Section 5.3. The output of model_cc has a maximum length of 100 tokens and a vocabulary of 2,000 tokens.
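A minimal sketch of assembling one model_cc input stream from the two token sequences is shown below; the function name and the simple truncation behavior are illustrative assumptions.

from typing import List

def build_model_cc_input(comment_tokens: List[str], code_tokens: List[str],
                         max_comment: int = 200, max_code: int = 400) -> List[str]:
    # Wrap the review comment and the code with their special tokens and concatenate.
    comment = comment_tokens[:max_comment]
    code = code_tokens[:max_code]
    return (["<|startcomment|>"] + comment + ["<|endcomment|>"]
            + ["<|startcode|>"] + code + ["<|endcode|>"])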
5.5. Training with code change only (model_c)

We create a second baseline model with the pointer generator network [23] that predicts the code change by observing only the code before the change. This network is termed model_c. Since this model does not consider the review comment, it deals with a smaller input vocabulary and shorter input sequences than model_cc. Specifically, the network's input code and output are limited to 400 tokens and 100 tokens, respectively. The most frequent 2,000 code tokens in the training dataset are considered for both the input and output vocabulary of model_c. The output of model_c is identical to that of model_cc.

5.6. Inference and detokenization

After training, we use the trained model to generate suggestions. During inference, we prepare our input following hard tokenization, as discussed in Section 4.2.3. We use beam search decoding [26] to generate multiple possible suggestions, similar to previous works [24,25,43]. We generate our target patches by detokenizing the suggestions from the model. Our hard tokenization method prevents any information loss. Thus, the source code can be reproduced trivially, preserving whitespace, indentation, and coding style from the token stream.

6. Experimental setup

This section describes our neural network model's specific implementation details, evaluation criteria, and the comparison method for state-of-the-art models.

6.1. Evaluation criteria

We evaluate each of our models on the standard test set T described in Section 4.3. For each t ∈ T we perform inference with beam search decoding [26] with beam size k = 10, which is commonly used in the literature [43]; our empirical findings also support this choice. We measure the Top-1 accuracy, i.e., the percentage of fixes that our model predicts as the top-most suggestion, the Top-5 accuracy, i.e., the percentage of fixes that it predicts as one of the first five suggestions, and similarly the Top-10 accuracy.

We further manually analyzed the predictions made by the models and evaluated their quality for different types of code changes (Appendix).
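The Top-k metric reduces to an exact-match check over the beam, as in the sketch below; the data layout is an assumption for illustration.

from typing import List

def top_k_accuracy(beam_outputs: List[List[str]], references: List[str], k: int) -> float:
    # Percentage of test samples whose reference fix appears among the first k suggestions.
    hits = sum(1 for suggestions, reference in zip(beam_outputs, references)
               if reference in suggestions[:k])
    return 100.0 * hits / len(references)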
6.2. Network parameters

We experimented with different parameter settings and justify the choices in an ablation study (Section 9). The best performance is obtained with the following model architecture.

• Input Embedding: 2002 × 256 (model_c), 10002 × 256 (model_cc); 2,000 and 10,000 vocabulary plus 2 special tokens each
• Input sequence length: 400 (model_c), 600 (model_cc)
• Output sequence length: 100 (both model_c and model_cc)
• Encoder Bidirectional LSTM size: 256 × 128 × 2
• Bridge between Encoder and Decoder: 128 × 128 × 2 + 128 × 2
• Decoder LSTM size: 512 × 256
• Global Attention: 256 × 256 × 3 + 512 × 256 + 256 × 2
• Token Generator Decoder: 2000 × 256
• Copy Generator: 256 × 2000 + 2000 + 256 × 1 + 1
• Coverage Attention: False
• Beam size during inference: 10

6.3. Hardware and training time

We trained our models on an NVIDIA® V100 Tensor Core GPU with 16 GB VRAM, 16 GB RAM, and an eight-core CPU on Google Cloud Platform. Training each sequence-to-sequence model up to 80,000 training steps took nearly 72 h.

6.4. Comparison with state-of-the-art models

In this section, we discuss the methodology for comparing our models with two recent comparable works [24,25].

Tufano et al. [25] create two different neural machine translation models, one for functions with fewer than 50 tokens and the other for functions with between 50 and 100 tokens. The dataset for their study is collected from three large code repositories: Android [30], Google Source [36], and Ovirt [62]. One of these (Android) is common with our dataset (Table 2). Since our model requires review comments, it is infeasible to use the exact dataset proposed by them [25]. Therefore, we decided to use test data extracted from the Android project only for a fair comparison. We created two code change test datasets from the Android project with two settings of token size, as mentioned below:

1. Test_small: containing 292 instances where the token count is 0 < token_count ≤ 50;
2. Test_medium: containing 246 instances where the token count is 50 < token_count ≤ 100.

Both of these test datasets contain only functions with a single code review comment and a single code change. These data points are carefully removed from our training dataset. We reproduce the two models proposed by Tufano et al. [25] with the data and source code released by the authors [63] and achieve nearly identical validation results to those reported in their paper. We replaced identifier names and kept the mappings to reduce the vocabularies, exactly following [25]. After validating their model, we compare our approach with theirs by training their model with our dataset. The results comparing Tufano et al. and our models on the Test_small and Test_medium datasets are shown in Table 5.

SequenceR [24] performs single-line update operations inside functions with defects, given the code of the function and the line number of the defect. Their model cannot handle insert, delete, or multi-line operations, or defects outside of function scope (i.e., in comments and global data). To make a comparison with SequenceR [24], we selected 349 instances from our standard test dataset (Section 4.3) that comprise single-line update operations only. We implemented the SequenceR preprocessing, training, and test pipeline with the help of their source code [64] and achieved performance similar to that reported in their paper. We created two different implementations of SequenceR after validating the original work. The first model was trained with the 35,578 training data provided with their paper and is termed SequenceR_original. The second model is trained with training data collected from our mined data that satisfy the SequenceR dataset constraints. We trained it with 56,000 data points to make a fair comparison with our model. This implementation of SequenceR is referred to as SequenceR_new. We test both these models and our best models with and without code review on the prepared test set of 349 data points. The comparison is shown in Table 6.

7. Evaluation and results

In this section, we discuss the experimental results and findings of our research.
• Beam size during inference: 10 our research.


Table 5
Comparison of Tufano et al. [25] and our models.
Model                Top-n prediction   Test_small (292)   Test_medium (246)
Tufano et al. [25]   1                  2 (0.68%)          1 (0.41%)
                     5                  6 (2.05%)          3 (1.22%)
                     10                 7 (2.40%)          4 (1.63%)
model_c              1                  21 (7.19%)         11 (4.47%)
                     5                  52 (17.80%)        38 (15.44%)
                     10                 80 (27.40%)        46 (18.69%)
model_cc             1                  31 (10.61%)        24 (9.76%)
                     5                  71 (24.31%)        55 (22.36%)
                     10                 93 (31.85%)        63 (25.61%)

Fig. 9. Top-1, Top-5, and Top-10 test accuracy of the model trained without code review (c) and with code review (cc), for both the hard and soft tokenization methods.

Table 4
Baseline model accuracy (in percent) for model_c and model_cc with hard tokenization, and relative improvement of model_cc over model_c.
Model                  Top-1    Top-5    Top-10
Baseline model_c       16.29    20.94    23.37
Baseline model_cc      19.59    27.73    31.51
Relative improvement   +20.33   +32.41   +34.82

Table 6
Comparison with SequenceR [24].
Model                Top 1    Top 5    Top 10
SequenceR_original   1.27%    1.52%    2.02%
SequenceR_new        3.03%    6.58%    7.34%
model_c              3.54%    12.91%   16.96%
model_cc             8.86%    18.48%   25.31%

7.1. How effective is code review in automatic code repair?

In this study, we aim to show whether code review can improve automatic code repair performance. We train our model with two different settings, with and without code review, termed model_cc and model_c, respectively. The construction of these models is discussed in Section 5.1.

Fig. 9 and Table 4 clearly show that incorporating the code review comments improves the prediction accuracy for both the hard and soft tokenization methods in all of the Top-1, Top-5, and Top-10 predictions. Since hard tokenization has better Top-1 accuracy for both model_cc and model_c, we decided to use it for further analysis.

7.2. How effectively does the model perform in comparison with state-of-the-art techniques?

We aim to evaluate our model on a benchmark against well-established approaches [24,25]. To ensure that we replicate the exact settings used by previous architectures, we generate separate test cases for comparing our models, as described in Section 4.3.

7.2.1. Comparing with the methodology proposed by Tufano et al. [25]
As discussed earlier in Section 6.4, we show the comparison between Tufano et al. [25] and our models on Test_small and Test_medium in Table 5. The results show that both model_c and model_cc outperform Tufano et al. [25] on both test sets.

7.2.2. Comparing with the methodology proposed by Chen et al. [24]
We evaluate two different implementations of SequenceR, as mentioned in Section 6.4. First, we apply the SequenceR model released by the authors and evaluate it on the test data (named SequenceR_original). To make a fair comparison, we also create a training dataset for SequenceR from our corpus and train SequenceR again (named SequenceR_new). Both of the results are displayed in Table 6. We can see that the original SequenceR model performs very poorly on our test set. This poor performance is attributed to the difference in the vocabulary of the training and test datasets. Our models perform significantly better than SequenceR_new. We can see that the Top-1 prediction of our model_c is comparable to the performance of SequenceR_new. However, the Top-1 prediction of model_cc is significantly better because of the addition of code review comments.

Fig. 10. Manually created taxonomy for 501 randomly sampled data from the test set.

7.3. Which types of changes can our models correctly predict?

We expect our models to suggest fixes for all types of issues reported in code reviews. To inspect the ability to address different types of issues in real scenarios, we conducted a study. We randomly selected 501 samples from our test set. We present the study outcome on the fixes generated by models model_c and model_cc.

Two authors performed the manual categorization of the different code reviews. To begin with, they jointly labeled 100 samples by discussing each review. The purpose was to develop a shared understanding and to remove individual bias as much as possible. Based on this understanding, they labeled 100 more samples independently. The Cohen's kappa [65,66] value is 0.64, which indicates substantial agreement between them. Furthermore, the two authors discussed the reviews where disagreements occurred with the other authors and converged to a common ground. Next, the remaining 400 samples were labeled equally by the two authors independently.

We categorized the possible code changes into four major classes: 1) Bug fix [24,43,44], 2) Refactoring [25], 3) Stylistic change (changes related to indentation and formatting), and 4) Non-code change (changes in documentation and annotation).

Table 7
An example of bug-fix successfully generated by our 𝑚𝑜𝑑𝑒𝑙_𝑐𝑐. Deleted tokens are marked by red color.

Table 8
An example of refactoring successfully generated by our 𝑚𝑜𝑑𝑒𝑙_𝑐𝑐. Deleted tokens are marked by red color.

Although CodeBuff [67] and Naturalize [68] fix formatting errors by exploiting the style and coding conventions present in the existing code repository, none of them consider the reviewer's intention and act upon it. Fig. 10 presents the distribution of the major classes. We show how our models model_c and model_cc perform for changes of different categories in the Appendix when we consider Top-10 accuracy. We illustrate some of the examples successfully generated by model_cc for the sub-categories of each major class.

7.3.1. Bug fix
This category consists of code changes that are necessary to overcome a system glitch, incorrect output, or unwanted behavior [69]. We observe a total of 14 sub-categories under Bug fix. In the successful case illustrated in Table 7, from the project GoogleReview, we see the reviewer gives a high-level argument for why the exception should be thrown conditionally. Our model successfully generates the target code as specified by the reviewer. For Bug fix, we achieved 13.3% Top-10 accuracy. We also observed that bug fixes constitute 25.3% of all the changes in our dataset.

7.3.2. Refactoring
Refactoring includes code changes intended for code maintenance, which do not change the external behavior [70] of the system. We observe a total of 23 sub-categories under Refactoring. We illustrate a successful sample generated by model_cc from the project Android in Table 8. For Refactoring, we achieved 32.4% Top-10 accuracy. We also observed that refactoring constitutes 40.6% of all the changes in our dataset.

7.3.3. Stylistic change
Code changes required to ensure proper indentation and formatting, such as newline insertion, tab spacing, and whitespace addition/deletion, are considered under this category [71]. We illustrate this with two successful test samples in Table 9. In the first example, the reviewer emphasizes adding whitespace before BluetoothDevice.PHY_LE_2M. Our model generates a correct solution by adding a whitespace token between != and BluetoothDevice.PHY_LE_2M. Similarly, the second example breaks a long line into two lines as the reviewer commented. For Stylistic change, we achieved 52.4% Top-10 accuracy. We also observed that stylistic changes constitute 17.3% of all the changes in our dataset.

7.3.4. Non-code change
We consider changes in non-code regions such as string values, logs, code comments, documentation, annotations, and copyright license headers under this category [72,73]. We analyze an example of this category in Table 10, where the reviewer mentions '2017' as the appropriate copyright license year for the file. Our model was able to capture the domain-specific context and generate the intended target in this case. Similarly, the second example removes an unnecessary annotation. For non-code changes, we achieved 24.2% Top-10 accuracy. We also observed that non-code changes constitute 16.8% of all the changes in our dataset.

8. Discussion

8.1. Is our system human-competitive?

According to [74], for a generated code change to be human-competitive: 1) the system has to synthesize the code change faster than the human developer, and 2) the changed code has to be judged good enough by the human developer and permanently merged into the code base. To generate the Top-10 suggestions for a code, our system takes around 0.36–4.8 s, which is significantly faster than a human developer. To judge the quality of our generated code, we designed a study with four professional software developers working in a software company specialized in the development of automatic code fix suggestions. We prepared two separate sets of samples and distributed them to the developers.


Table 9
Two examples of stylistic change successfully generated by our model_cc. The red highlighted portion indicates the code region where a space was added between the tokens, and the green highlighted portion indicates the code region where a newline was added.

Table 10
Two examples of non-code change successfully generated by our model_cc. Deleted and inserted tokens are marked in red and green, respectively.

Table 11
Survey criteria (each rated on a scale of 3).
Correctness: whether the suggestion addresses the issue raised in the code review.
  1 - Low (it is completely incorrect)
  2 - Medium (no severe issue is noticed, but reasonable effort/intervention is needed to reach a correct solution)
  3 - High (it addresses the issue satisfactorily and provides a workable solution)
Readability: whether the suggestion is easy to understand.
  1 - Low (it is NOT readable)
  2 - Medium (no severe issue is noticed, but it is not easily readable)
  3 - High (it is satisfactorily readable)
Commit time: given the suggestion, how much time would be needed to commit the final code to the original codebase.
  1 - Short time (the final code can be committed in under 5 min, with little or no change required)
  2 - Medium time (the final code can be committed within 5 to 10 min after making the required changes)
  3 - Long time (the final code can be committed only after manual changes requiring over 10 min)

In the first survey, we randomly selected ten instances from our test set where our tool provides suggestions that match the actual solution. We asked the professional human evaluators to comment on the following three criteria about the suggestions generated by our model (Table 11).

In most cases, the developers felt that the proposed solutions are correct (50%–90% rated highly correct), readable (50%–90% rated highly readable), and can be committed in less than five minutes (60%–100%). This result indicates that even though the models are proposing the right solutions, the human evaluators still fail to recognize all the correct solutions. To some extent, it indicates that for a new developer in a new setting, it is not easy to address all the review comments, and for some of them, our tool can remove a significant burden from the developers (see Tables 12 and 13).

In many cases, our tool fails to generate the exactly correct solution. Does it mean that in all the failed cases the model is useless? To evaluate our tool's performance in such failed cases, we randomly chose 20 samples from our test set for which our tool failed to generate the exact solution and distributed them among the developers. We also provided the correct solution and made an additional query on closeness (similarity of the suggestion to the correct solution).

Table 12
Survey results when developers were given correct solutions generated by our system.
Criteria      Set 1                        Set 2
              Developer 1    Developer 2   Developer 3    Developer 4
Correctness   Low: 2         Low: 2        Low: 1         Low: 2
              Medium: 1      Medium: 1     Medium: 0      Medium: 3
              High: 7        High: 7       High: 9        High: 5
Readability   Low: 2         Low: 0        Low: 1         Low: 2
              Medium: 2      Medium: 1     Medium: 0      Medium: 3
              High: 6        High: 9       High: 9        High: 5
Commit time   <5 min: 8      <5 min: 10    <5 min: 9      <5 min: 6
              5–10 min: 2    5–10 min: 0   5–10 min: 1    5–10 min: 2
              >10 min: 0     >10 min: 0    >10 min: 0     >10 min: 2
vided the correct solution and made an additional query on closeness


Table 13
Survey results when developers were given incorrect solutions generated by our system and actual solutions taken from the original codebase.
Criteria      Set 1                        Set 2
              Developer 1    Developer 2   Developer 3    Developer 4
Correctness   Low: 19        Low: 13       Low: 18        Low: 13
              Medium: 1      Medium: 7     Medium: 1      Medium: 4
              High: 0        High: 0       High: 1        High: 3
Readability   Low: 13        Low: 4        Low: 8         Low: 12
              Medium: 6      Medium: 14    Medium: 8      Medium: 5
              High: 1        High: 2       High: 4        High: 3
Commit time   <5 min: 18     <5 min: 17    <5 min: 1      <5 min: 3
              5–10 min: 2    5–10 min: 3   5–10 min: 8    5–10 min: 5
              >10 min: 0     >10 min: 0    >10 min: 11    >10 min: 12
Closeness     Low: 19        Low: 14       Low: 18        Low: 14
              Medium: 1      Medium: 5     Medium: 1      Medium: 3
              High: 0        High: 1       High: 1        High: 3

Fig. 11. Syntactic and stylistic correctness of the code fixes synthesized by our system.

our final models as baselines in the table (please see Section 6.2 for the
details about the hyperparameters of these two models). We have two
(similarity of the suggestion to the correct solution). We observed that major modifications in our baselines including: customized vocabulary
though the model does not produce correct solutions, it provides some (Section 5.3) and exclusion of coverage mechanism (Section 5.1). To
useful hints to the code authors. For Set 1, the expected commit time analyze the effect of the coverage mechanism we train a model
is less than 5 min for almost 90% of the samples. For Set 2, we also with coverage mechanism enabled. The result shows that performance
found that the human validator rates some examples as highly correct drops after enabling coverage mechanism (ID8). This might be because
because the differences between the model’s outputs and real solutions coverage penalizes repetition of tokens in output, whereas programs
are minor, i.e., related to formatting or redundant parentheses. usually have repetitions. Without custom vocabulary, the model’s per-
formance also decreases (ID3). We consider the review comment and
8.2. Evaluation based on dataset size code separately in our custom vocabulary and take the most frequent
tokens from them separately. Whereas in ID3, review comment and
To assess the size of project history required to achieve an ex- code vocabularies are merged and most frequent tokens are taken from
pected performance of the model, we selected two projects with the them jointly. The accuracy decreases in ID3 because the programming
longest histories (Eclipse and GoogleReview), trained the model on language tokens are larger in frequency, so the natural language tokens
the first 5000 data and tested them on the samples 5001 to 5500. are not much prevalent in the vocabulary of ID3. In Table 15, ID5 and
Next we trained the model on the first 10,000 data and test them ID12 show that a smaller vocabulary than baselines perform worse. ID4
on samples 10,001 to 10,500 and so on. We found that the accuracy and ID11 show that a larger vocabulary performs poorly as well. ID6
indeed increased with the number of instances in most cases, but not and ID12 show that a smaller embedding size than 256 performs worse.
very significantly. Even with 5000 training data, we achieved 15%– ID7 shows that larger vocabulary and smaller embedding size together
20% Top-1 accuracy, which is close to the accuracy we found after also reduces the performance. Thus, it is clear that our baseline models
training with all the training samples from a particular project (see adopt the best choice of vocabulary and embedding size.
Table 14). Therefore, 5000 samples are enough if we train and test
8.3. Syntactic and stylistic correctness

In this section, we present our findings on the syntactic and stylistic correctness of the generated code. To conduct this study, we considered 100 samples generated by our model and performed a compilation test in IntelliJ. We found that 86 out of 100 samples were compilable. To assess stylistic/code formatting quality, the first and the third authors manually investigated 100 samples of generated code. Note that both of them have 4+ years of experience in writing Java code, and one of them has around one year of experience as a professional Java developer. They followed the relevant recommendations of [75] as guidelines to assess stylistic quality. With near-perfect agreement (Cohen's Kappa score of 0.88 [65]) between them, they found that 94 out of 100 samples were stylistically correct (see Fig. 11).
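For reference, the inter-rater agreement reported above can be computed directly from the two annotators' per-sample verdicts. A minimal sketch using scikit-learn follows; the verdict lists below are fabricated for illustration and are not the actual annotation data.

    from sklearn.metrics import cohen_kappa_score

    # Binary verdicts (1 = stylistically correct, 0 = not) from two annotators
    # over the same 100 generated samples; these values are made up.
    rater_1 = [1] * 95 + [0] * 5
    rater_2 = [1] * 93 + [0, 1] + [0] * 5

    print(f"Cohen's kappa: {cohen_kappa_score(rater_1, rater_2):.2f}")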
9. Ablation study

Our primary motivation in this study is to show that code review comments, as supportive information for code repair, help improve the repair performance as shown in 4. In this section, we perform an ablation study to understand the relative importance of each design choice for our model. We show these results in Table 15. We define […] usually have repetitions. Without custom vocabulary, the model's performance also decreases (ID3). We consider the review comment and code separately in our custom vocabulary and take the most frequent tokens from them separately, whereas in ID3, the review comment and code vocabularies are merged and the most frequent tokens are taken from them jointly. The accuracy decreases in ID3 because the programming language tokens are larger in frequency, so the natural language tokens are not much prevalent in the vocabulary of ID3. In Table 15, ID5 and ID12 show that a smaller vocabulary than the baselines performs worse. ID4 and ID11 show that a larger vocabulary performs poorly as well. ID6 and ID12 show that a smaller embedding size than 256 performs worse. ID7 shows that a larger vocabulary and a smaller embedding size together also reduce the performance. Thus, it is clear that our baseline models adopt the best choice of vocabulary and embedding size.
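The vocabulary variant compared in ID3 can be illustrated with a small sketch: keeping separate top-k budgets for code tokens and review-comment tokens guarantees that natural language tokens survive, whereas a single jointly ranked budget is dominated by the far more frequent code tokens. The function names, token budgets, and data are illustrative only, not the exact sizes used in our models.

    from collections import Counter

    def separate_vocab(code_tokens, review_tokens, code_k, review_k):
        # Separate budgets (as in our custom vocabulary): top-k code tokens
        # plus top-k review-comment tokens.
        code_top = [t for t, _ in Counter(code_tokens).most_common(code_k)]
        review_top = [t for t, _ in Counter(review_tokens).most_common(review_k)]
        return set(code_top) | set(review_top)

    def merged_vocab(code_tokens, review_tokens, k):
        # The ID3 variant: one budget over the concatenated streams, so
        # high-frequency code tokens crowd out natural language tokens.
        return {t for t, _ in Counter(code_tokens + review_tokens).most_common(k)}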


Table 14
Accuracy of two individual projects (Eclipse and GoogleReview) for different train and test splits.

Project name     Dataset size              Top-1 accuracy    Top-5 accuracy    Top-10 accuracy
Eclipse          5000 train, 500 test      15.8%             21.4%             23.2%
Eclipse          10,000 train, 500 test    17.8%             27.4%             29.6%
GoogleReview     5000 train, 500 test      19.8%             26.0%             29.0%
GoogleReview     10,000 train, 500 test    21.6%             31.4%             34.4%
GoogleReview     14,000 train, 500 test    19.2%             28.2%             33.2%

Table 15
Comparison with different network parameters and properties.

Model       ID    Modified property                                                      Top-1     Top-5     Top-10
model_cc    1     Baseline model_cc                                                      –         –         –
model_cc    2     Soft tokenization                                                      −3.03     +5.33     +11.22
model_cc    3     Without custom vocabulary selection                                    −3.10     −12.91    −16.82
model_cc    4     Larger vocabulary (10k from code, 10k from CR)                         −3.44     −9.74     −11.89
model_cc    5     Smaller vocabulary (1k from code, 5k from CR)                          −7.06     −3.89     −1.82
model_cc    6     Smaller embedding size (128)                                           −4.65     −2.8      −1.82
model_cc    7     Larger vocabulary (20k) and smaller embedding size (128)               −1.66     −3.28     −4.12
model_cc    8     With coverage mechanism                                                −2.93     −4.01     −2.46
model_c     9     Baseline model_c                                                       –         –         –
model_c     10    Soft tokenization                                                      −10.81    −4.2      −1.87
model_c     11    Larger vocabulary (10k from code)                                      +0.82     −3.7      −4.47
model_c     12    Smaller vocabulary (1k from code) and smaller embedding size (128)     −1.65     −2.25     −4.04

10. Threats to validity

In this section, we discuss possible threats that may affect our methodology and the measures taken to mitigate them.

Internal Validity: Overlapping data in the training and test sets is a major problem in deep learning-based source code analysis. After the data had been mined from the repositories, we ensured that all <code before the change, code review, code fix> tuples in our dataset are unique. The training and test datasets were created after this deduplication process. Furthermore, the latest 5% of the data from each project was selected as the test data, so the training and test data come from different time periods. Thus, there is no overlap between training and test data.
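The deduplication and time-based split can be expressed as a small preprocessing step over the mined tuples. The sketch below assumes each mined record is a dictionary with code_before, review, code_fix, and project fields and that records are ordered oldest to newest; these names are illustrative assumptions, not the exact schema of our released dataset.

    def dedup_and_split(records, test_fraction=0.05):
        # Drop duplicate (code before, review, code fix) tuples.
        seen, unique = set(), []
        for r in records:
            key = (r["code_before"], r["review"], r["code_fix"])
            if key not in seen:
                seen.add(key)
                unique.append(r)

        # Hold out the chronologically latest fraction of each project as test data.
        by_project = {}
        for r in unique:
            by_project.setdefault(r["project"], []).append(r)

        train, test = [], []
        for rows in by_project.values():
            cut = int(len(rows) * (1 - test_fraction))
            train.extend(rows[:cut])
            test.extend(rows[cut:])
        return train, test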
There is a possibility that projects using other code review tools (e.g., ReviewBoard, GitHub pull-based reviews, and Phabricator) might make a difference in our tool's performance. However, we do not use any feature exclusive to Gerrit, and the basic workflow of most code review tools is very similar. Hence, we believe that this threat is minimal.

External Validity: Another threat to the validity of our project is the availability of the code review comments. Also, how will the models operate without the review comments? First of all, our model_c can operate on bugs without the review comments, and we observed that it achieved similar performance to other approaches on their dataset [24,25]. Note that all the learning-based tools were evaluated using an oracle (perfect) fault localizer to have a fair comparison. When applied to a real scenario, our model_c can also utilize fault localizers similar to those used by others. Therefore, not having the review comments does not make our tool inapplicable at all. Secondly, Table 2 shows that we have around 950K code reviews from 15 projects. Code review comments are prevalent, and all the code reviews need to be addressed before merging to the main codebase or abandoning the code. If we can reduce the developers' effort for a significant amount of the review requests, that will reduce software developers' burden.

Automatically finding the cases where one code review resulted in multiple changes, or multiple code reviews resulted in a single code change, is a non-trivial problem. We manually analyzed 100 Gerrit review comments and their associated changes. Table 16 shows that the majority of the code changes fall in the single code review, single code change category. Therefore, in our system, each code review comment and its nearest code change is considered a single data point. If a review comment results in multiple changes, we only consider the nearest one. Also, if multiple code reviews result in a single code change, our system will take the same code change with the two review comments and create two separate data points.

Table 16
Percentage of single and multiple code changes with respect to single and multiple code reviews.

Category                 Single code change    Multiple code change
Single code review       92                    7
Multiple code review     1                     0
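The pairing rule above can be sketched as follows: each review comment is matched to the code change closest to its commented line, so a comment tied to several changes contributes one data point, while several comments on the same change contribute one data point each. The record fields are illustrative assumptions, not the exact representation used by our mining scripts.

    def pair_reviews_with_changes(review_comments, code_changes):
        # One (review, change) data point per review comment, using the change
        # whose location is nearest to the commented line.
        if not code_changes:
            return []
        datapoints = []
        for review in review_comments:
            nearest = min(code_changes,
                          key=lambda change: abs(change["line"] - review["line"]))
            datapoints.append({"review": review["text"], "change": nearest})
        return datapoints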
11. Related work

Automatic Program Repair (APR) has been an active area of research for a long time. In recent years, large companies such as Facebook have started using such tools in the production environment [61]. Along with classic approaches of dynamic and static analysis-based repair, Machine Learning-based techniques are also showing immense promise.

Classic Automatic Program Repair (APR): Researchers have been trying to automatically repair software systems by generating an actual fix for more than two decades [9]. Genprog [76–78] is a Genetic Algorithm-based automated repair technique using test suites. Arcuri [79], Debroy and Wong [80], and Kern and Esparza [81] proposed mutation-based repair techniques. SemFix [10] and Angelix [82] attempted automated program repair based on symbolic execution. The PAR system [83] automatically fixed Java code with ten repair templates. Könighofer and Bloem [84] considered assertions in programs and used an SMT solver to fill holes in repair templates. Samimi et al. [85] repaired PHP programs using correctly generated HTML format. Liu et al. [86] parameterized a manually written bug report and extracted the necessary values from the report to repair the program. Many program repair methods used a reference implementation for repair [87,88]. Many methods have been tried in the literature for program repair, whether driven by test suites or formal restrictions. However, before the recent advancement in Deep Learning and Natural Language Processing, using unstructured natural language text for program repair was an unthinkable concept.

Machine Learning-based Automatic Program Repair: Applying a deep learning-based approach to detect and fix bugs has shown promising results in recent years, mainly due to the availability of large datasets. Pu et al. [89] proposed a seq2seq network for automatic program correction in MOOCs. Hata et al. [90] performed automatic patch generation with neural machine translation. ENCORE [45] used an ensemble of multiple Convolutional Neural Network [91] based Neural Machine Translation models to improve the performance of deep learning-based APR.

In two consecutive papers, Tufano et al. [25,43] empirically demonstrated the applicability of NMT for program repair. They released a test dataset named Bugs2Fix and achieved 9% accuracy on it [43]. They also applied the seq2seq model with the attention mechanism [54] for repairing Java functions that are less than 50 tokens or 50 to 100 tokens long after applying their proposed [25] tokenization method.

SequenceR [24] proposed an abstraction method to capture contexts from the source code. Their model can fix in-line bugs inside functions with 20% accuracy on the Bugs2Fix dataset. They also demonstrated how the copy mechanism [24] can be used to solve the infinite vocabulary problem [57] for program repair. Our study has a much broader context than both Tufano et al. [25,43] and SequenceR [24] since we can perform multiline changes inside/outside functions and can handle functions of any size.


Getafix [61] is a tool developed and internally used by Facebook. It first splits a given set of recurring example fixes into AST-level edits. By applying the agglomerative clustering technique, this algorithm produces a hierarchy of fix patterns where the child nodes produce the most specific fixes; the higher a node is in the hierarchy, the more generalized the pattern becomes. Finally, given a bug under fix, it finds the most suitable fix patterns from that hierarchy, ranks candidate fixes, and suggests the top-ranked fixes to developers.

CODIT [44] models code changes with tree-based machine translation. Instead of the source code, they work on the underlying syntax tree of the code. They divide the task of predicting code changes into two steps: first, they learn and predict the edited code's syntax tree; second, given the predicted tree structure, they generate the tokens, i.e., variables, keywords, etc., using the seq2seq mechanism.

HOPPITY [92] models the problem of bug-fixing as learning a sequence of graph transformations. Given a buggy program modeled by a graph structure, their approach makes a sequence of predictions to identify the bug nodes' positions and the corresponding graph edits to produce a fix. To model the graph structure of the source code, they pass the processed abstract syntax tree (AST) through a Graph Neural Network (GNN) [93] to produce a fixed-dimensional vector representation.

DLFix [94] adopts a two-tier deep learning model for APR using Tree-LSTM [95]. First, the changed sub-tree in the AST is summarized in a single node to learn the local context. Second, the summarized node information and the sub-tree difference before and after the change are used to learn the code transformation. Additionally, they deploy a classification model to re-rank the generated patches.

Although there has been plenty of notable work in Machine Learning-based program repair, to the best of our knowledge, none of the previous works has exploited the additional information of code review to improve the performance of the techniques. Since there is a vast scope of improvement in the quality of suggested fixes, these and future approaches can be improved using our proposed mechanism of utilizing code review.

12. Conclusions

This research has increased the quality of fix suggestions by exploiting code review comments. We have prepared a dataset of 55,060 triplets (code before the change, code after the change, and code review) and show that our approach improves up to 34.82% over state-of-the-art APR techniques. To the best of our knowledge, this is the first step towards learning to repair programs using natural language instructions. We systematically analyzed the generated fix suggestions and found some categories left out by other techniques.

In the future, we aim to increase our model's performance by harnessing advanced deep learning architectures, such as using different encoders for code review and code, or using pre-trained encoders such as BERT [96]. Our current model repairs code using the first review comment in a comment thread; in the future, we would like to explore whether learning from subsequent comments in the thread improves the performance of our model. We hope that the released dataset and code will help research in Automatic Program Repair using code review.

CRediT authorship contribution statement

Faria Huq: Methodology, Data curation, Writing – original draft, Software. Masum Hasan: Methodology, Data curation, Writing – original draft. Md Mahim Anjum Haque: Visualization, Investigation. Sazan Mahbub: Methodology, Investigation, Software. Anindya Iqbal: Supervision, Writing – review & editing. Toufique Ahmed: Methodology, Writing – original draft.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The research was partially supported by the grant "Code Review Measurement" from Samsung Research Bangladesh. The cloud computational resource was supported by Intelligent Machines.

Appendix. Taxonomy for top-10 predictions

References

[1] M.E. Fagan, Design and code inspections to reduce errors in program development, IBM Syst. J. 15 (3) (1976) 182–211, http://dx.doi.org/10.1147/sj.153.0182, URL https://doi.org/10.1147/sj.153.0182.
[2] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at microsoft, in: Proceedings of the 12th Working Conference on Mining Software Repositories, in: MSR '15, IEEE Press, 2015, pp. 146–156.
[3] A. Bosu, J.C. Carver, C. Bird, J. Orbeck, C. Chockley, Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft, IEEE Trans. Softw. Eng. 43 (1) (2017) 56–75.
[4] N. Kennedy, Google mondrian: web-based code review and storage – niall kennedy, 2020, https://www.niallkennedy.com/blog/2006/11/google-mondrian.html, (Accessed on 04/23/2020).


[5] A. Tsotsis, Meet phabricator, the witty code review tool built inside facebook | TechCrunch, 2020, https://techcrunch.com/2011/08/07/oh-what-noble-scribe-hath-penned-these-words/, (Accessed on 04/23/2020).
[6] O. Kononenko, O. Baysal, M.W. Godfrey, Code review quality: How developers see it, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016, pp. 1028–1038.
[7] T. Britton, L. Jeng, G. Carver, P. Cheak, Reversible debugging software "Quantify the time and cost saved using reversible debuggers", 2020.
[8] C.L. Goues, M. Pradel, A. Roychoudhury, Automated program repair, Commun. ACM 62 (12) (2019) 56–65, http://dx.doi.org/10.1145/3318162, URL https://doi.org/10.1145/3318162.
[9] L. Gazzola, D. Micucci, L. Mariani, Automatic software repair: A survey, IEEE Trans. Softw. Eng. 45 (1) (2019) 34–67.
[10] H.D.T. Nguyen, D. Qi, A. Roychoudhury, S. Chandra, Semfix: Program repair via semantic analysis, in: Proceedings of the 2013 International Conference on Software Engineering, in: ICSE '13, IEEE Press, 2013, pp. 772–781.
[11] W. Weimer, T. Nguyen, C. Le Goues, S. Forrest, Automatically finding patches using genetic programming, in: 2009 IEEE 31st International Conference on Software Engineering, 2009, pp. 364–374.
[12] D. Kim, J. Nam, J. Song, S. Kim, Automatic patch generation learned from human-written patches, in: 2013 35th International Conference on Software Engineering (ICSE), 2013, pp. 802–811.
[13] F. DeMarco, J. Xuan, D. Le Berre, M. Monperrus, Automatic repair of buggy if conditions and missing preconditions with SMT, in: Proceedings of the 6th International Workshop on Constraints in Software Testing, Verification, and Analysis, in: CSTVA 2014, Association for Computing Machinery, New York, NY, USA, 2014, pp. 30–39, http://dx.doi.org/10.1145/2593735.2593740, URL https://doi.org/10.1145/2593735.2593740.
[14] S. Sidiroglou-Douskos, E. Lahtinen, F. Long, M. Rinard, Automatic error elimination by horizontal code transfer across multiple applications, SIGPLAN Not. 50 (6) (2015) 43–54, http://dx.doi.org/10.1145/2813885.2737988, URL https://doi.org/10.1145/2813885.2737988.
[15] S. Sidiroglou-Douskos, E. Lahtinen, F. Long, M. Rinard, Automatic error elimination by horizontal code transfer across multiple applications, in: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, in: PLDI '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 43–54, http://dx.doi.org/10.1145/2737924.2737988, URL https://doi.org/10.1145/2737924.2737988.
[16] V. Dallmeier, A. Zeller, B. Meyer, Generating fixes from object behavior anomalies, in: 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009, pp. 550–554.
[17] T. Ackling, B. Alexander, I. Grunert, Evolving patches for software repair, in: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, in: GECCO '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1427–1434, http://dx.doi.org/10.1145/2001576.2001768, URL https://doi.org/10.1145/2001576.2001768.
[18] F. Long, M. Rinard, Staged program repair with condition synthesis, in: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, in: ESEC/FSE 2015, Association for Computing Machinery, New York, NY, USA, 2015, pp. 166–178, http://dx.doi.org/10.1145/2786805.2786811, URL https://doi.org/10.1145/2786805.2786811.
[19] Y. Qi, X. Mao, Y. Lei, Z. Dai, C. Wang, The strength of random search on automated program repair, in: Proceedings of the 36th International Conference on Software Engineering, in: ICSE 2014, Association for Computing Machinery, New York, NY, USA, 2014, pp. 254–265, http://dx.doi.org/10.1145/2568225.2568254, URL https://doi.org/10.1145/2568225.2568254.
[20] E.K. Smith, E.T. Barr, C. Le Goues, Y. Brun, Is the cure worse than the disease? Overfitting in automated program repair, in: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, in: ESEC/FSE 2015, Association for Computing Machinery, New York, NY, USA, 2015, pp. 532–543, http://dx.doi.org/10.1145/2786805.2786825, URL https://doi.org/10.1145/2786805.2786825.
[21] Z. Qi, F. Long, S. Achour, M. Rinard, An analysis of patch plausibility and correctness for generate-and-validate patch generation systems, in: Proceedings of the 2015 International Symposium on Software Testing and Analysis, in: ISSTA 2015, Association for Computing Machinery, New York, NY, USA, 2015, pp. 24–36, http://dx.doi.org/10.1145/2771783.2771791, URL https://doi.org/10.1145/2771783.2771791.
[22] Eclipse, gerrit.eclipse code review, 2020, https://git.eclipse.org/r/#/q/status:open, (Accessed on 04/23/2020).
[23] S. Gehrmann, Y. Deng, A. Rush, Bottom-up abstractive summarization, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4098–4109.
[24] Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, M. Monperrus, Sequencer: Sequence-to-sequence learning for end-to-end program repair, IEEE Trans. Softw. Eng. (2019).
[25] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, D. Poshyvanyk, On learning meaningful code changes via neural machine translation, in: Proceedings of the 41st International Conference on Software Engineering, IEEE Press, 2019, pp. 25–36.
[26] A. Rush, Y.-W. Chang, M. Collins, Optimal beam search for machine translation, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 210–221.
[27] Gerrit, Gerrit code review | gerrit code review, 2020, https://www.gerritcodereview.com/, (Accessed on 03/27/2020).
[28] Gerrit, Gerrit code review - REST API, 2020, https://gerrit-review.googlesource.com/Documentation/rest-api.html, (Accessed on 03/26/2020).
[29] Acumos, gerrit.acumos code review, 2020, https://gerrit.acumos.org/r/q/status:open, (Accessed on 04/23/2020).
[30] A.O.S. Project, Android gerrit code review, 2020, https://android-review.googlesource.com/q/status:open+-is:wip, (Accessed on 04/23/2020).
[31] Asterix, gerrit.asterix code review, 2020, https://asterix-gerrit.ics.uci.edu/#/q/status:open, (Accessed on 04/23/2020).
[32] Cloudera, gerrit.cloudera code review, 2020, https://gerrit.cloudera.org/#/q/status:open, (Accessed on 04/23/2020).
[33] Couchbase, gerrit.couchbase code review, 2020, http://review.couchbase.org/#/q/status:open, (Accessed on 04/23/2020).
[34] Fd.io, gerrit.fd.io code review, 2020, https://gerrit.fd.io/r/q/status:open, (Accessed on 04/23/2020).
[35] Gerrithub, Gerrit.gerrithub code review, 2020, https://review.gerrithub.io/q/status:open, (Accessed on 04/23/2020).
[36] G.R. GoogleSource, Gerrit.googlesource code review, 2020, https://gerrit-review.googlesource.com/q/status:open, (Accessed on 04/23/2020).
[37] Iotivity, Gerrit.iotivity code review, 2020, https://gerrit.iotivity.org/gerrit/, (Accessed on 04/23/2020).
[38] Omnirom, Gerrit.omnirom code review, 2020, https://gerrit.omnirom.org/#/q/status:open, (Accessed on 04/23/2020).
[39] Opencord, Gerrit.opencord code review, 2020, https://gerrit.opencord.org/#/q/status:open, (Accessed on 04/23/2020).
[40] Polarsys, Gerrit.polarsys code review, 2020, https://git.polarsys.org/r/#/q/status:open, (Accessed on 04/23/2020).
[41] D. Gerrit, Gerrit.unicorn code review, 2020, https://gerrit.dirtyunicorns.com/#/q/status:open, (Accessed on 04/23/2020).
[42] Carbonrom, Gerrit.carbonrom code review, 2020, https://review.carbonrom.org/q/status:open, (Accessed on 04/23/2020).
[43] M. Tufano, C. Watson, G. Bavota, M.D. Penta, M. White, D. Poshyvanyk, An empirical study on learning bug-fixing patches in the wild via neural machine translation, ACM Trans. Softw. Eng. Methodol. (TOSEM) 28 (4) (2019) 19.
[44] S. Chakraborty, M. Allamanis, B. Ray, Codit: Code editing with tree-based neural machine translation, 2018, arXiv preprint arXiv:1810.00314.
[45] T. Lutellier, L. Pang, V.H. Pham, M. Wei, L. Tan, Encore: Ensemble learning using convolution neural machine translation for automatic program repair, 2019, arXiv preprint arXiv:1906.08691.
[46] java-diff-utils, java-diff-utils/java-diff-utils: Diff utils library is an OpenSource library for performing the comparison / diff operations between texts or some kind of data: computing diffs, applying patches, generating unified diffs or parsing them, generating diff output for easy future displaying (like side-by-side view) and so on, 2020, https://github.com/java-diff-utils/java-diff-utils, (Accessed on 03/31/2020).
[47] M. Allamanis, E.T. Barr, C. Bird, C. Sutton, Suggesting accurate method and class names, in: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, in: ESEC/FSE 2015, Association for Computing Machinery, New York, NY, USA, 2015, pp. 38–49, http://dx.doi.org/10.1145/2786805.2786849, URL https://doi.org/10.1145/2786805.2786849.
[48] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, A. Janes, Big code != big vocabulary: Open-vocabulary models for source code, in: International Conference on Software Engineering (ICSE), 2020.
[49] E. Loper, S. Bird, Nltk: The natural language toolkit, in: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, in: ETMTNLP '02, Association for Computational Linguistics, USA, 2002, pp. 63–70, http://dx.doi.org/10.3115/1118108.1118117, URL https://doi.org/10.3115/1118108.1118117.
[50] A. See, P.J. Liu, C.D. Manning, Get to the point: Summarization with pointer-generator networks, 2017, arXiv preprint arXiv:1704.04368.
[51] R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization, 2017, arXiv:1705.04304.
[52] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 67–72, URL https://www.aclweb.org/anthology/P17-4012.
[53] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[54] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014, arXiv:1409.0473.
[55] J. Gu, Z. Lu, H. Li, V.O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016, http://dx.doi.org/10.18653/v1/p16-1154, URL http://dx.doi.org/10.18653/v1/P16-1154.


[56] Z. Tu, Z. Lu, Y. Liu, X. Liu, H. Li, Modeling coverage for neural machine translation, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016, http://dx.doi.org/10.18653/v1/p16-1008, URL http://dx.doi.org/10.18653/v1/P16-1008.
[57] V.J. Hellendoorn, P. Devanbu, Are deep neural networks the best choice for modeling source code? in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, in: ESEC/FSE 2017, ACM, New York, NY, USA, 2017, pp. 763–773, http://dx.doi.org/10.1145/3106237.3106290, URL http://doi.acm.org/10.1145/3106237.3106290.
[58] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014, arXiv preprint arXiv:1406.1078.
[59] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, 2014, arXiv preprint arXiv:1409.1259.
[60] C. Dos Santos, M. Gatti, Deep convolutional neural networks for sentiment analysis of short texts, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, Technical Papers, 2014, pp. 69–78.
[61] J. Bader, A. Scott, M. Pradel, S. Chandra, Getafix: Learning to fix bugs automatically, Proc. ACM Programm. Lang. 3 (OOPSLA) (2019) 159.
[62] Ovirt, Status:open | gerrit.ovirt code review, 2020, https://gerrit.ovirt.org/#/q/status:open, (Accessed on 04/23/2020).
[63] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, D. Poshyvanyk, Micheletufano/NeuralCodeTranslator: Neural code translator provides instructions, datasets, and a deep learning infrastructure (based on seq2seq) that aims at learning code transformations, 2020, https://github.com/micheletufano/NeuralCodeTranslator, (Accessed on 04/23/2020).
[64] Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, M. Monperrus, KTH/chai: sequence-to-sequence learning for end-to-end program repair. open-science repo., 2020, https://github.com/KTH/chai, (Accessed on 04/23/2020).
[65] J. Cohen, Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit, Psychol. Bull. 70 (4) (1968) 213.
[66] A. Bosu, A. Iqbal, R. Shahriyar, P. Chakraborty, Understanding the motivations, challenges and needs of blockchain software developers: A survey, Empir. Softw. Eng. 24 (4) (2019) 2636–2673.
[67] T. Parr, J. Vinju, Towards a universal code formatter through machine learning, in: Proceedings of the 2016 ACM SIGPLAN International Conference on Software Language Engineering, in: SLE 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 137–151, http://dx.doi.org/10.1145/2997364.2997383, URL https://doi.org/10.1145/2997364.2997383.
[68] M. Allamanis, E.T. Barr, C. Bird, C. Sutton, Learning natural coding conventions, in: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, in: FSE 2014, Association for Computing Machinery, New York, NY, USA, 2014, pp. 281–293, http://dx.doi.org/10.1145/2635868.2635883, URL https://doi.org/10.1145/2635868.2635883.
[69] E. Murphy-Hill, T. Zimmermann, C. Bird, N. Nagappan, The design of bug fixes, in: 2013 35th International Conference on Software Engineering (ICSE), 2013, pp. 332–341.
[70] B.D. Bois, P.V. Gorp, A. Amsel, N.V. Eetvelde, H. Stenten, S. Demeyer, T. Mens, A discussion of refactoring in research and practice, 2004.
[71] R.J. Miara, J.A. Musselman, J.A. Navarro, B. Shneiderman, Program indentation and comprehensibility, Commun. ACM 26 (11) (1983) 861–867, http://dx.doi.org/10.1145/182.358437, URL https://doi.org/10.1145/182.358437.
[72] J. Raskin, Comments are more important than code: the thorough use of internal documentation is one of the most-overlooked ways of improving software quality and speeding implementation, Queue 3 (2) (2005) 64–65, http://dx.doi.org/10.1145/1053331.1053354.
[73] S.C.B. de Souza, N. Anquetil, K.M. de Oliveira, A study of the documentation essential to software maintenance, in: Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information, in: SIGDOC '05, Association for Computing Machinery, New York, NY, USA, 2005, pp. 68–75, http://dx.doi.org/10.1145/1085313.1085331, URL https://doi.org/10.1145/1085313.1085331.
[74] M. Monperrus, S. Urli, T. Durieux, M. Martinez, B. Baudry, L. Seinturier, Human-competitive patches in automatic program repair with repairnator, CoRR (2018) arXiv:1810.05806, URL http://arxiv.org/abs/1810.05806.
[75] Java style guide, https://www.cs.cornell.edu/courses/JavaAndDS/JavaStyle.html#Format, (Accessed on 02/01/2021).
[76] S. Forrest, T. Nguyen, W. Weimer, C. Le Goues, A genetic programming approach to automated software repair, in: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, 2009, pp. 947–954.
[77] W. Weimer, S. Forrest, C. Le Goues, T. Nguyen, Automatic program repair with evolutionary computation, Commun. ACM 53 (5) (2010) 109–116.
[78] W. Weimer, T. Nguyen, C. Le Goues, S. Forrest, Automatically finding patches using genetic programming, in: 2009 IEEE 31st International Conference on Software Engineering, IEEE, 2009, pp. 364–374.
[79] A. Arcuri, Automatic software generation and improvement through search based techniques, (Ph.D. thesis), University of Birmingham, 2009.
[80] V. Debroy, W.E. Wong, Using mutation to automatically suggest fixes for faulty programs, in: 2010 Third International Conference on Software Testing, Verification and Validation, IEEE, 2010, pp. 65–74.
[81] C. Kern, J. Esparza, Automatic error correction of java programs, in: International Workshop on Formal Methods for Industrial Critical Systems, Springer, 2010, pp. 67–81.
[82] S. Mechtaev, J. Yi, A. Roychoudhury, Angelix: Scalable multiline program patch synthesis via symbolic analysis, in: Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 691–701.
[83] D. Kim, J. Nam, J. Song, S. Kim, Automatic patch generation learned from human-written patches, in: 2013 35th International Conference on Software Engineering (ICSE), IEEE, 2013, pp. 802–811.
[84] R. Könighofer, R. Bloem, Automated error localization and correction for imperative programs, in: 2011 Formal Methods in Computer-Aided Design (FMCAD), IEEE, 2011, pp. 91–100.
[85] H. Samimi, M. Schäfer, S. Artzi, T. Millstein, F. Tip, L. Hendren, Automated repair of HTML generation errors in PHP applications using string constraint solving, in: 2012 34th International Conference on Software Engineering (ICSE), IEEE, 2012, pp. 277–287.
[86] C. Liu, J. Yang, L. Tan, M. Hafiz, R2fix: Automatically generating bug fixes from bug reports, in: 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, IEEE, 2013, pp. 282–291.
[87] R. Könighofer, R. Bloem, Repair with on-the-fly program analysis, in: Haifa Verification Conference, Springer, 2012, pp. 56–71.
[88] R. Singh, S. Gulwani, A. Solar-Lezama, Automated feedback generation for introductory programming assignments, in: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, pp. 15–26.
[89] Y. Pu, K. Narasimhan, A. Solar-Lezama, R. Barzilay, sk_p: a neural program corrector for MOOCs, Companion Proc. 2016 ACM SIGPLAN Int. Conf. Syst. Program. Lang. Appl.: Soft. Humanit. - SPLASH Companion 2016 (2016), http://dx.doi.org/10.1145/2984043.2989222, URL http://dx.doi.org/10.1145/2984043.2989222.
[90] H. Hata, E. Shihab, G. Neubig, Learning to generate corrective patches using neural machine translation, 2018, arXiv preprint arXiv:1812.07170.
[91] Y. LeCun, Y. Bengio, et al., Convolutional networks for images, speech, and time series, Handb. Brain Theory Neural Netw. 3361 (10) (1995) 1995.
[92] E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, K. Wang, Hoppity: Learning graph transformations to detect and fix bugs in programs, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020, URL https://openreview.net/forum?id=SJeqs6EFvB.
[93] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2008) 61–80.
[94] Y. Li, W. Shaohua, T.N. Nguyen, Dlfix: Context-based code transformation learning for automated program repair, Softw. Eng. (ICSE) (2020).
[95] K.S. Tai, R. Socher, C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1556–1566, http://dx.doi.org/10.3115/v1/P15-1150, URL https://www.aclweb.org/anthology/P15-1150.
[96] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv:1810.04805.
