
Singapore Management University
Institutional Knowledge at Singapore Management University

Research Collection School Of Computing and Information Systems
School of Computing and Information Systems

5-2018

Deep code comment generation

Xing HU
Peking University

Ge LI
Peking University

Xin XIA
Monash University

David LO
Singapore Management University, davidlo@smu.edu.sg

Zhi JIN
Peking University

Follow this and additional works at: https://ink.library.smu.edu.sg/sis_research
Part of the Software Engineering Commons

Citation
HU, Xing; LI, Ge; XIA, Xin; LO, David; and JIN, Zhi. Deep code comment generation. (2018). ICPC '18: Proceedings of the 26th Conference on Program Comprehension, Gothenburg, Sweden, May 27-28. 200-210.
Available at: https://ink.library.smu.edu.sg/sis_research/4292

This Conference Proceeding Article is brought to you for free and open access by the School of Computing and Information Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in Research Collection School Of Computing and Information Systems by an authorized administrator of Institutional Knowledge at Singapore Management University. For more information, please email cherylds@smu.edu.sg.
Deep Code Comment Generation∗
Xing Hu¹, Ge Li¹, Xin Xia², David Lo³, Zhi Jin¹
¹ Key Laboratory of High Confidence Software Technologies (Peking University), MoE, Beijing, China
² Faculty of Information Technology, Monash University, Australia
³ School of Information Systems, Singapore Management University, Singapore
¹ {huxing0101,lige,zhijin}@pku.edu.cn, ² xin.xia@monash.edu, ³ davidlo@smu.edu.sg

ABSTRACT

During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.

CCS CONCEPTS
• Software and its engineering → Documentation; • Computing methodologies → Neural networks;

KEYWORDS
program comprehension, comment generation, deep learning

ACM Reference Format:
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation. In Proceedings of IEEE/ACM International Conference on Program Comprehension, Gothenburg, Sweden, May 27 - May 28, 2018 (ICPC'18). ACM, New York, NY, USA, 11 pages. https://doi.org/10.475/123_4

1 INTRODUCTION

In software development and maintenance, developers spend around 59% of their time on program comprehension activities [45]. Previous studies have shown that good comments are important to program comprehension, since developers can understand the meaning of a piece of code by reading the natural language description in the comments [35]. Unfortunately, due to tight project schedules and other reasons, code comments are often mismatched, missing or outdated in many projects. Automatic generation of code comments can not only save developers' time in writing comments, but also help in source code understanding.

Many approaches have been proposed to generate comments for methods [24, 35] and classes [25] of Java, which has been the most popular programming language over the past 10 years¹. Their techniques vary from the use of manually-crafted templates [25] to Information Retrieval (IR) [14, 15]. Moreno et al. [25] defined heuristics and stereotypes to synthesize comments for Java classes. These heuristics and stereotypes are used to select the information that will be included in the comment. Haiduc et al. [14, 15] applied IR approaches to generate summaries for classes and methods. IR approaches such as the Vector Space Model (VSM) and Latent Semantic Indexing (LSI) usually search for comments from similar code snippets. Although promising, these techniques have two main limitations: First, they fail to extract accurate keywords for identifying similar code snippets when identifiers and methods are poorly named. Second, they rely on whether similar code snippets can be retrieved and how similar the snippets are.

Recent years have seen an emerging interest in building probabilistic models for large-scale source code. Hindle et al. [17] have addressed the naturalness of software and demonstrated that code can be modeled by probabilistic models. Several subsequent studies have developed various probabilistic models for different software tasks [12, 23, 40, 41]. When applied to code summarization, different from IR-based approaches, existing probabilistic-model-based approaches usually generate comments directly from code instead of synthesizing them from keywords. One such probabilistic-model-based approach is by Iyer et al. [19], who propose an attention-based Recurrent Neural Network (RNN) model called CODE-NN. It builds a language model for natural language comments and aligns the words in comments with individual code tokens directly by an attention component. CODE-NN recommends code comments given source code snippets extracted from Stack Overflow. Experimental results demonstrate the effectiveness of probabilistic models on code summarization. These studies provide principled methods for probabilistically modeling and resolving ambiguities both in natural language descriptions and in the source code.

∗ This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201, and the National Natural Science Foundation of China under Grant Nos. 61232015 and 61620106007. Zhi Jin and Ge Li are corresponding authors.

¹ https://www.tiobe.com/tiobe-index/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICPC'18, May 27 - May 28, 2018, Gothenburg, Sweden
© 2018 Association for Computing Machinery.
ACM ISBN 123-4567-24-567/08/06...$15.00
https://doi.org/10.475/123_4

In this paper, to utilize the advantage of deep learning techniques, we propose a novel approach, DeepCom, to generate descriptive comments for Java methods, which are the functional units of the Java language. DeepCom builds upon advances in Neural Machine Translation (NMT); NMT aims to automatically translate from one language (e.g., Chinese) to another language (e.g., English) and has been shown to achieve great success on natural language corpora [6, 37]. Intuitively, generating comments can be considered a variant of the NMT problem, where source code written in a programming language needs to be translated into text in natural language. Compared to CODE-NN, which only builds a language model for comments, the NMT model builds language models for both source code and comments. The words in comments align with the RNN hidden states, which involve the semantics of code tokens. DeepCom generates comments by automatically learning from features (e.g., identifier names, formatting, semantics, and syntax features) extracted from a large-scale Java corpus. Different from traditional machine translation, our task is challenging since:

(1) Source code is structured: In contrast to natural language text, which is weakly structured, programming languages are formal languages and source code written in them is unambiguous and structured [3]. Many probabilistic models used in NMT are sequence-based models that need to be adapted to structured code analysis. The main challenge and opportunity is how to take advantage of the rich and unambiguous structure information of source code to boost the effectiveness of existing NMT techniques.

(2) Vocabulary: In natural language (NL) corpora normally used for NMT, the vocabulary is usually limited to the most common words, e.g., 30,000 words, and words outside the vocabulary are treated as unknown words – often marked as ⟨UNK⟩. This is effective for such NL corpora because words outside the dominant vocabulary are so rare. In code corpora, the vocabulary consists of keywords, operators, and identifiers. It is common for developers to define various new identifiers, and thus they tend to proliferate. In our dataset, we get 234,146 unique tokens after replacing numerals and strings with the generic tokens ⟨NUM⟩ and ⟨STR⟩. In a codebase used to build probabilistic models, there are likely to be many out-of-vocabulary identifiers. As Table 1 illustrates, there are 234,055 unique identifiers in our dataset. If we use the most common 30,000 tokens as the code vocabulary, about 85% of identifiers will be regarded as ⟨UNK⟩. Additionally, about 30% of the tokens in source code are ⟨UNK⟩. Hellendoorn and Devanbu [16] have demonstrated that it is unreasonable for source code to use such a vocabulary.

To address these issues, DeepCom customizes a sequence-based language model to analyze Abstract Syntax Trees (ASTs), which capture the structure and semantics of Java methods. The ASTs are converted into sequences before they are fed into DeepCom. It is generally accepted that a tree cannot be restored from a sequence generated by a classical traversal method such as pre-order traversal or post-order traversal. To better present the structure of ASTs, and keep the sequences unambiguous, we propose a new structure-based traversal (SBT) method to traverse ASTs. Using SBT, a subtree under a given node is included in a pair of brackets. The brackets represent the structure of the AST, and we can restore a tree unambiguously from a sequence generated using SBT.

Moreover, to address the vocabulary challenge, we propose a new method to represent unknown tokens. In our work, the tokens in AST sequences include terminal nodes, non-terminal nodes, and brackets. The unknown tokens come from the terminal tokens of ASTs. We replace the unknown tokens with their types instead of a universal special ⟨UNK⟩ token.

DeepCom generates comments word-by-word from AST sequences. We train and evaluate DeepCom on a Java dataset that consists of 9,714 Java projects from GitHub. The experimental results show that DeepCom can generate informative comments. Additionally, the results show that DeepCom achieves the best performance when compared with a number of baselines, including the state-of-the-art approach by Iyer et al. [19].

The main contributions of this paper are as follows:

• We formulate the code comment generation task as a machine translation task.
• We customize a sequence-based model to process structural information extracted from source code to generate comments for Java methods. In particular, we propose a new AST traversal method (namely structure-based traversal) and a domain-specific method to deal with out-of-vocabulary tokens better.

Paper organization. The remainder of this paper is organized as follows. Section II presents background materials on language models and NMT. Section III elaborates on the details of DeepCom. Section IV and Section V present the experiment setup and results. Section VI discusses the strengths of DeepCom and threats to validity. Section VII surveys the related work. Finally, Section VIII concludes the paper and points out potential future directions.

2 BACKGROUND

2.1 Language Models

Our work is inspired by the machine translation problem in the NLP field. We exploit language models learned from a large-scale source code corpus. The models generate code comments from the learned features. Language models learn the probabilistic distribution over sequences of words. They work tremendously well on a large variety of problems (e.g., machine translation [6], speech recognition [9], and question answering [46]).

For a sequence x = (x_1, x_2, ..., x_n) (e.g., a statement), the language model aims to estimate its probability. The probability of a sequence is computed via each of its tokens. That is,

P(x) = P(x_1) P(x_2 | x_1) ... P(x_n | x_1, ..., x_{n−1})    (1)

In this paper, we adopt a language model based on the deep neural network called Long Short-Term Memory (LSTM) [18]. LSTM is one of the state-of-the-art RNNs. LSTM outperforms the general RNN because it is capable of learning long-term dependencies. It is a natural model to use for source code, which has long dependencies (e.g., a class is used far away from its import statement). The details of RNN and LSTM are shown in Figure 1.

Figure 1: An illustration of a basic RNN and an LSTM. (a) Standard RNN model and its unfolded architecture through time steps. (b) The LSTM unit.

2.1.1 Recurrent Neural Networks. RNNs are intimately related to sequences and lists because of their chain-like nature. They can in principle map from the entire history of previous inputs to each output. At each time step t, the unit in the RNN takes not only the input of the current step but also the hidden state output by the previous time step t − 1. As Figure 1(a) illustrates, the hidden state of time step t is updated according to the input vector x_t and the previous hidden state h_{t−1}, namely, h_t = tanh(W x_t + U h_{t−1} + b), where W, U, and b are the trainable parameters that are updated during training, and tanh is the activation function: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).
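To make the hidden-state update above concrete, the following minimal NumPy sketch (an illustration only, not the paper's implementation; the dimensions and random initialization are assumed for the example) unrolls a vanilla RNN over a short input sequence:

import numpy as np

def rnn_forward(xs, W, U, b):
    """Unroll a vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
    hidden_size = U.shape[0]
    h = np.zeros(hidden_size)          # h_0 is the all-zero vector
    states = []
    for x_t in xs:                     # one input vector per time step
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return states                      # one hidden state per input token

# Toy example: 5 time steps, 8-dimensional inputs, 16-dimensional hidden state.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))
W = rng.normal(scale=0.1, size=(16, 8))
U = rng.normal(scale=0.1, size=(16, 16))
b = np.zeros(16)
states = rnn_forward(xs, W, U, b)
print(len(states), states[-1].shape)   # 5 (16,)

An LSTM replaces the single tanh update with gated updates of a memory cell, as described next, but the unrolling over time steps is the same.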
A prominent drawback of the standard RNN model is that gradients may explode or vanish during back-propagation. These phenomena often appear when long dependencies exist in the sequences. To address these problems, researchers have proposed several variants that preserve long-term dependencies. These variants include LSTM and the Gated Recurrent Unit (GRU). In this paper, we adopt the LSTM, which has achieved success on many NLP tasks [6, 37].

2.1.2 Long Short-Term Memory. LSTM introduces a structure called the memory cell to address the difficulty an ordinary RNN has in learning long-term dependencies in the data. The LSTM is trained to selectively "forget" information from the hidden states, thus allowing room to take in more important information [18]. LSTM introduces a gating mechanism to control when and how to read previous information from the memory cell and write new information. The memory cell vector in the recurrent unit preserves long-term dependencies. In this way, LSTM handles long-term dependencies more effectively than a vanilla RNN. LSTM has been widely used to solve semantically related tasks and has achieved convincing performance. These advantages motivate us to exploit LSTM for building models for source code and comments. Figure 1(b) illustrates a typical LSTM unit; for more details of LSTM, please refer to [10, 18].

2.2 Neural Machine Translation

NMT [44] is an end-to-end learning approach for automated translation. It is a deep learning based approach and has made rapid progress in recent years. NMT has shown impressive results surpassing those of phrase-based systems while addressing shortcomings such as the need for hand-engineered features. Its architecture typically consists of two RNNs, one to consume the input text sequences and the other to generate the translated output sequences. It is often accompanied by an attention mechanism that aligns target with source tokens [6].

NMT bridges the gap between different natural languages. Generating comments from source code is a variant of the machine translation problem, between source code and natural language. We explore whether the NMT approach can be applied to comments generation. In this paper, we follow the common Sequence-to-Sequence (Seq2Seq) [37] learning framework with attention [6], which helps cope effectively with long source code.

3 PROPOSED APPROACH

The transition process between source code and comments is similar to the translation process between different natural languages. Existing research has applied machine translation methods to translate code from one source language (e.g., Java) to another (e.g., C#) [13]. A few studies adopt machine translation methods for generating natural language descriptions from source code. Oda et al. [30] present a machine translation approach to generate natural language pseudo-code from source code at the statement level. In this paper, DeepCom translates source code to a high-level description at the method level.

Figure 2: Overall framework of DeepCom. (a) Data processing; (b) training a sequence-to-sequence model; (c) comments generation with the trained model.

The overall framework of DeepCom is illustrated in Figure 2. DeepCom mainly consists of three stages: data processing, model training, and online testing. The source code we obtained from GitHub is parsed and preprocessed into a parallel corpus of Java methods and their corresponding comments. In order to learn the structural information, the Java methods are converted into AST sequences by a special traversal approach before being input into the model. With the parallel corpus of AST sequences and comments, we build and train generative neural models based on the idea of NMT. There are two challenges during the training process:

• How to represent ASTs so as to store the structural information and keep the representation unambiguous while traversing the ASTs?
• How to deal with out-of-vocabulary tokens in source code?

In the following paragraphs, we introduce the details of the model and the approaches we propose to resolve the above-mentioned challenges.

3.1 Sequence-to-Sequence Model

In this paper, we apply a Sequence-to-Sequence (Seq2Seq) model to learn source code and generate comments. The Seq2Seq model is widely used for machine translation [37], text summarization [34], dialogue systems [39], etc. The model consists of three components, an Encoder, a Decoder, and an Attention component, in which the Encoder and Decoder are both LSTMs. Figure 3 illustrates the detailed Seq2Seq model.

Figure 3: Sequence-to-Sequence model.

3.1.1 Encoder. The encoder is an LSTM as described in Section 2 and is responsible for learning the source code. At each time step t, it reads one token x_t of the sequence, then updates and records the current hidden state s_t, namely,

s_t = f(x_t, s_{t−1})    (2)

where f is an LSTM unit that maps a word of the source language x_t into a hidden state s_t. The encoder learns latent features from the source code, and the features are encoded into the context vector c. These latent features include identifier naming conventions, control structures, etc. In this paper, DeepCom adopts the attention mechanism to compute the context vector c.

3.1.2 Attention. The attention mechanism is a recent model that selects the important parts of the input sequence for each target word. For example, the token "whether" in comments usually aligns with the "if" statements in the source code. The generation of each word is guided by the classic attention method proposed by Bahdanau et al. [6].

It defines an individual c_i for predicting each target word y_i as a weighted sum of all hidden states s_1, ..., s_m in the encoder, computed as

c_i = Σ_{j=1}^{m} α_{ij} s_j    (3)

The weight α_{ij} of each hidden state s_j is computed as

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{m} exp(e_{ik})    (4)

and

e_{ij} = a(h_{i−1}, s_j)    (5)

is an alignment model which scores how well the inputs around position j and the output at position i match.

3.1.3 Decoder. The Decoder aims to generate the target sequence y by sequentially predicting the probability of a word y_i conditioned on the context vector c_i and its previously generated words y_1, ..., y_{i−1}, i.e.,

p(y_i | y_1, ..., y_{i−1}, x) = g(y_{i−1}, h_i, c_i)    (6)

where g is used to estimate the probability of the word y_i. The goal of the model is to minimize the cross-entropy, i.e., to minimize the following objective function:

H(y) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{n} log p(y_j^{(i)})    (7)

where N is the total number of training instances, and n is the length of each target sequence. y_j^{(i)} denotes the jth word of the ith instance. By optimizing the objective function with optimization algorithms such as gradient descent, the parameters can be estimated.
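To make the attention computation concrete, here is a minimal NumPy sketch of Equations (3)–(5). It is an illustration rather than the paper's code, and it assumes a simple dot-product alignment model a(h, s) = hᵀs in place of the learned alignment network of Bahdanau et al. [6]:

import numpy as np

def attention_context(h_prev, encoder_states):
    """Compute attention weights (Eq. 4) and the context vector (Eq. 3).

    h_prev         : previous decoder hidden state, shape (d,)
    encoder_states : encoder hidden states s_1..s_m, shape (m, d)
    """
    # Eq. (5): alignment scores e_ij; a dot product is assumed here for brevity.
    scores = encoder_states @ h_prev                  # shape (m,)
    # Eq. (4): softmax over the scores gives the weights alpha_ij.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Eq. (3): the context vector is the weighted sum of encoder states.
    context = weights @ encoder_states                # shape (d,)
    return weights, context

# Toy usage: 6 encoder states of dimension 4.
rng = np.random.default_rng(1)
s = rng.normal(size=(6, 4))
h = rng.normal(size=4)
alpha, c = attention_context(h, s)
print(alpha.round(3), c.shape)   # weights sum to 1; context has shape (4,)

At each decoding step, the context vector produced this way is fed, together with the previous word and decoder state, into g of Equation (6) to predict the next comment word.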

3.2 Abstract Syntax Tree with SBT traversal

Translation between source code and NL is challenging due to the structure of source code. One simple way to model source code is to just view it as plain text. However, in that way, the structural information will be omitted, which will cause inaccuracies in the generated comments. To learn the semantic and syntactic information at the same time, we convert the ASTs into specially formatted sequences by traversing them. Sequences obtained by classical traversal methods (e.g., pre-order traversal) are lossy, since the original ASTs cannot be unambiguously reconstructed from them. This ambiguity may cause different Java methods (each with different comments) to be mapped to the same sequence representation. It is confusing for the neural network if there are multiple labels (in our setting, comments) given to a specific input.

To address this problem, we propose a Structure-based Traversal (SBT) method to traverse the AST. The details are presented in Algorithm 1. Figure 4 illustrates a simple example of using SBT to traverse a tree, and the detailed procedure is as follows:

• From the root node, we first use a pair of brackets to represent the tree structure and put the root node itself behind the right bracket, that is (1)1, as shown in Figure 4.
• Next, we traverse the subtrees of the root node and put all root nodes of the subtrees into the brackets, i.e., (1(2)2(3)3)1.
• Recursively, we traverse each subtree until all nodes are traversed and the final sequence (1(2(4)4(5)5(6)6)2(3)3)1 is obtained.

Figure 4: An example of sequencing an AST by SBT. (For each number, the number after a bracket denotes the node itself, and the numbers inside the brackets denote the tree structure rooted at that node.)

Algorithm 1 Structure-based Traversal
1: procedure SBT(r)          ▷ Traverse a tree from root r
2:   seq ← ∅                 ▷ seq is the sequence of a tree after traversal
3:   if !r.hasChild then
4:     seq ← (r)r            ▷ Add brackets for terminal nodes
5:   else
6:     seq ← (r              ▷ Add left bracket for non-terminal nodes
7:     for c in r.children do
8:       seq ← seq + SBT(c)
9:     seq ← seq + )r        ▷ Add right bracket for non-terminal nodes after traversing all their children
10:  return seq

DeepCom processes each AST into a sequence following the SBT algorithm. For example, the AST sequence of the following Java method, extracted from the project Eclipse Che², is shown in Figure 5:

public String extractFor(Integer id){
    LOG.debug("Extracting method with ID:{}", id);
    return requests.remove(id);
}

² https://github.com/eclipse/che

Figure 5: AST of the Java method named extractFor. (The left part shows the AST; the right part shows the sequence produced by SBT.)

The left part of Figure 5 is the AST of the method. The non-terminal nodes (those without boxes) illustrate the structural information of the source code. They have the feature "type", which comes from a fixed set (e.g., IfStatement, Block, and ReturnStatement). The terminal nodes (those within boxes) not only have a "type" but also a "value" (the token within brackets). The "value" is the concrete token occurring in the source code and the "type" indicates the type of the token. The right part of the figure is the sequence constructed by traversing the AST. The terminal nodes are represented by their "type" and "value" (connected by "_"); for example, "log" is represented by "SimpleName_Log". The non-terminal nodes are represented by their "type". A subtree is included in a pair of brackets and we can restore the AST from the given sequence easily. In this way, we keep the structural information and make the representation lossless – the original AST can be unambiguously reconstructed from the sequence.
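The SBT procedure of Algorithm 1 is straightforward to implement. The following short Python sketch is an illustration only (not the authors' implementation) and uses a hypothetical Node class with type, value, and children fields:

class Node:
    """A minimal AST node: non-terminals carry a type, terminals also carry a value."""
    def __init__(self, type_, value=None, children=None):
        self.type = type_
        self.value = value
        self.children = children or []

    def label(self):
        # Terminals are written as "Type_value", non-terminals as "Type" (Section 3.2).
        return f"{self.type}_{self.value}" if self.value is not None else self.type

def sbt(node):
    """Structure-based traversal (Algorithm 1): wrap every subtree in brackets."""
    label = node.label()
    if not node.children:                       # terminal node
        return ["(", label, ")", label]
    seq = ["(", label]                          # left bracket for a non-terminal
    for child in node.children:
        seq += sbt(child)
    seq += [")", label]                         # right bracket after all children
    return seq

# Toy tree corresponding roughly to "return requests.remove(id);"
tree = Node("ReturnStatement", children=[
    Node("MethodInvocation", children=[
        Node("SimpleName", "requests"),
        Node("SimpleName", "remove"),
        Node("SimpleName", "id"),
    ]),
])
print(" ".join(sbt(tree)))
# ( ReturnStatement ( MethodInvocation ( SimpleName_requests ) SimpleName_requests ...

Because every subtree is closed by a bracket annotated with its root label, the original tree can be rebuilt from the printed sequence, which is exactly the lossless property discussed above.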
3.3 Out-of-vocabulary tokens

Vocabulary is another challenge when modeling source code [16]. In NL, studies usually limit the vocabulary to the most common words (e.g., the top 30,000) during data processing. The out-of-vocabulary tokens are replaced by a special unknown token, e.g., ⟨UNK⟩. This is effective for NLP because words outside the vocabulary are so rare. However, this method is arguably inappropriate when it comes to source code. In addition to fixed operators and keywords, there are user-defined identifiers, which make up the majority of code tokens [7]. These identifiers have a substantial influence on the vocabulary of language models. If we keep a regular vocabulary size for source code, there will be many unknown tokens. If we want the occurrences of ⟨UNK⟩ tokens to be as few as possible, the vocabulary size has to increase a lot. A large vocabulary makes it difficult to train a deep learning model, since it requires more training data, time, and memory. To achieve optimal and stable results, models need to run a larger number of iterations to tune the parameters for each word in the vocabulary.

Hence, we propose a new method to represent the out-of-vocabulary tokens of source code. In an AST, the non-terminal nodes have a "type" feature, and terminal nodes not only have a "type" feature but also a "value" feature. DeepCom takes the AST sequences as inputs; the vocabulary consists of the brackets, all "types" of nodes (including non-terminal nodes T_non and terminal nodes T_term), and partial type-value pairs of terminal tokens. We keep the tokens that appear among the most frequent 30,000 tokens as the AST sequence vocabulary. For the type-value pairs outside the vocabulary, DeepCom uses their "type" T_term instead of the ⟨UNK⟩ token to replace them. For example, for the terminal nodes "extractFor" and "id" in the code presented above, their types are both "SimpleName", as shown in Figure 5. The tokens input into the model should be "SimpleName_extractFor" and "SimpleName_id" respectively. However, since the token "SimpleName_extractFor" is out of the vocabulary, we use its type "SimpleName" to represent it instead. In this way, the out-of-vocabulary tokens are represented by their related type information instead of a meaningless word.
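A minimal sketch of this replacement rule (illustrative only; the vocabulary below is a toy set standing in for the 30,000 most frequent AST tokens used in the paper):

def encode_token(node_type, value, vocab):
    """Map an AST terminal to a vocabulary token.

    In-vocabulary terminals keep their "Type_value" form; out-of-vocabulary
    terminals fall back to their type alone instead of <UNK> (Section 3.3).
    """
    pair = f"{node_type}_{value}"
    return pair if pair in vocab else node_type

# Toy vocabulary standing in for the most frequent AST tokens.
vocab = {"SimpleName_id", "SimpleName_LOG", "SimpleType_String"}

print(encode_token("SimpleName", "id", vocab))          # SimpleName_id  (in vocabulary)
print(encode_token("SimpleName", "extractFor", vocab))  # SimpleName     (replaced by its type)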

4 EXPERIMENT SETUP

We parse the Java methods into ASTs using Eclipse's JDT compiler³ and extract the corresponding Javadoc comments, which are the standard comments for Java methods. Methods without Javadoc are omitted in this paper. For each method with a comment, we use the first sentence of its Javadoc description as the comment, since it typically describes the functionality of the Java method according to the Javadoc guidance⁴. Empty or one-word descriptions are filtered out in this work because such comments cannot express the functionality of a Java method. We also exclude setter, getter, constructor, and test methods, since it is easy for a model to generate comments for them.

Finally, we get 69,708 ⟨Java method, comment⟩ pairs⁵. Similar to Jiang et al. [20]'s work, we randomly select 80% of the pairs for training, 10% for validation, and the remaining 10% for testing. Table 1 and Table 2 illustrate statistics of the corpus, including the lengths of methods and comments. The average lengths of Java methods and comments are 99.94 and 8.86 tokens in this corpus. We find that more than 95% of code comments have no more than 50 words and about 90% of Java methods are no longer than 200 tokens.

Table 1: Statistics for code snippets in our dataset

#Methods    #All Tokens    #All Identifiers    #Unique Tokens    #Unique Identifiers
69,708      8,713,079      2,711,496           234,146           234,055

Table 2: Statistics for code lengths and comment lengths

Code lengths:      Avg 99.94   Mode 16   Median 65   <100: 68.63%   <150: 82.06%   <200: 89.00%
Comment lengths:   Avg 8.86    Mode 8    Median 13   <20: 75.50%    <30: 86.79%    <50: 95.45%

During training, numerals and strings are replaced with the generic tokens ⟨NUM⟩ and ⟨STR⟩ respectively. The maximum length of AST sequences is set to 400. We use a special symbol ⟨PAD⟩ to pad shorter sequences, and longer sequences are cut to 400 tokens. We add the special tokens ⟨START⟩ and ⟨EOS⟩ to the decoder sequences during training; ⟨START⟩ marks the start of the decoding sequence and ⟨EOS⟩ marks its end. The maximum comment length is set to 30. The vocabulary sizes for AST sequences and comments are both 30,000 in this paper. While there is no ⟨UNK⟩ in the AST sequences, there are a few out-of-vocabulary tokens in comments that are replaced by ⟨UNK⟩.

4.1 Training Details

The model is validated every 2,000 minibatches on the validation set using BLEU [31], which is a commonly used automatic metric for NMT. Training runs for about 50 epochs and we select the model that achieves the best results on the validation set as the final model. The model is then evaluated on the test set by computing average BLEU scores; the results are discussed in Section 5. All models are implemented using the Tensorflow framework⁶ and extended based on the Seq2Seq model in the Tensorflow tutorials⁷. The parameters are as follows:

• SGD (with minibatch size 100, randomly chosen from the training instances) is used to train the parameters.
• DeepCom uses two-layered LSTMs with 512-dimensional hidden states and 512-dimensional word embeddings.
• The learning rate is set to 0.5 and we clip the gradient norm at 5. The learning rate is decayed with rate 0.99.
• To prevent over-fitting, we use dropout with rate 0.5.

³ http://www.eclipse.org/jdt/
⁴ http://www.oracle.com/technetwork/articles/java/index-137868.html
⁵ Data is available at https://github.com/huxingfree/DeepCom
⁶ https://www.tensorflow.org/
⁷ https://github.com/tensorflow/nmt

4.2 Evaluation Measure: BLEU-4

DeepCom uses the machine translation evaluation metric BLEU-4 [31] to measure the quality of generated comments. The BLEU score is a widely-used accuracy measure for NMT [22] and has been used in software task evaluations [12, 20]. It calculates the similarity between the generated sequence and a reference sequence (usually a human-written sequence). The BLEU score ranges from 1 to 100 as a percentage value. The higher the BLEU, the closer the candidate is to the reference. If the candidate is completely equal to the reference, the BLEU becomes 100%. Jiang et al. [20] exploit it to evaluate generated summaries for commit messages. Gu et al. [12] use BLEU to evaluate the accuracy of API sequences generated from natural language queries. Their experiments show that the BLEU score is a reasonable measure of the accuracy of generated sequences. It computes the n-gram precision of a candidate sequence with respect to the reference. The score is computed as:

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)    (8)

where p_n is the ratio of length-n subsequences in the candidate that are also in the reference. In this paper, we set N to 4, which is the maximum number of grams. BP is the brevity penalty,

BP = 1, if c > r;  e^{(1−r/c)}, if c ≤ r    (9)

where c is the length of the candidate translation and r is the effective reference sequence length.

In this paper, we regard a generated comment as a candidate and a programmer-written comment (extracted from Javadoc) as a reference.
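For reference, the sentence-level form of Equations (8)–(9) can be sketched as follows. This is an independent illustration with uniform weights w_n = 1/N and modified (clipped) n-gram precisions, not the exact corpus-level procedure of Papineni et al. [31]:

import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """p_n: clipped fraction of the candidate's n-grams that also appear in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())

def bleu4(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights (Eq. 8) and brevity penalty (Eq. 9)."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)             # brevity penalty
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:                               # log(0) is undefined
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

generated = "sorts the array in ascending order using the natural order".split()
reference = "rearranges the array in ascending order using the natural order".split()
print(round(bleu4(generated, reference) * 100, 2))          # ≈ 88.01

As the example shows, a candidate that shares most n-grams with the reference but differs in a word scores high but below 100, which is the behavior the evaluation relies on.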

5 RESULTS

In this section, we evaluate the different approaches by measuring their accuracy in generating comments for Java methods. Specifically, we mainly focus on the following research questions:

• RQ1: How effective is DeepCom compared with the state-of-the-art baseline?
• RQ2: How effective is DeepCom for source code and comments of varying lengths?

5.1 RQ1: DeepCom vs. Baseline

5.1.1 Baseline. We compare DeepCom with CODE-NN [19], which is a state-of-the-art code summarization approach and also a deep learning based method. CODE-NN is an end-to-end generation system that produces summaries for code snippets. It exploits an RNN with attention to generate summaries by integrating the token embeddings of source code instead of building language models for the source code. We do not use IR approaches as baselines, because the results in CODE-NN have shown that CODE-NN outperforms the IR based approaches.

We also compare DeepCom with its variants, that is, the basic Seq2Seq model, the attention-based Seq2Seq model, and DeepCom with a classical traversal method (i.e., pre-order traversal). The Seq2Seq model and the attention-based Seq2Seq model take the source code as inputs. They aim to evaluate the effectiveness of NMT approaches for comments generation. To evaluate the effectiveness of SBT, we compare SBT with one of the most ordinary traversal methods – pre-order traversal. In addition, we also compare DeepCom with CODE-NN on the dataset that CODE-NN uses.

5.1.2 Results. We measure the gap between automatically generated comments and human-written comments. The difference is evaluated by a machine translation metric, i.e., the BLEU-4 score. Table 3 shows the average BLEU-4 scores of the different approaches for generating comments for Java methods.

Table 3: Evaluation results on Java methods

Approaches                    BLEU-4 score (%)
CODE-NN                       25.30
Seq2Seq                       34.87
Attention-based Seq2Seq       35.50
DeepCom (Pre-order)           36.01
DeepCom (SBT)                 38.17

The machine translation model Seq2Seq substantially outperforms CODE-NN. CODE-NN fails to learn the semantics of the source code when it generates comments directly from token embeddings of the source code. The Seq2Seq model exploits an RNN to build a language model for the source code and effectively learns the semantics of Java methods. The BLEU-4 score increases further when the structural information is integrated. Compared to DeepCom with pre-order traversal, the SBT-based model is much more capable of learning semantic and syntactic information within Java methods. In a word, the improvement of our proposed DeepCom (SBT) over CODE-NN is large. The average BLEU-4 score of DeepCom improves by about 13% compared to CODE-NN. The results of DeepCom are comparable to the BLEU scores of state-of-the-art NMT models on natural language translation, which are about 40% [21, 44].

We further conduct experiments on the same datasets CODE-NN used, which include C# and SQL snippets collected from Stack Overflow. The results are shown in Table 4. Since many of the code snippets in the provided dataset are incomplete and hard to parse into ASTs, we compare the Seq2Seq model with CODE-NN. The results highlight that Seq2Seq outperforms the state-of-the-art method CODE-NN on different languages. The average BLEU scores of Seq2Seq improve by more than 10% on the various programming languages compared to CODE-NN.

Table 4: Evaluation results on the CODE-NN datasets, including the C# and SQL programming languages

Language    Approaches    BLEU-4 score (%)
C#          CODE-NN       20.4
C#          Seq2Seq       30.00
SQL         CODE-NN       17.0
SQL         Seq2Seq       30.94

Through the evaluation, we have verified that the comments generation task is very similar to machine translation, except that the structural information in source code needs to be taken into account. DeepCom can generate more informative comments than the state-of-the-art method. Compared to the model without ASTs, the BLEU score of DeepCom increases to 38.17%, and the BLEU-4 scores of about 38% of the instances are greater than 50%. We evaluate two traversal methods, SBT and pre-order traversal. DeepCom with SBT performs better than traditional pre-order traversal. This is because SBT better preserves the structure of ASTs. The experimental results indicate that the structural information is important for translating text in structured languages into unstructured ones.

5.2 RQ2: BLEU-4 scores for source code and comments of different lengths

We further analyze the prediction accuracy for Java methods and comments of different lengths. Figure 6 presents the average BLEU-4 scores of DeepCom and CODE-NN for source code and ground truth comments of varying lengths. As Figure 6(a) illustrates, the average BLEU-4 scores tend to be lower as the source code length increases. For most code lengths, the average BLEU-4 scores of DeepCom improve on those of CODE-NN by about 10%. For DeepCom, AST lengths grow rapidly as the source code lengths increase and, as a result, some features are lost when long AST sequences are cut to a fixed length during training.

For comments of different lengths, DeepCom maintains similar accuracy, as shown in Figure 6(b). However, the accuracy of CODE-NN decreases sharply as the code comment length increases. When the code comment lengths are greater than 25 tokens, the accuracy of CODE-NN decreases to less than 10%. DeepCom still performs better when we need to generate comments consisting of 25-28 words.

Figure 6: The average BLEU-4 scores for different lengths of code and comments in Java (comparing DeepCom with SBT and CODE-NN). (a) BLEU-4 scores for different code lengths. (b) BLEU-4 scores for different comment lengths.

6 DISCUSSION

6.1 Qualitative analysis

Here, we perform a qualitative analysis of the human-written comments and the comments automatically generated by our approach. Table 5 shows some examples of Java methods, the comments generated by DeepCom, and the human-written comments. By analyzing the generated results, we find that the cases can be divided into the following situations.

6.1.1 Exactly correct comments. DeepCom can generate exactly correct comments from source code of different lengths (Case 1 and Case 2), which validates the capability of our approach to encode Java methods and decode comments. Generally, DeepCom

performs well when the business logic of these Java methods is clear and the code conventions are universal.

6.1.2 Algorithm implementations. For Java methods that are concerned more with an algorithm than with business logic, DeepCom can generate accurate comments. Algorithm-oriented Java methods usually use similar structures to implement the same algorithmic function. As Case 5 shows, the method "sort" aims to sort an array using Binary Sort; DeepCom captures the correct functionality and generates a correct comment.

6.1.3 Cases where generated comments are better than human-written ones. By analyzing the generated comments and the source code, we find that DeepCom performs better than the human-written comments when the Java methods aim to determine whether something is true or not. Developers sometimes write interrogative sentences as comments (shown in Case 6 and Case 7). These comments are nonstandard even though they can express the functionality of the Java methods. DeepCom can generate comments that are not only accurate but also more standard.

6.1.4 API-invocation-intensive Java methods. Developers usually invoke APIs to implement a specific function. These APIs include platform standard APIs and customized APIs defined by third parties or by the developers themselves. We find that DeepCom can generate accurate comments when most API invocations are platform standard APIs (shown in Case 1). However, when the majority of API invocations in a Java method are customized APIs, DeepCom does not perform as well as the human-written comments (shown in Case 4 and Case 9). The influence of API invocations indicates that DeepCom can learn the usage patterns of platform standard APIs from a large-scale dataset. However, it cannot learn customized APIs well, because customized APIs with the same name have different usage patterns in different programs.

6.1.5 Low BLEU score cases. The results with lower BLEU scores mainly fall into two types: meaningless sentences, and sentences with clear semantics. The former mainly contains empty sentences and results with too many repetitive words. We conjecture the problems come from out-of-vocabulary words in the original comments or from mismatches between the Java methods and comments in the original dataset.

Among the latter, most are irrelevant to the original comments in their semantics. There are also some interesting results that hold relevant semantics but obtain low BLEU scores (shown in Case 4). The automatically generated and the manual comments may describe similar functionality but with different words or word order.

6.1.6 Unknown words in generated comments. There are sometimes unknown words in the generated comments. As Case 3 shows, DeepCom fails to predict the token "FactoryConfigurationError", which is a method name defined by developers. DeepCom is not good at learning the method or identifier names that occur in comments. Developers define various names while programming, and most of these tokens appear at most once in the comments. During the training process, we replaced all unknown identifier tokens in AST sequences with their types, but we did not replace the unknown identifiers occurring in comments. It is hard for DeepCom to learn these user-defined tokens in comments that have been replaced by the unknown token ⟨UNK⟩.

6.2 Strengths of DeepCom

A major challenge for generating comments from code is the semantic gap between code and natural language descriptions. Existing approaches are based on manually crafted templates or information retrieval and lack a model to capture the semantic relationship between source code and natural language. DeepCom, a machine translation model, has the ability to bridge the gap between the two languages, i.e., programming language and natural language.

6.2.1 Probabilistic model connecting the semantics of code and comments. One advantage of DeepCom is that it generates comments directly by learning from source code instead of synthesizing comments from keywords or searching for similar code snippets' comments.

Synthesizing comments from keywords usually relies on manually crafted templates. The procedure of template definition is time-consuming, and the quality of the keywords depends on the quality of a given Java method. These approaches fail to extract accurate keywords when the identifiers and methods are poorly named. The IR based approaches usually search for similar code snippets and take their comments as the final results. These IR based approaches rely on whether similar code snippets can be retrieved and how similar the snippets are.

DeepCom builds language models for code and natural language descriptions. The language models are able to handle the uncertainty in the correspondence between code and text.

Table 5: Examples of comments generated by DeepCom. These samples are necessarily limited to short methods because of space limitations. The AST structure is not shown in the table, because the AST is much longer than the source code. (For each case: the Java method, the automatically generated comment, and the human-written comment.)

Case 1:
public static byte[] bitmapToByte(Bitmap b){
    ByteArrayOutputStream o = new ByteArrayOutputStream();
    b.compress(Bitmap.CompressFormat.PNG, 100, o);
    return o.toByteArray();
}
Automatically generated: convert Bitmap to byte array
Human-written: convert Bitmap to byte array

Case 2:
private static void addDefaultProfile(SpringApplication app, SimpleCommandLinePropertySource source){
    if(!source.containsProperty("spring.profiles.active")
            && !System.getenv().containsKey("SPRING_PROFILES_ACTIVE")){
        app.setAdditionalProfiles(Constants.SPRING_PROFILE_DEVELOPMENT);
    }
}
Automatically generated: If no profile has been configured, set by default the "dev" profile.
Human-written: If no profile has been configured, set by default the "dev" profile.

Case 3:
public FactoryConfigurationError(Exception e){
    super(e.toString());
    this.exception=e;
}
Automatically generated: Create a new ⟨UNK⟩ with a given Exception base cause of the error.
Human-written: Create a new FactoryConfigurationError with a given Exception base cause of the error.

Case 4:
protected void createItemsLayout(){
    if (mItemsLayout == null){
        mItemsLayout = new LinearLayout(getContext());
        mItemsLayout.setOrientation(LinearLayout.VERTICAL);
    }
}
Automatically generated: Creates item layouts if any parameters
Human-written: Creates item layout if necessary

Case 5:
public static void sort(Comparable[] a){
    int n = a.length;
    for (int i = 1; i < n; i++){
        Comparable v = a[i];
        int lo = 0, hi = i;
        while (lo < hi) { ... }
        ...
    }
    assert isSorted(a);
}
Automatically generated: Sorts the array in ascending order, using the natural order.
Human-written: Rearranges the array in ascending order, using the natural order.

Case 6:
public boolean isEmpty(){
    return root == null;
}
Automatically generated: Returns true if the symbol is empty.
Human-written: Is this symbol table empty?

Case 7:
public boolean contains(int key){
    return rank(key) != -1;
}
Automatically generated: Checks whether the given object is contained within the given set.
Human-written: Is the key in this set of integers?

Case 8:
public void tag(String inputFileName, String outputFileName, OutputFormat outputFormat){
    List<String> sentences = jsc.textFile(inputFileName).collect();
    tag(sentences, outputFileName, outputFormat);
}
Automatically generated: Replaces the message with a given tag
Human-written: Tags a text file, each sentence in a line and writes the result to an output file with a desired output format.

Case 9:
public void unlisten(String pattern){
    UtilListener listener = listeners.get(pattern);
    if(listener != null){
        listener.destroy();
        listeners.remove(pattern);
    } else {
        client.onError(Topic.RECORD, Event.NOT_LISTENING, pattern);
    }
}
Automatically generated: It can be called when the product only or refresh has ended.
Human-written: Removes a listener that was previously registered with listenForSubscriptions.

DeepCom learns common patterns from large-scale source code, and the encoder itself is a language model which remembers the likelihood of different Java methods. The decoder of DeepCom learns the context of source code, which bridges the gap between natural language and code. Furthermore, the attention mechanism helps align code tokens and natural language words.

6.2.2 Generation assisted by structural information. Programming languages are formal languages which are more structure-dense than text and have formal syntax and semantics. It is difficult for models to learn semantic and syntactic information at the same time given only code sequences. Existing approaches usually analyze source code directly and omit its syntax representation.

In contrast to traditional NMT models, DeepCom takes advantage of rich and unambiguous code structures. In this way, DeepCom bridges the gap between code and natural language with the assistance of the structure information within the source code.

the evaluation results, we find that the structural information im- with attention to produce summaries that describe C# code snip-
proves the quality of comments. The improvements for methods pets and SQL queries. It takes source code as plain text and models
implementing standard algorithms are much more obvious. Java the conditional distribution of the summary. Allamanis et al. [4]
methods realizing the same algorithm may define different variables apply a neural convolutional attentional model to the problem that
while their ASTs are much more similar. extremely summarizes the source code snippets into short, name-
like summaries. These learning-based approaches mainly learn the
6.3 Threats to Validity latent features from source code, such as semantics, formatting,
We have identified the following threats to validity: and etc. The comments are generated according to these learned
Automatic evaluation metrics: We evaluate the gap between features. The experimental results of them have proved the effective-
generated comments and human-written comments by machine ness of deep learning methods on code summarization. In this paper,
translation metric BLEU which is gradually used in generative- DeepCom integrates the structure information which is verified
based software issues [12, 20]. The reason for this setting is that we important for comments generation.
want to reduce the impact of the subjectivity of manual evaluation.
Quality of collected comments: We collected the comments for 7.2 Language models for source code
Java methods from the first sentence of Javadoc as other work Recently, thanks to the insight of Hindle et al. [17], there is an emerg-
does [12]. Although we define heuristic rules to decrease the noise ing interest in building language models of source code. These lan-
in comments, there are some mismatched comments in the dataset. guage models vary from n-gram model [1, 29], bimodal model [5],
In the future, we will investigate a better technique to build a better and RNNs [12, 19]. Hindle et al. [17] first propose to explore N-gram
parallel corpus. to model the source code and demonstrate that most software is
Comparisons on Java dataset: Another threat to validity is that also natural and find regularities in natural code. Some studies build
our approach is experimented on Java dataset. Although we fail to the models to bridge the gap between the programming language
evaluate DeepCom directly on CODE-NN’ dataset which is difficult and natural language descriptions. Allamanis et al. [1] develop a
to parse into ASTs, the results on Java have proved the effectiveness framework to learn the code conventions of a codebase and the
of DeepCom. In the future, we will extend our approach to other framework exploits N-gram model to name Java identifiers. Alla-
programming languages (e.g., Python). manis et al. [2] and Raychev et al. [33] suggest names for variables,
methods, and classes. Mou et al. [26] present a tree-based convo-
lutional neural networks to model the source code and classify
7 RELATED WORK programs. Gu et al. [12] present a classic encoder-decoder model
7.1 Code Summarization to bridge the gap between the Java API sequences and natural lan-
As a critical task in software engineering, code summarization aims guage. Yin and Neubig [47] build a data-driven syntax-based neural
to generate brief natural language descriptions for source code. network model for generating code from natural language.
Automatic code summarization approaches vary from manually- Learning from source code is applied to various software engi-
crafted template [24, 35, 36], IR [14, 15, 43] to learning-based ap- neering tasks, e.g., fault detection [32], code completion [27, 29],
proaches [4, 19, 28]. code clone [38] and code summarization [19]. In this paper, we
Creating manually-crafted templates to generate code comments explore the combination of deep learning methods and source code
is one of the most common code summarization approaches. Srid- features to generate code comments. Compared to the previous
hara et al. [35] use the Software Word Usage Model (SWUM) to cre- works, DeepCom explains the code summarization procedure from
ate a rule-based model that generates natural language descriptions a machine translation perspective. The experimental results also
for Java methods. Moreno et al. [25] predefine heuristic rules to prove the ability of DeepCom.
select information and generate comments for Java classes by com-
bining the information. These rule-based approaches have been ex- 8 CONCLUSION
panded to cover special types of code artifacts such as test cases [48] This paper formulates code summarization task as a machine trans-
and code changes [8]. Human templates usually synthesize com- lation problem which translates source code written in a program-
ments by extracting keywords from the given source code. ming language to comments in natural language. We propose Deep-
IR approaches are widely used in summary generation and usu- Com, an attention-based Seq2Seq model, to generate comments
ally search comments from similar code snippets. Haiduc et al. [15] for Java methods. DeepCom takes ASTs sequences as input. These
apply the Vector Space Model (VSM) and Latent Semantic Indexing ASTs are converted to specially formatted sequences using a new
Learning from source code is applied to various software engineering tasks, e.g., fault detection [32], code completion [27, 29], code clone detection [38], and code summarization [19]. In this paper, we explore the combination of deep learning methods and source code features to generate code comments. Compared to previous work, DeepCom approaches code summarization from a machine translation perspective. The experimental results also demonstrate the effectiveness of DeepCom.

8 CONCLUSION

This paper formulates the code summarization task as a machine translation problem that translates source code written in a programming language into comments written in natural language. We propose DeepCom, an attention-based Seq2Seq model, to generate comments for Java methods. DeepCom takes AST sequences as input. These ASTs are converted into specially formatted sequences using a new structure-based traversal (SBT) method. SBT expresses the structural information while keeping the representation lossless.
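To illustrate how a structure-based traversal can keep an AST recoverable from a flat token sequence, the sketch below (in Python) serializes a toy tree by emitting a bracket and the node label both when entering and when leaving a node. The node labels and the exact bracketing are simplifying assumptions for exposition and may differ from the precise output format DeepCom uses.

    # Illustrative structure-based traversal (SBT) over a toy AST.
    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def sbt(node):
        # Emit "(" label ... ")" label so the tree shape stays recoverable.
        sequence = ["(", node.label]
        for child in node.children:
            sequence.extend(sbt(child))
        sequence.extend([")", node.label])
        return sequence

    # Toy AST for a method with one parameter and a return statement.
    ast = Node("MethodDeclaration",
               [Node("FormalParameter"), Node("ReturnStatement")])
    print(" ".join(sbt(ast)))
    # ( MethodDeclaration ( FormalParameter ) FormalParameter
    #   ( ReturnStatement ) ReturnStatement ) MethodDeclaration

Because every subtree is enclosed by a matching pair of brackets annotated with the same label, the original tree can be rebuilt from the flat sequence, which is what makes such a representation lossless.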
DeepCom outperforms the state-of-the-art approaches and achieves better results on machine translation metrics. In future work, we plan to improve the effectiveness of our approach by introducing more domain-specific customizations. We also plan to apply our approach to other software engineering tasks that can be mapped to a machine translation problem (e.g., code migration).
REFERENCES
[1] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 281–293.
[2] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 38–49.
[3] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2017. A Survey of Machine Learning for Big Code and Naturalness. arXiv preprint arXiv:1709.06182 (2017).
[4] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning. 2091–2100.
[5] Miltos Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In International Conference on Machine Learning. 2123–2132.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. Computer Science (2014).
[7] Manfred Broy, Florian Deißenböck, and Markus Pizka. 2005. A holistic approach to software quality at work. In Proc. 3rd World Congress for Software Quality (3WCSQ).
[8] Raymond PL Buse and Westley R Weimer. 2010. Automatically documenting program changes. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 33–42.
[9] Ciprian Chelba, Dan Bikel, Maria Shugrina, Patrick Nguyen, and Shankar Kumar. 2012. Large scale language modeling in automatic speech recognition. arXiv preprint arXiv:1210.8440 (2012).
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[11] Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft, and Jeffrey C Carver. 2013. Evaluating source code summarization techniques: Replication and expansion. In Program Comprehension (ICPC), 2013 IEEE 21st International Conference on. IEEE, 13–22.
[12] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 631–642.
[13] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2017. DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning. arXiv preprint arXiv:1704.07734 (2017).
[14] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2. ACM, 223–226.
[15] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In Reverse Engineering (WCRE), 2010 17th Working Conference on. IEEE, 35–44.
[16] Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 763–773.
[17] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[19] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In ACL (1).
[20] Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically generating commit messages from diffs using neural machine translation. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 135–146.
[21] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558 (2016).
[22] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810 (2017).
[23] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. arXiv preprint arXiv:1704.04856 (2017).
[24] Paul W McBurney and Collin McMillan. 2014. Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension. ACM, 279–290.
[25] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In Program Comprehension (ICPC), 2013 IEEE 21st International Conference on. IEEE, 23–32.
[26] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In AAAI, Vol. 2. 4.
[27] Lili Mou, Rui Men, Ge Li, Lu Zhang, and Zhi Jin. 2015. On end-to-end program generation from user intention by deep neural networks. arXiv preprint arXiv:1510.07211 (2015).
[28] Dana Movshovitz-Attias and William W Cohen. 2013. Natural language models for predicting programming comments. (2013).
[29] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2013. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 532–542.
[30] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 574–584.
[31] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
[32] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar Devanbu. 2016. On the naturalness of buggy code. In Proceedings of the 38th International Conference on Software Engineering. ACM, 428–439.
[33] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from big code. In ACM SIGPLAN Notices, Vol. 50. ACM, 111–124.
[34] Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 (2015).
[35] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 43–52.
[36] Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Automatically detecting and describing high level actions within methods. In Proceedings of the 33rd International Conference on Software Engineering. ACM, 101–110.
[37] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[38] Jeffrey Svajlenko and Chanchal K Roy. 2016. A Machine Learning Based Approach for Evaluating Clone Detection Tools for a Generalized and Accurate Precision. International Journal of Software Engineering and Knowledge Engineering 26, 09n10 (2016), 1399–1429.
[39] Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869 (2015).
[40] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.
[41] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 87–98.
[42] Edmund Wong, Taiyue Liu, and Lin Tan. 2015. CloCom: Mining existing source code for automatic comment generation. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 380–389.
[43] Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. AutoComment: Mining question and answer sites for automatic comment generation. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 562–567.
[44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[45] Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering (2017).
[46] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337 (2015).
[47] Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. arXiv preprint arXiv:1704.01696 (2017).
[48] Sai Zhang, Cheng Zhang, and Michael D Ernst. 2011. Automated documentation inference to explain failed tests. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 63–72.