Deep Code Comment Generation∗
Xing Hu1, Ge Li1, Xin Xia2, David Lo3, Zhi Jin1
1 Key Laboratory of High Confidence Software Technologies (Peking University), MoE, Beijing, China
2 Faculty of Information Technology, Monash University, Australia
3 School of Information Systems, Singapore Management University, Singapore
1 {huxing0101,lige,zhijin}@pku.edu.cn, 2 xin.xia@monash.edu, 3 davidlo@smu.edu.sg
ABSTRACT
During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comment generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.

CCS CONCEPTS
• Software and its engineering → Documentation; • Computing methodologies → Neural networks;

KEYWORDS
program comprehension, comment generation, deep learning

ACM Reference Format:
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation. In Proceedings of IEEE/ACM International Conference on Program Comprehension, Gothenburg, Sweden, May 27 - May 28, 2018 (ICPC'18). ACM, New York, NY, USA, 11 pages. https://doi.org/10.475/123_4

* This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201, and the National Natural Science Foundation of China under Grant Nos. 61232015 and 61620106007. Zhi Jin and Ge Li are corresponding authors.

1 INTRODUCTION
In software development and maintenance, developers spend around 59% of their time on program comprehension activities [45]. Previous studies have shown that good comments are important to program comprehension, since developers can understand the meaning of a piece of code by reading the natural language description in the comments [35]. Unfortunately, due to tight project schedules and other reasons, code comments are often mismatched, missing or outdated in many projects. Automatic generation of code comments can not only save developers' time in writing comments, but also help in source code understanding.

Many approaches have been proposed to generate comments for methods [24, 35] and classes [25] of Java, which is the most popular programming language of the past 10 years1. Their techniques vary from the use of manually-crafted templates [25] to Information Retrieval (IR) [14, 15]. Moreno et al. [25] defined heuristics and stereotypes to synthesize comments for Java classes. These heuristics and stereotypes are used to select the information that will be included in the comment. Haiduc et al. [14, 15] applied IR approaches to generate summaries for classes and methods. IR approaches such as the Vector Space Model (VSM) and Latent Semantic Indexing (LSI) usually search comments from similar code snippets. Although promising, these techniques have two main limitations: First, they fail to extract accurate keywords for identifying similar code snippets when identifiers and methods are poorly named. Second, they rely on whether similar code snippets can be retrieved and on how similar the snippets are.

Recent years have seen an emerging interest in building probabilistic models for large-scale source code. Hindle et al. [17] have addressed the naturalness of software and demonstrated that code can be modeled by probabilistic models. Several subsequent studies have developed various probabilistic models for different software tasks [12, 23, 40, 41]. When applied to code summarization, different from IR-based approaches, existing probabilistic-model-based approaches usually generate comments directly from code instead of synthesizing them from keywords. One such probabilistic-model-based approach is by Iyer et al. [19], who propose an attention-based Recurrent Neural Network (RNN) model called CODE-NN. It builds a language model for natural language comments and aligns the words in comments with individual code tokens directly by an attention component. CODE-NN recommends code comments given source code snippets extracted from Stack Overflow. Experimental results demonstrate the effectiveness of probabilistic models on code summarization. These studies provide principled methods for probabilistically modeling and resolving ambiguities both in natural language descriptions and in the source code.

1 https://www.tiobe.com/tiobe-index/
[Figure 2: The overall framework of DeepCom, showing an example AST, the SBT encoder, the attention-based training model, and the generated code comment.]
of a sequence is computed via each of its tokens. That is,

P(x) = P(x_1)P(x_2 | x_1) ... P(x_n | x_1, ..., x_{n-1})    (1)

In this paper, we adopt a language model based on the deep neural network called Long Short-Term Memory (LSTM) [18]. LSTM is one of the state-of-the-art RNNs. LSTM outperforms the general RNN because it is capable of learning long-term dependencies. It is a natural model to use for source code, which has long dependencies (e.g., a class is used far away from its import statement). The details of RNN and LSTM are shown in Figure 1.
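To make Eq. (1) concrete, the following is a minimal Python sketch of how a language model scores a token sequence; the cond_prob callable is a hypothetical stand-in for any trained model and is not part of the paper.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Eq. (1) in log form: log P(x) = sum over t of log P(x_t | x_1, ..., x_{t-1}).
    `cond_prob(prefix, token)` is a placeholder for any trained language model."""
    return sum(math.log(cond_prob(tokens[:t], tokens[t])) for t in range(len(tokens)))

# Toy model that assigns every token probability 0.1, for illustration only.
print(sequence_log_prob(["public", "void", "run"], lambda prefix, token: 0.1))
```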
2.1.1 Recurrent Neural Networks. RNNs are intimately related to sequences and lists because of their chain-like nature. They can in principle map the entire history of previous inputs to each output. At each time step t, the unit in the RNN takes not only the input of the current step but also the hidden state output by the previous time step t − 1. As Figure 1(a) illustrates, the hidden state of time step t is updated according to the input vector x_t and its previous hidden state h_{t−1}, namely, h_t = tanh(W x_t + U h_{t−1} + b), where W, U, and b are the trainable parameters that are updated during training, and tanh is the activation function: tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}).
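As an illustration of the update rule above, here is a minimal NumPy sketch of a single vanilla RNN step; the dimensions and randomly initialized parameters are placeholders rather than values used in the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One vanilla RNN step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

input_dim, hidden_dim = 4, 8                      # toy sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_dim, input_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                          # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):       # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W, U, b)                 # h summarizes the history so far
```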
A prominent drawback of the standard RNN model is that gradients may explode or vanish during back-propagation. These phenomena often appear when long dependencies exist in the sequences. To address these problems, some researchers have proposed several variants to preserve long-term dependencies. These variants include LSTM and the Gated Recurrent Unit (GRU). In this paper, we adopt the LSTM, which has achieved success on many NLP tasks [6, 37].
2.1.2 Long Short-Term Memory. LSTM introduces a structure called the memory cell to address the difficulty ordinary RNNs have in learning long-term dependencies in the data. The LSTM is trained to selectively "forget" information from the hidden states, thus allowing room to take in more important information [18]. LSTM introduces a gating mechanism to control when and how to read previous information from the memory cell and write new information. The memory cell vector in the recurrent unit preserves long-term dependencies. In this way, LSTM handles long-term dependencies more effectively than a vanilla RNN. LSTM has been widely used to solve semantically related tasks and has achieved convincing performance. These advantages motivate us to exploit LSTM for building models for source code and comments. Figure 1(b) illustrates a typical LSTM unit; for more details of LSTM, please refer to [10, 18].
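The paper defers the exact LSTM formulation to [18]. Purely as an illustration, a standard LSTM cell with input, forget, and output gates can be sketched as follows; the stacked parameter layout is an assumption of this sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step. W, U, b stack the parameters of the input (i),
    forget (f), and output (o) gates and the candidate cell (g)."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g      # the memory cell preserves long-term information
    h_t = o * np.tanh(c_t)        # the hidden state exposes a gated view of the cell
    return h_t, c_t
```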
2.2 Neural Machine Translation
NMT [44] is an end-to-end learning approach for automated translation. It is a deep learning based approach and has made rapid progress in recent years. NMT has shown impressive results surpassing those of phrase-based systems while addressing shortcomings such as the need for hand-engineered features. Its architecture typically consists of two RNNs, one to consume the input text sequences and the other to generate the translated output sequences. It is often accompanied by an attention mechanism that aligns target with source tokens [6].

NMT bridges the gap between different natural languages. Generating comments from source code is a variant of the machine translation problem, between the source code and natural language. We explore whether the NMT approach can be applied to comment generation. In this paper, we follow the common Sequence-to-Sequence (Seq2Seq) [37] learning framework with attention [6], which helps cope effectively with long source code.

3 PROPOSED APPROACH
The transition process between source code and comments is similar to the translation process between different natural languages. Existing research has applied machine translation methods to translate code from one source language (e.g., Java) to another (e.g., C#) [13]. A few studies adopt machine translation methods for generating natural language descriptions from source code. Oda et al. [30] present a machine translation approach to generate natural language pseudo-code from source code at the statement level. In this paper, DeepCom translates the source code to a high-level description at the method level.

The overall framework of DeepCom is illustrated in Figure 2. DeepCom mainly consists of three stages: data processing, model training, and online testing. The source code we obtained from GitHub is parsed and preprocessed into a parallel corpus of Java methods and their corresponding comments. In order to learn the structural information, the Java methods are converted into AST sequences by a special traversal approach before being input into the model. With the parallel corpus of AST sequences and comments, we build and train generative neural models based on the idea of NMT. There are two challenges during the training process:
• How to represent ASTs to store the structural information and keep the representation unambiguous while traversing the ASTs?
• How to deal with out-of-vocabulary tokens in source code?
In the following paragraphs, we will introduce the details of the model and the approaches we propose to resolve the above-mentioned challenges.

3.1 Sequence-to-Sequence Model
In this paper, we apply a Sequence-to-Sequence (Seq2Seq) model to learn source code and generate comments. The Seq2Seq model is widely used for machine translation [37], text summarization [34], dialogue systems [39], etc. The model consists of three components, an Encoder, a Decoder, and an Attention component, in which the Encoder and Decoder are both LSTMs. Figure 3 illustrates the detailed Seq2Seq model.

[Figure 3: Sequence-to-Sequence model.]

3.1.1 Encoder. The encoder is an LSTM as described in Section 2 and is responsible for learning the source code. At each time step t, it reads one token x_t of the sequence, then updates and records the current hidden state s_t, namely,

s_t = f(x_t, s_{t−1})    (2)

where f is an LSTM unit that maps a word of the source language x_t into a hidden state s_t. The encoder learns latent features from source code, and the features are encoded into the context vector c. These latent features include the identifier naming conventions, control structures, etc. In this paper, DeepCom adopts the attention mechanism to compute the context vector c.

3.1.2 Attention. The attention mechanism selects the important parts from the input sequence for each target word. For example, the token "whether" in comments usually aligns with the "if" statements in the source code. The generation of each word is guided by a classic attention method proposed by Bahdanau et al. [6]. It defines an individual c_i for predicting each target word y_i as a weighted sum of all hidden states s_1, ..., s_m in the encoder, computed as

c_i = Σ_{j=1}^{m} α_{ij} s_j    (3)

The weight α_{ij} of each hidden state s_j is computed as

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{m} exp(e_{ik})    (4)

and

e_{ij} = a(h_{i−1}, s_j)    (5)

is an alignment model which scores how well the inputs around position j and the output at position i match.
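The alignment model a(·) is the one of Bahdanau et al. [6]; the additive form and the parameter names v, Wa, and Ua below follow that paper and are assumptions of this sketch rather than details given here.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def attention_context(h_prev, encoder_states, v, Wa, Ua):
    """Eqs. (3)-(5): e_ij = a(h_{i-1}, s_j), alpha_ij = softmax over j, c_i = weighted sum.
    The additive scoring function below is the Bahdanau-style choice; v, Wa, Ua are
    illustrative trainable parameters."""
    scores = np.array([v @ np.tanh(Wa @ h_prev + Ua @ s_j) for s_j in encoder_states])
    alpha = softmax(scores)                                   # Eq. (4)
    context = (alpha[:, None] * encoder_states).sum(axis=0)   # Eq. (3)
    return context, alpha
```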
3.1.3 Decoder. The Decoder aims to generate the target sequence y by sequentially predicting the probability of a word y_i conditioned on the context vector c_i and its previously generated words y_1, ..., y_{i−1}, i.e.,

p(y_i | y_1, ..., y_{i−1}, x) = g(y_{i−1}, h_i, c_i)    (6)

where g is used to estimate the probability of the word y_i. The goal of the model is to minimize the cross-entropy, i.e., minimize the following objective function:

H(y) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{n} log p(y_j^{(i)})    (7)

where N is the total number of training instances, and n is the length of each target sequence. y_j^{(i)} denotes the jth word in the ith instance. By optimizing the objective function using optimization algorithms such as gradient descent, the parameters can be estimated.
code, and the features are encoded into the context vector c. These
latent features include the identifiers naming conventions, control
3.2 Abstract Syntax Tree with SBT traversal
structures, and etc. In this paper, DeepCom adopts the attention Translation between source code and NL is challenging due to the
mechanism to compute the context vector c. structure of source code. One simple way to model source code
is to just view it as plain text. However, in such way, the struc-
3.1.2 Attention. Attention mechanism is a recent model that ture information will be omitted, which will cause inaccuracies
selects the important parts from the input sequence for each target in the generated comments. To learn the semantic and syntactic
word. For example, the token “whether” in comments usually aligns information at the same time, we convert the ASTs into specially
with the “if” statements in the source code. The generation of each formatted sequences by traversing the ASTs. Sequences obtained
word is guided by a classic attention method proposed by Bahdanau by classical traversal methods (e.g., pre-order traversal) are lossy
et al. [6]. since the original ASTs cannot unambiguously be reconstructed
It defines individual c i for predicting each target word yi as a back from them. This ambiguity may cause different Java methods
weighted sum of all hidden states s 1 , .., sm in encoder and computed (each with different comments) to be mapped to the same sequence
It is confusing for the neural network if there are multiple labels (in our setting, comments) given to a specific input. To address this problem, we propose a Structure-based Traversal (SBT) method to traverse the AST. The details are presented in Algorithm 1. Figure 4 illustrates a simple example of SBT traversing a tree, and the detailed procedure is as follows:

• From the root node, we first use a pair of brackets to represent the tree structure and put the root node itself behind the right bracket, that is (1)1, as shown in Figure 4.
• Next, we traverse the subtrees of the root node and put all root nodes of subtrees into the brackets, i.e., (1(2)2(3)3)1.
• Recursively, we traverse each subtree until all nodes are traversed.

Figure 4: An example of sequencing an AST to a sequence by SBT. (For a number, the bold number after a bracket indicates the node itself, and the number in brackets denotes the tree structure rooted at that node.)

[Figure: the AST of an example Java method and its corresponding SBT sequence, in which each subtree appears as a bracketed pair labeled with its root node, e.g., ( SimpleType ( SimpleName_String ) SimpleName_String ) SimpleType.]
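Algorithm 1 itself is not reproduced in this excerpt. Based on the bracketed examples in the procedure above and the SBT sequence shown in the figure, the traversal can be sketched as follows; the Node class is illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def sbt(node: Node) -> List[str]:
    """Structure-based traversal: each subtree becomes
    "(" + label + <SBT of each child> + ")" + label,
    so the original tree can be rebuilt unambiguously from the sequence."""
    tokens = ["(", node.label]
    for child in node.children:
        tokens.extend(sbt(child))
    tokens += [")", node.label]
    return tokens

# The abstract example from the procedure above: root 1 with subtrees 2 and 3.
print("".join(sbt(Node("1", [Node("2"), Node("3")]))))   # (1(2)2(3)3)1
```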
Table 1: Statistics for code snippets in our dataset

  #Methods   #All Tokens   #All Identifiers   #Unique Tokens   #Unique Identifiers
  69,708     8,713,079     2,711,496          234,146          234,055

Table 2: Statistics for code lengths and comment lengths

  Code lengths:      Avg 99.94   Mode 16   Median 65   <100: 68.63%   <150: 82.06%   <200: 89.00%
  Comment lengths:   Avg 8.86    Mode 8    Median 13   <20: 75.50%    <30: 86.79%    <50: 95.45%

We take the first sentence of the Javadoc as the comment, since it typically describes the functionality of Java methods according to the Javadoc guidance4. Empty or one-word descriptions are filtered out in this work because such comments cannot express the functionality of Java methods. We also exclude setter, getter, constructor and test methods, since it is easy for a model to generate comments for them.

Finally, we get 69,708 ⟨Java method, comment⟩ pairs5. Similar to Jiang et al. [20]'s work, we randomly select 80% of the pairs for training, 10% of the pairs for validation, and the remaining 10% for testing. Table 1 and Table 2 illustrate statistics of the corpus. We also give the details of method lengths and comment lengths. The average lengths of Java methods and comments are 99.94 and 8.86 tokens in this corpus. We find that more than 95% of code comments have no more than 50 words and about 90% of Java methods are no longer than 200 tokens.

4 http://www.oracle.com/technetwork/articles/java/index-137868.html
5 Data is available at https://github.com/huxingfree/DeepCom
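The filtering rules are described only informally; the sketch below restates them, with simplified name-based predicates that are assumptions of this illustration rather than the authors' exact heuristics.

```python
def keep_pair(method_name: str, is_constructor: bool, is_test: bool, comment: str) -> bool:
    """Keep a <Java method, comment> pair only if the comment is informative and
    the method is not a setter, getter, constructor, or test method."""
    if len(comment.split()) <= 1:        # drop empty or one-word descriptions
        return False
    if is_constructor or is_test:
        return False
    if method_name.lower().startswith(("set", "get")):
        return False
    return True
```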
During training, the numerals and strings are replaced with generic tokens ⟨NUM⟩ and ⟨STR⟩ respectively. The maximum length of AST sequences is set to 400. We use a special symbol ⟨PAD⟩ to pad the shorter sequences, and the longer sequences are cut to 400 tokens. We add special tokens ⟨START⟩ and ⟨EOS⟩ to the decoder sequences during training. ⟨START⟩ marks the start of the decoding sequence and ⟨EOS⟩ marks its end. The maximum comment length is set to 30. The vocabulary sizes for AST sequences and comments are both 30,000 in this paper. While there is no ⟨UNK⟩ in AST sequences, there are a few out-of-vocabulary tokens in comments, which are replaced by ⟨UNK⟩.
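A minimal sketch of this preprocessing step; the ASCII token spellings and helper names are illustrative, not taken from the released code.

```python
PAD, START, EOS, UNK = "<PAD>", "<START>", "<EOS>", "<UNK>"
MAX_AST_LEN, MAX_COMMENT_LEN = 400, 30

def prepare_source(ast_tokens):
    """Truncate an SBT token sequence to 400 tokens and pad shorter ones."""
    seq = ast_tokens[:MAX_AST_LEN]
    return seq + [PAD] * (MAX_AST_LEN - len(seq))

def prepare_target(comment_tokens, vocab):
    """Wrap a comment with <START>/<EOS>, truncate to 30 tokens, and map
    out-of-vocabulary words to <UNK>."""
    seq = [w if w in vocab else UNK for w in comment_tokens[:MAX_COMMENT_LEN]]
    return [START] + seq + [EOS]
```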
4.1 Training Details
The model is validated every 2,000 minibatches on the validation set by BLEU [31], which is a commonly used automatic metric for NMT. Training runs for about 50 epochs and we select the model with the best results on the validation set as the final model. The model is then evaluated on the test set by computing average BLEU scores, and the results will be discussed in Section 5. All models are implemented using the TensorFlow framework6 and extended based on the Seq2Seq model in the TensorFlow tutorials7. The parameters are as follows:
• SGD (with minibatch size 100, randomly chosen from the training instances) is used to train the parameters.
• DeepCom uses two-layered LSTMs with 512-dimensional hidden states and 512-dimensional word embeddings.
• The learning rate is set to 0.5 and we clip the gradient norm at 5. The learning rate is decayed using the rate 0.99.
• To prevent over-fitting, we use dropout with 0.5.

6 https://www.tensorflow.org/
7 https://github.com/tensorflow/nmt
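For reference, the hyper-parameters reported above can be collected into a single configuration sketch; the key names are illustrative, and the actual implementation extends the TensorFlow NMT tutorial code.

```python
# Hyper-parameters reported in Section 4.1, gathered here for reference only.
HPARAMS = {
    "optimizer": "SGD",
    "batch_size": 100,
    "num_layers": 2,             # two-layered LSTMs
    "hidden_size": 512,          # dimensionality of the LSTM hidden states
    "embedding_size": 512,       # dimensionality of the word embeddings
    "learning_rate": 0.5,
    "learning_rate_decay": 0.99,
    "max_gradient_norm": 5,      # gradient clipping threshold
    "dropout": 0.5,
    "max_ast_length": 400,
    "max_comment_length": 30,
    "vocabulary_size": 30000,
}
```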
4.2 Evaluation Measure: BLEU-4
DeepCom uses the machine translation evaluation metric BLEU-4 [31] to measure the quality of generated comments. The BLEU score is a widely-used accuracy measure for NMT [22] and has been used in the evaluation of software tasks [12, 20]. It calculates the similarity between the generated sequence and a reference sequence (usually a human-written sequence). The BLEU score ranges from 0 to 100 as a percentage value. The higher the BLEU, the closer the candidate is to the reference. If the candidate is completely equal to the reference, the BLEU becomes 100%. Jiang et al. [20] exploit it to evaluate generated summaries for commit messages. Gu et al. [12] use BLEU to evaluate the accuracy of API sequences generated from natural language queries. Their experiments show that the BLEU score is a reasonable measure of the accuracy of generated sequences. It computes the n-gram precision of a candidate sequence with respect to the reference. The score is computed as:

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)    (8)

where p_n is the ratio of length-n subsequences in the candidate that are also in the reference. In this paper, we set N to 4, which is the maximum number of grams. BP is the brevity penalty,

BP = 1 if c > r;  e^{(1−r/c)} if c ≤ r    (9)

where c is the length of the candidate translation and r is the effective reference sequence length.

In this paper, we regard a generated comment as a candidate and a programmer-written comment (extracted from the Javadoc) as a reference.
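As an illustration of Eqs. (8) and (9), a simplified sentence-level BLEU-4 (single reference, uniform weights w_n = 1/4, no smoothing) can be computed as follows; this is a sketch, not the evaluation script used in the paper.

```python
import math
from collections import Counter

def bleu4(candidate, reference, max_n=4):
    """Simplified BLEU-4: geometric mean of modified n-gram precisions (Eq. 8)
    multiplied by the brevity penalty (Eq. 9). Inputs are token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0                      # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)

print(bleu4("adds a value to the list".split(), "adds the given value to the list".split()))
```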
5 RESULTS
In this section, we evaluate different approaches by measuring their accuracy on generating Java methods' comments. Specifically, we mainly focus on the following research questions:
• RQ1: How effective is DeepCom compared with the state-of-the-art baseline?
• RQ2: How effective is DeepCom for source code and comments of varying lengths?

5.1 RQ1: DeepCom vs. Baseline
5.1.1 Baseline. We compare DeepCom with CODE-NN [19], which is a state-of-the-art code summarization approach and also a deep learning based method. CODE-NN is an end-to-end generation
Table 3: Evaluation results on Java methods

  Approaches                 BLEU-4 score (%)
  CODE-NN                    25.30
  Seq2Seq                    34.87
  Attention-based Seq2Seq    35.50
  DeepCom (Pre-order)        36.01
  DeepCom (SBT)              38.17

Table 4: Evaluation results on CODE-NN datasets including C# and SQL programming languages

  Language   Approaches   BLEU-4 score (%)
  C#         CODE-NN      20.4
  C#         Seq2Seq      30.00
  SQL        CODE-NN      17.0
  SQL        Seq2Seq      30.94
[Figure: (a) BLEU-4 scores for different code lengths.]

4 and Case 9). The influence of API invocations shows that DeepCom can learn standard platform API usage patterns from a large-scale dataset. However, it cannot learn customized APIs well, because customized APIs with the same name have different usage patterns in different programs.

6.1.5 Low BLEU score cases. The results with lower BLEU scores are mainly divided into two types: meaningless sentences, and sentences with clear semantics. The former mainly contains empty sentences and results with too many repetitive words. We conjecture the problems come from out-of-vocabulary words in the original comments or mismatches between the Java methods and comments in the original dataset.

Among the latter, most are irrelevant to the original comments in their semantics. There are also some interesting results that hold relevant semantics but gain low BLEU scores (shown in Case 4). The automatically generated and manual comments may describe similar functionalities but with different words or order.
Table 5: Examples of generated comments by DeepCom. These samples are necessarily limited to short methods because of space limitations. The AST structure is not shown in the table, because the AST is much longer than the source code.

Case 3:
  public FactoryConfigurationError(Exception e){
      super(e.toString());
      this.exception=e;
  }
  Automatically generated: Create a new ⟨UNK⟩ with a given Exception base cause of the error.
  Human-written: Create a new FactoryConfigurationError with a given Exception base cause of the error.

Case 7:
  public boolean contains(int key){
      return rank(key) != -1;
  }
  Automatically generated: Checks whether the given object is contained within the given set.
  Human-written: Is the key in this set of integers?
learns common patterns from large-scale source code, and the encoder itself is a language model which remembers the likelihood of different Java methods. The decoder of DeepCom learns the context of source code, which bridges the gap between natural language and code. Furthermore, the attention mechanism helps align code tokens and natural language words.

6.2.2 Generation assisted by structural information. Programming languages are formal languages which are more structure-dense than text and have formal syntax and semantics. It is difficult for models to learn semantic and syntax information at the same time just given code sequences. Existing approaches usually analyze source code directly and omit its syntax representation. In contrast to traditional NMT models, DeepCom takes advantage of rich and unambiguous code structures. In this way, DeepCom bridges the gap between code and natural language with the assistance of structure information within the source code.
From the evaluation results, we find that the structural information improves the quality of comments. The improvements for methods implementing standard algorithms are much more obvious. Java methods realizing the same algorithm may define different variables while their ASTs are much more similar.

6.3 Threats to Validity
We have identified the following threats to validity:
Automatic evaluation metrics: We evaluate the gap between generated comments and human-written comments by the machine translation metric BLEU, which is increasingly used for generative software engineering tasks [12, 20]. The reason for this setting is that we want to reduce the impact of the subjectivity of manual evaluation.
Quality of collected comments: We collected the comments for Java methods from the first sentence of the Javadoc, as other work does [12]. Although we define heuristic rules to decrease the noise in comments, there are some mismatched comments in the dataset. In the future, we will investigate a better technique to build a better parallel corpus.
Comparisons on the Java dataset: Another threat to validity is that our approach is evaluated on a Java dataset. Although we do not evaluate DeepCom directly on the CODE-NN dataset, which is difficult to parse into ASTs, the results on Java have demonstrated the effectiveness of DeepCom. In the future, we will extend our approach to other programming languages (e.g., Python).

7 RELATED WORK
7.1 Code Summarization
As a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. Automatic code summarization approaches vary from manually-crafted templates [24, 35, 36] and IR [14, 15, 43] to learning-based approaches [4, 19, 28].

Creating manually-crafted templates to generate code comments is one of the most common code summarization approaches. Sridhara et al. [35] use the Software Word Usage Model (SWUM) to create a rule-based model that generates natural language descriptions for Java methods. Moreno et al. [25] predefine heuristic rules to select information and generate comments for Java classes by combining the information. These rule-based approaches have been expanded to cover special types of code artifacts such as test cases [48] and code changes [8]. Human templates usually synthesize comments by extracting keywords from the given source code.

IR approaches are widely used in summary generation and usually search comments from similar code snippets. Haiduc et al. [15] apply the Vector Space Model (VSM) and Latent Semantic Indexing (LSI) to generate term-based comments for classes and methods. Their work was replicated and expanded by Eddy et al. [11], who exploit a hierarchical topic model. Wong et al. [42] apply code clone detection techniques to find similar code snippets and use the comments from similar code snippets. The work is similar to their previous work AutoComment [43], which mines human-written descriptions for automatic comment generation from Stack Overflow.

Recently, some studies try giving natural language summaries by deep learning approaches. Iyer et al. [19] present RNN networks with attention to produce summaries that describe C# code snippets and SQL queries. It takes source code as plain text and models the conditional distribution of the summary. Allamanis et al. [4] apply a neural convolutional attentional model to the problem of extremely summarizing source code snippets into short, name-like summaries. These learning-based approaches mainly learn latent features from source code, such as semantics, formatting, etc. The comments are generated according to these learned features. Their experimental results have proved the effectiveness of deep learning methods on code summarization. In this paper, DeepCom integrates structural information, which is verified to be important for comment generation.

7.2 Language Models for Source Code
Recently, thanks to the insight of Hindle et al. [17], there is an emerging interest in building language models of source code. These language models vary from the n-gram model [1, 29] and bimodal model [5] to RNNs [12, 19]. Hindle et al. [17] first propose to explore n-grams to model source code and demonstrate that most software is natural and exhibits regularities. Some studies build models to bridge the gap between programming language and natural language descriptions. Allamanis et al. [1] develop a framework to learn the code conventions of a codebase; the framework exploits an n-gram model to name Java identifiers. Allamanis et al. [2] and Raychev et al. [33] suggest names for variables, methods, and classes. Mou et al. [26] present a tree-based convolutional neural network to model source code and classify programs. Gu et al. [12] present a classic encoder-decoder model to bridge the gap between Java API sequences and natural language. Yin and Neubig [47] build a data-driven syntax-based neural network model for generating code from natural language.

Learning from source code is applied to various software engineering tasks, e.g., fault detection [32], code completion [27, 29], code clone detection [38] and code summarization [19]. In this paper, we explore the combination of deep learning methods and source code features to generate code comments. Compared to previous works, DeepCom explains the code summarization procedure from a machine translation perspective. The experimental results also prove the ability of DeepCom.

8 CONCLUSION
This paper formulates the code summarization task as a machine translation problem which translates source code written in a programming language to comments in natural language. We propose DeepCom, an attention-based Seq2Seq model, to generate comments for Java methods. DeepCom takes AST sequences as input. These ASTs are converted to specially formatted sequences using a new structure-based traversal (SBT) method. SBT can express the structural information and keep the representation lossless at the same time. DeepCom outperforms the state-of-the-art approaches and achieves better results on machine translation metrics. In future work, we plan to improve the effectiveness of our proposed approach by introducing more domain-specific customizations. We also plan to apply our proposed approach to other software engineering tasks that can be mapped to a machine translation problem (e.g., code migration, etc.).