Convolutional Neural Networks over Tree Structures for Programming Language Processing

This paper introduces a Tree-Based Convolutional Neural Network (TBCNN) designed for programming language processing, leveraging the structural information in abstract syntax trees (ASTs). The TBCNN architecture is shown to effectively classify programs and detect code patterns, outperforming traditional machine learning methods that rely heavily on feature engineering. The study highlights the importance of deep learning techniques in automating the analysis of programming languages, which differ significantly from natural languages.

Convolutional Neural Networks over Tree Structures for Programming Language Processing

Lili Mou,1 Ge Li,1∗ Lu Zhang,1 Tao Wang,2 Zhi Jin1∗


1 Software Institute, Peking University (∗ corresponding authors)
doublepower.mou@gmail.com, {lige,zhanglu,zhijin}@sei.pku.edu.cn
2 Stanford University, twangcat@stanford.edu

arXiv:1409.5718v2 [cs.LG] 8 Dec 2015

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.

Introduction

Researchers from various communities are showing growing interest in applying artificial intelligence (AI) techniques to solve software engineering (SE) problems (Dietz et al. 2009; Bettenburg and Begel 2013; Hao et al. 2013). In the area of SE, analyzing program source code (called programming language processing in this paper) is of particular importance.

Even though computers can run programs, they do not truly “understand” programs. Analyzing source code provides a way of estimating programs’ behavior, functionality, complexity, etc. For instance, automatically detecting source code snippets of certain patterns helps programmers to discover buggy or inefficient algorithms so as to improve code quality. Another example is managing large software repositories, where automatic source code classification and tagging are crucial to software reuse. Programming language processing, in fact, serves as a foundation for many SE tasks, e.g., requirement analysis (Ghabi and Egyed 2012), software development and maintenance (Bettenburg and Begel 2013).

Hindle et al. (2012) demonstrate that programming languages, similar to natural languages, also contain abundant statistical properties, which are important for program analysis. These properties are difficult to capture by humans, but justify learning-based approaches for programming language processing. However, existing machine learning program analysis depends largely on feature engineering, which is labor-intensive and ad hoc to a specific task, e.g., code clone detection (Chilowicz, Duris, and Roussel 2009), and bug detection (Steidl and Gode 2013). Further, evidence in the machine learning literature suggests that human-engineered features may fail to capture the nature of data, so they may be even worse than automatically learned ones.

The deep neural network, also known as deep learning, is a highly automated learning machine. By exploring multiple layers of non-linear transformation, the deep architecture can automatically learn complicated underlying features, which are crucial to the task of interest. Over the past few years, deep learning has made significant breakthroughs in various fields, such as speech recognition (Dahl, Mohamed, and Hinton 2010), computer vision (Krizhevsky, Sutskever, and Hinton 2012), and natural language processing (Collobert and Weston 2008).

Despite some similarities between natural languages and programming languages, there are also obvious differences (Pane, Ratanamahatana, and Myers 2001). Based on a formal language, programs contain rich and explicit structural information. Even though structures also exist in natural languages, they are not as stringent as in programs. Pinker (1994) illustrates an interesting example, “The dog the stick the fire burned beat bit the cat.” This sentence complies with all grammar rules, but too many attributive clauses are nested. Hence, it can hardly be understood by people due to the limitation of human intuition capacity. On the contrary, three nested loops are common in programs. The parse tree of a program, in fact, is typically much larger than that of a natural language sentence: there are approximately 190 nodes on average in our experiment, whereas a sentence comprises only 20 words in a sentiment analysis dataset (Socher et al. 2013). Further, the grammar rules “alias” neighboring relationships among program components. The statements inside and outside a loop, for example, do not form one semantic group, and thus are not semantically neighboring. On the above basis, we think more effective neural models are needed to capture structural information in programs.
In this paper, we propose a novel Tree-Based Convolutional Neural Network (TBCNN) based on programs’ abstract syntax trees (ASTs). We also introduce the notion of “continuous binary trees” and apply dynamic pooling to cope with ASTs of different sizes and shapes. The TBCNN model is a generic architecture, and is applied to two SE tasks in our experiments: classifying programs by functionalities and detecting code snippets of certain patterns. It outperforms baseline methods in both tasks, including the recursive neural network (Socher et al. 2011b) proposed for NLP. To the best of our knowledge, this paper is also the first to apply deep neural networks to the field of programming language processing.1

1 We make our source code and the collected dataset available through our website (https://sites.google.com/site/treebasedcnn/).

Related Work

Deep neural networks have made significant breakthroughs in many fields. Stacked restricted Boltzmann machines and autoencoders are successful pretraining methods (Hinton, Osindero, and Teh 2006; Bengio et al. 2006). They explore the underlying features of data in an unsupervised manner, and give a more meaningful initialization of weights for later supervised learning with deep neural networks. These approaches work well with generic data (e.g., data located in a manifold embedded in a certain dimensional space), but they may not be suitable for programming language processing, because programs contain rich structural information. Further, AST structures also vary largely among different data samples (programs), and hence they cannot be fed directly to a fixed-size network.

To capture explicit structures in data, it may be important and beneficial to integrate human priors to the networks (Bengio, Courville, and Vincent 2013). One example is convolutional neural networks (CNNs, LeCun et al. 1995; Krizhevsky, Sutskever, and Hinton 2012), which specify spatial neighboring information in data. CNNs work with signals of a certain dimension (e.g., images); they also fail to capture tree-structural information as in programs.

Socher et al. (2013, 2011b) propose a recursive neural network (RNN) for NLP. Although structural information may be coded to some extent in RNNs, the major drawback is that only the root features are used for supervised learning, which buries illuminating information under a complicated neural architecture. RNNs also suffer from the difficulty of training due to the long dependency path during back-propagation (Bengio, Simard, and Frasconi 1994).

Subsequent work. After the preliminary version of this paper was preprinted on arXiv,2 Zaremba and Sutskever (2014) use recurrent neural networks to estimate the output of restricted python programs. Piech et al. (2015) build recursive networks on Hoare triples. Regarding the proposed TBCNN, we extend it to process syntactic parse trees of natural languages (Mou et al. 2015); Duvenaud et al. (2015) apply a similar convolutional network over graphs to analyze molecules.

2 On 18 September 2014 (http://arxiv.org/abs/1409.5718v1).

Tree-Based Convolutional Neural Network

Programming languages have a natural tree representation: the abstract syntax tree (AST). Figure 1a shows the AST of the code snippet “int a=b+3;”.3 Each node in the AST is an abstract component in program source code. A node $p$ with children $c_1, \cdots, c_n$ represents the constructing process of the component $p \to c_1 \cdots c_n$.

3 Parsed by pycparser (https://pypi.python.org/pypi/pycparser/).

Figure 1b shows the overall architecture of TBCNN. In our model, an AST node is first represented as a distributed, real-valued vector so that the (anonymous) features of the symbols are captured. The vector representations are learned by a coding criterion in our previous work (Peng et al. 2015). Then we design a set of subtree feature detectors, called the tree-based convolution kernel, sliding over the entire AST to extract structural information of a program. We thereafter apply dynamic pooling to gather information over different parts of the tree. Finally, a hidden layer and an output layer are added. For supervised classification tasks, the activation function of the output layer is softmax.

Figure 1: (a) Illustration of an AST, corresponding to the C code snippet “int a=b+3;”. It should be noted that our model takes as input the entire AST of a program, which is typically much larger. (b) The architecture of the Tree-Based Convolutional Neural Network (TBCNN). The main components in our model include vector representation and coding, tree-based convolution and dynamic pooling; then a fully-connected hidden layer and an output layer (softmax) are added. [Panel (a) node labels: Decl, TypeDecl, BinaryOp, IdentifierType, ID, Constant. Panel (b) stages: vector representation and coding, tree-based convolution, dynamic (max) pooling, fully connected hidden layer, fully connected softmax output.]

In the rest of this section, we first explain the coding criterion for AST nodes’ representation learning, serving as a pretraining phase of programming language processing. We then describe the proposed TBCNN model, including a coding layer, a convolutional layer, and a pooling layer. We also provide additional information on dealing with nodes that have varying numbers of child nodes, as in ASTs, by introducing the notion of continuous binary trees.

Representation Learning for AST Nodes

Vector representations, sometimes known as embeddings, can capture underlying meanings of discrete symbols, like AST nodes. We propose in our previous work (Peng et al. 2015) an unsupervised approach to learn program vector representations by a coding criterion, which serves as a way of pretraining.

A generic criterion for representation learning is “smoothness”: similar symbols have similar feature vectors (Bengio, Courville, and Vincent 2013). For example, the symbols While and For are similar because both of them are related to control flow, particularly loops. But they are different from ID, since ID probably represents some data. In our scenario, we would like the child nodes’ representations to “code” their parent node’s via a single neural layer, during which both vector representations and coding weights are learned. Formally, let $\mathrm{vec}(\cdot) \in \mathbb{R}^{N_f}$ be the feature representation of a symbol, where $N_f$ is the feature dimension. For each non-leaf node $p$ and its direct children $c_1, \cdots, c_n$, we would like

$$\mathrm{vec}(p) \approx \tanh\Big(\sum_i l_i W_{\mathrm{code},i} \cdot \mathrm{vec}(c_i) + b_{\mathrm{code}}\Big) \tag{1}$$

where $W_{\mathrm{code},i} \in \mathbb{R}^{N_f \times N_f}$ is the weight matrix corresponding to the node $c_i$; $b_{\mathrm{code}} \in \mathbb{R}^{N_f}$ is the bias. $l_i = \frac{\#\text{leaves under } c_i}{\#\text{leaves under } p}$ is the coefficient of the weight. (Weights $W_{\mathrm{code},i}$ are weighted by leaf numbers.)

Because different nodes may have different numbers of children, the number of $W_{\mathrm{code},i}$’s is not fixed. To overcome this problem, we introduce the “continuous binary tree,” where only two weight matrices $W^l_{\mathrm{code}}$ and $W^r_{\mathrm{code}}$ serve as model parameters. $W_{\mathrm{code},i}$ is a linear combination of the two parameter matrices according to the position of node $i$. Details are deferred to the last part of this section.
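As a rough illustration of the right-hand side of Equation 1, the following sketch computes a parent's coded vector from its children in plain Python. It is not the authors' implementation: the function names (`matvec`, `code_parent`), the toy numbers, and the per-child weight matrices passed in directly are all assumptions for illustration. It also assumes that the parent's leaf count is the sum of its children's leaf counts, with a leaf child counting as one leaf.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def code_parent(child_vecs, child_leaf_counts, child_weights, b_code):
    """Compute tanh(sum_i l_i * W_code,i . vec(c_i) + b_code).

    l_i = (#leaves under c_i) / (#leaves under p). Assumption: the number of
    leaves under p equals the sum of the leaf counts of its children.
    """
    total_leaves = sum(child_leaf_counts)
    acc = list(b_code)
    for vec_c, leaves, W in zip(child_vecs, child_leaf_counts, child_weights):
        l_i = leaves / total_leaves
        for k, val in enumerate(matvec(W, vec_c)):
            acc[k] += l_i * val
    return [math.tanh(a) for a in acc]

# Toy example with N_f = 2 and two children (all numbers are made up).
identity = [[1.0, 0.0], [0.0, 1.0]]
coded = code_parent(
    child_vecs=[[0.5, -0.5], [0.1, 0.3]],
    child_leaf_counts=[1, 3],            # gives l = 0.25 and 0.75
    child_weights=[identity, identity],  # hypothetical W_code,i matrices
    b_code=[0.0, 0.0],
)
```

With identity weights the pre-activation is just the leaf-weighted average of the children, which makes the role of the coefficients $l_i$ easy to check by hand.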
Figure 2: (a) Tree-based convolution. Nodes on the left are the feature vectors of AST nodes. They are either pretrained or combined with pretrained and coded vectors. (b) An illustration of 3-way pooling (regions TOP, LOWER_LEFT, LOWER_RIGHT). (c) An analogy to the continuous binary tree model. In the triangle, the color of a pixel is a combination of three primary colors; in the convolution process, the weight for a node is a combination of three weight parameters, namely $W^t_{\mathrm{conv}}$, $W^l_{\mathrm{conv}}$, and $W^r_{\mathrm{conv}}$.

The closeness between $\mathrm{vec}(p)$ and its coded vector is measured by Euclidean distance square, i.e.,

$$d = \Big\| \mathrm{vec}(p) - \tanh\Big(\sum_i l_i W_{\mathrm{code},i} \cdot \mathrm{vec}(c_i) + b_{\mathrm{code}}\Big) \Big\|_2^2$$

To prevent the pretraining algorithm from learning trivial representations (e.g., 0’s will give 0 distance but are meaningless), negative sampling is applied like Collobert et al. (2011). For each pretraining data sample $p, c_1, \cdots, c_n$, we substitute one symbol (either $p$ or one of $c$’s) with a random symbol. The distance of the negative sample is denoted as $d_c$, which should be at least larger than that of the positive training sample plus a margin $\Delta$ (set to 1 in our experiment). Thus, the pretraining objective is to

$$\underset{W^l_{\mathrm{code}},\, W^r_{\mathrm{code}},\, b_{\mathrm{code}},\, \mathrm{vec}(\cdot)}{\text{minimize}} \;\; \max\{0,\ \Delta + d - d_c\}$$
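The margin-based objective with negative sampling can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`squared_distance`, `margin_loss`, `corrupt`), the toy vectors, and the small vocabulary are hypothetical, and excluding the original symbol when drawing the random replacement is a choice made here that the text does not specify.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def margin_loss(d_pos, d_neg, margin=1.0):
    """max{0, margin + d - d_c}: zero once the negative sample is farther by the margin."""
    return max(0.0, margin + d_pos - d_neg)

def corrupt(symbols, vocabulary, rng):
    """Replace one symbol (the parent at index 0 or any child) with a random one.

    Assumption: the replacement is drawn from symbols other than the original.
    """
    pos = rng.randrange(len(symbols))
    choices = [s for s in vocabulary if s != symbols[pos]]
    corrupted = list(symbols)
    corrupted[pos] = rng.choice(choices)
    return corrupted

rng = random.Random(0)
vocabulary = ["Decl", "TypeDecl", "BinaryOp", "ID", "Constant", "For", "While"]
positive = ["Decl", "TypeDecl", "BinaryOp"]   # parent followed by its children
negative = corrupt(positive, vocabulary, rng)

# Toy distances: vec(p) versus the coded vector for each sample (made-up numbers).
d = squared_distance([0.2, 0.1], [0.2, 0.3])     # positive sample
d_c = squared_distance([0.2, 0.1], [0.9, -0.4])  # negative sample
loss = margin_loss(d, d_c)                       # margin = 1 as in the text
```

In a real training loop the distances would come from Equation 1 evaluated on the positive and corrupted samples, and the loss would be minimized by gradient descent over the weights and the symbol vectors.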
Coding Layer

Having pretrained the feature vectors for all symbols, we would like to feed them forward to the tree-based convolutional layer for supervised learning. For leaf nodes, they are just the vector representations learned in the pretraining phase. For a non-leaf node $p$, it has two representations: the one learned in the pretraining phase (left-hand side of Equation 1), and the coded one (right-hand side of Equation 1). They are linearly combined before being fed to the convolutional layer. Let $c_1, \cdots, c_n$ be the children of node $p$ and we denote the combined vector as $\mathbf{p}$. We have

$$\mathbf{p} = W_{\mathrm{comb1}} \cdot \mathrm{vec}(p) + W_{\mathrm{comb2}} \cdot \tanh\Big(\sum_i l_i W_{\mathrm{code},i} \cdot \mathrm{vec}(c_i) + b_{\mathrm{code}}\Big)$$

where $W_{\mathrm{comb1}}, W_{\mathrm{comb2}} \in \mathbb{R}^{N_f \times N_f}$ are the parameters for combination. They are initialized as diagonal matrices and then fine-tuned during supervised training.

Tree-based Convolutional Layer

Now that each symbol in ASTs is represented as a distributed, real-valued vector $x \in \mathbb{R}^{N_f}$, we apply a set of fixed-depth feature detectors sliding over the entire tree, depicted in Figure 2a. The subtree feature detectors can be viewed as convolution with a set of finite support kernels. We call this tree-based convolution.

Formally, in a fixed-depth window, if there are $n$ nodes with vector representations $x_1, \cdots, x_n$, then the output of the feature detectors is4

$$y = \tanh\Big(\sum_{i=1}^{n} W_{\mathrm{conv},i} \cdot x_i + b_{\mathrm{conv}}\Big)$$

where $y, b_{\mathrm{conv}} \in \mathbb{R}^{N_c}$, $W_{\mathrm{conv},i} \in \mathbb{R}^{N_c \times N_f}$. ($N_c$ is the number of feature detectors.) 0’s are padded for nodes at the bottom that do not have as many layers as the feature detectors. In our experiments, the kernel depth is set to 2.

4 We used tanh as the activation function in TBCNN mainly because we hope to encode features to a same semantic space $(-1, 1)$ during coding. We are grateful to an anonymous reviewer for reminding us of using ReLU in convolution, and we are happy to try it in future work.

Note that, to deal with varying numbers of children, we also adopt the notion of continuous binary tree. In this scenario, three weight matrices serve as model parameters, namely $W^t_{\mathrm{conv}}$, $W^l_{\mathrm{conv}}$, and $W^r_{\mathrm{conv}}$. $W_{\mathrm{conv},i}$ is a linear combination of these three matrices (explained in detail in the last part of this section).
Dynamic Pooling

After convolution, structural features in an AST are extracted, and a new tree is generated. The new tree has exactly the same shape and size as the original one, which varies among different programs. Therefore, the extracted features cannot be fed directly to a fixed-size neural layer. Dynamic pooling (Socher et al. 2011a) is applied to deal with this problem.

The simplest approach, perhaps, is to pool all features to one vector. We call this one-way pooling. Concretely, the maximum value in each dimension is taken from the features that are detected by tree-based convolution. We also propose an alternative, three-way pooling, where features are pooled to 3 parts, TOP, LOWER LEFT, and LOWER RIGHT, according to their positions in the AST (Figure 2b). As we shall see from the experimental results, the simple one-way pooling works just as well as three-way pooling. Therefore we adopt one-way pooling in our experiments.

After pooling, the features are fully connected to a hidden layer and then fed to the output layer (softmax) for supervised classification. With the dynamic pooling process, structural features along the entire AST reach the output layer with short paths. Hence, they can be trained effectively by back-propagation.

The “Continuous Binary Tree” Model

As stated, one problem of coding and convolving is that we cannot determine the number of weight matrices because AST nodes have different numbers of children.

One possible solution is the continuous bag-of-words model (CBoW, Mikolov et al., 2013),5 but position information will be lost completely. Such an approach is also used in Hermann and Blunsom (2014). Socher et al. (2014) allocate a different weight matrix as parameters for each position; but this method fails to scale up since there will be a huge number of different positions in ASTs.

5 In their original paper, they do not deal with varying-length data, but their method extends naturally to this scenario. Their method is also mathematically equivalent to average pooling.

In our model, we view any subtree as a “binary” tree, regardless of its size and shape. That is, we have only three weight matrices as parameters for convolution, and two for coding. We call it a continuous binary tree.

Take convolution as an example. The three parameter matrices are $W^t_{\mathrm{conv}}$, $W^l_{\mathrm{conv}}$, and $W^r_{\mathrm{conv}}$. (Superscripts $t$, $l$, $r$ refer to “top,” “left,” and “right.”) For node $x_i$ in a window, its weight matrix for convolution $W_{\mathrm{conv},i}$ is a linear combination of $W^t_{\mathrm{conv}}$, $W^l_{\mathrm{conv}}$, and $W^r_{\mathrm{conv}}$, with coefficients $\eta^t_i$, $\eta^l_i$, and $\eta^r_i$, respectively. The coefficients are computed according to the relative position of a node in the sliding window. Figure 2c is an analogy to the continuous binary tree model. The equations for computing $\eta$’s are listed as follows.

• $\eta^t_i = \frac{d_i - 1}{d - 1}$ ($d_i$: the depth of the node $i$ in the sliding window; $d$: the depth of the window.)
• $\eta^r_i = (1 - \eta^t_i)\,\frac{p_i - 1}{n - 1}$. ($p_i$: the position of the node; $n$: the total number of $p$’s siblings.)
• $\eta^l_i = (1 - \eta^t_i)(1 - \eta^r_i)$

Likewise, the continuous binary tree for coding has two weight matrices $W^l_{\mathrm{code}}$ and $W^r_{\mathrm{code}}$ as parameters. The details are not repeated here.

To sum up, the entire parameter set for TBCNN is $\Theta = \{W^l_{\mathrm{code}}, W^r_{\mathrm{code}}, W_{\mathrm{comb1}}, W_{\mathrm{comb2}}, W^t_{\mathrm{conv}}, W^l_{\mathrm{conv}}, W^r_{\mathrm{conv}}, W_{\mathrm{hid}}, W_{\mathrm{out}}, b_{\mathrm{code}}, b_{\mathrm{conv}}, b_{\mathrm{hid}}, b_{\mathrm{out}}, \mathrm{vec}(\cdot)\}$, where $W_{\mathrm{hid}}$, $W_{\mathrm{out}}$, $b_{\mathrm{hid}}$, and $b_{\mathrm{out}}$ are the weights and biases for the hidden and output layers. To set up supervised training, $W^l_{\mathrm{code}}$, $W^r_{\mathrm{code}}$, $b_{\mathrm{code}}$, and $\mathrm{vec}(\cdot)$ are derived from the pretraining phase; $W_{\mathrm{comb1}}$ and $W_{\mathrm{comb2}}$ are initialized as diagonal matrices; other parameters are initialized randomly. We apply the cross-entropy loss and use stochastic gradient descent, computed by back-propagation.

Experiments

We first assess the learned vector representations both qualitatively and quantitatively. Then we evaluate TBCNN in two supervised learning tasks, and conduct model analysis.

The dataset of our experiments comes from a pedagogical programming open judge (OJ) system.6 There are a large number of programming problems on the OJ system. Students submit their source code as the solution to a certain problem; the OJ system automatically judges the validity of submitted source code by running the program. We downloaded the source code and the corresponding programming problems (represented as IDs) as our dataset.

6 http://programming.grids.cn

Unsupervised Program Vector Representations

We applied the coding criterion of pretraining to all C code in the OJ system, and obtained AST nodes’ vector representations.

Qualitative analysis. Figure 3a illustrates the hierarchical clustering result based on a subset of AST nodes. As demonstrated, the symbols mainly fall into three categories: (1) BinaryOp, ArrayRef, ID, Constant are grouped together since they are related to data reference/manipulation; (2) For, If, While are similar since they are related to control flow; (3) ArrayDecl, FuncDecl, PtrDecl are similar since they are declarations. The result is quite sensible because it is consistent with human understanding of programs.

Quantitative analysis. We also evaluated pretraining’s effect on supervised learning by feeding the learned representations to a program classification task. (See next subsection.) Figure 3b plots the learning curves of both training and validation, which are compared with random initialization. Unsupervised vector representation learning accelerates the supervised training process by nearly 1/3, showing that pretraining does capture underlying features of AST nodes, and that high-level features can emerge from them spontaneously during supervised learning. However, pretraining has a limited effect on the final accuracy. One plausible explanation is that the number of AST nodes is small: the pycparser that we use distinguishes only 44 symbols. Hence, their representations can be adequately tuned in a supervised fashion.
Figure 3: Analysis of vector representations. (a) Hierarchical clustering based on AST nodes’ vector representations. (b) Learning curves with and without pretraining.

Nonetheless, we think the pretraining criterion is effective and beneficial for TBCNN, because training deep neural networks is usually time-consuming, especially when tuning hyperparameters. The pretrained vector representations are used throughout the experiments below.

Classifying Programs by Functionalities

Task description In software engineering, classifying programs by functionalities is an important problem for various software development tasks. For example, in a large software repository (e.g., SourceForge), software products are usually organized into categories, a typical criterion for which is by functionalities. With program classification, it becomes feasible to automatically tag a software component newly added into the repository, which is beneficial for software reuse during the development process.

In our experiment, we applied TBCNN to classify source code in the OJ system. The target label of a data sample is one of 104 programming problems (represented as an ID). That is, programs with the same target label have the same functionality. We randomly chose exactly 500 programs in each class, and thus 52,000 samples in total, which were further randomly split by 3:1:1 for training, validation, and testing. Relevant statistics are shown in Table 1.

Table 1: Statistics of our dataset.

Statistics                             Mean     Sample std.
# of code lines                        36.3     19.0
# of AST nodes                         189.6    106.0
Average leaf nodes’ depth in an AST    7.6      1.7
Max depth of an AST                    12.3     3.2

Table 2: TBCNN’s hyperparameters.

Hyperparameter                Value    How is the value chosen?
Initial learning rate         0.3      By validation
Learning rate decay           None     Empirically
Embedding dimension           30       Empirically
Convolutional layers’ dim.    600      By validation
Penultimate layer’s dim.      600      Same as conv layers
$\ell_2$ penalty              None     Empirically

Hyperparameters TBCNN’s hyperparameters are shown in Table 2. Our competing methods include SVM and a deep feed-forward neural network based on hand-crafted features, namely bag-of-words (BoW, the counting of each symbol) or bag-of-tree (BoT, the counting of 2-layer subtrees). We also compare our model with the recursive neural network (RNN, Socher et al. 2011b). Hyperparameters for baselines are listed as follows.

SVM. The linear SVM has one hyperparameter $C$; RBF SVM has two, $C$ and $\gamma$. They are tuned by validation over the set $\{\cdots, 1, 0.3, 0.1, 0.03, \cdots\}$ with grid search.

DNN. We applied a 4-layer DNN (including input) empirically. The hidden layers’ dimension is 300, chosen from $\{100, 300, 1000\}$; learning rates are 0.003 for BoW and 0.03 for BoT, chosen from $\{0.003, \cdots, 0.3\}$ with granularity 3x. The $\ell_2$ regularization coefficient is $10^{-6}$ for both BoW and BoT, chosen from $\{10^{-7}, \cdots, 10^{-4}\}$ with granularity 10x, and also no regularization.

RNN. Recursive units are 600-dimensional, as in our method. The learning rate is chosen from the set $\{\cdots, 1.0, 0.3, 0.1, \cdots\}$, and 0.3 yields the highest validation performance.

Table 3: The accuracy of 104-label program classifications.

Group                  Method            Test Accuracy (%)
Surface features       linear SVM+BoW    52.0
                       RBF SVM+BoW       83.9
                       linear SVM+BoT    72.5
                       RBF SVM+BoT       88.2
NN-based approaches    DNN+BoW           76.0
                       DNN+BoT           89.7
                       Vector avg.       53.2
                       RNN               84.8
Our method             TBCNN             94.0

Results Table 3 presents the results in the 104-label program classification experiment. Using SVM with surface features does distinguish different programs to some extent; for example, a program about string manipulation is different from, say, matrix operation; also, a difficult programming problem necessitates a more complex program, and thus more lines of code and AST nodes. However, their performance is comparatively low.

We tried deep feed-forward neural networks on these features, and achieved accuracies of 76.0–89.7%, comparable to SVMs. Vector averaging with softmax, another neural network-based competing method applied in NLP (Socher et al. 2013; Kalchbrenner, Grefenstette, and Blunsom 2014), yields an accuracy similar to a linear classifier built on BoW features. This is probably because the number of AST symbols is far fewer than words in natural languages, and thus the vector representations (provided non-singular) can be absorbed into the classifier’s weights. Comparing these approaches with our method, we deem TBCNN’s performance boost is not merely caused by using a better classifier (neural networks versus SVM, say), but also the feature/representation learning nature, which enables automatic structural feature extraction.

7 We do not use the pretrained vector representations, which are inimical to RNN: the weight $W_{\mathrm{code}}$ codes children’s representation to its candidate parent’s; adversely, the high-level nodes in programs (e.g., a function definition) are typically non-informative.

We also applied RNN to the program classification task7; the RNN’s accuracy is lower than shallow methods
Classifier Features Accuracy
Rand/majority – 50.0
RBF SVM Bag-of-words 62.3
RBF SVM Bag-of-trees 77.1
TBCNN Learned 89.1

Table 4: Accuracy of detecting bubble sort (in percentage).

Model Variant Validation Acc. Figure 4: Validation accuracy versus the number of convo-
Coding layer → None 92.3 lution units.
1-way pooling → 3-way 94.3
Continuous binary tree → CBoW 93.1
TBCNN with the best gadgets 94.4 62.3%. Bag-of-trees features are better, and achieve 77.06%.
Our model outperforms these methods by more than 10%.
Table 5: Effect of coding, pooling, and the continuous binary This experiment also suggests that neural networks can learn
tree. more robust features than just counting surface statistics.

(SVM+BoT). Taking into consideration experiments in NLP


Model Analysis
(Socher et al. 2011b; Socher et al. 2013), we observe a We now analyze each gadget of TBCNN quantitatively, with
degradation of RNN’s performance if the tree structure is the 104-label program classification as our testbed. We re-
large. port validation accuracies throughout this part.
TBCNN outperforms the above methods, yielding an ac-
curacy of 94%. By exploring tree-based convolution, our Effect of coding layer In the proposed TBCNN model for
model is better at capturing programs’ structural features, program analysis, we represent a non-leaf node by combin-
which is important for program analysis. ing its coded representation and its pretrained one. We find
that, the underneath coding layer can also integrate global
Detecting Bubble Sort

Task description To further evaluate our TBCNN model in a more realistic SE scenario, we used it to detect an unhealthy code pattern, bubble sort, which can also be regarded as a (binary) program classification task. Detecting source code of certain patterns is closely related to many SE problems. In this experiment, bubble sort is thought of as unhealthy code because it implements an inefficient algorithm. By identifying such unhealthy code, project managers can refine the implementations during the maintenance process.

Before the experiment, a volunteer⁸ annotated, from the OJ system, 109 programs that contain bubble sort, and 109 programs that do not contain bubble sort. They were split 1:1 for validation and testing.

Data augmentation To train our TBCNN model, a dataset of such scale is insufficient. We propose a simple yet useful data augmentation technique for programs. Concretely, we used the source code of 4k programs in the OJ system as the non-bubble sort class. For each program, we randomly substituted a fragment of program statements with a pre-written bubble sort snippet. Thus we had 8k data samples in total.

Results We tested our model on the annotated real-world programs. Note that the test samples were written by real-world programmers, and thus the styles and forms of bubble sort snippets may differ from the training set, for example, sorting an integer array versus sorting a user-defined structure, and sorting an array versus sorting two arrays simultaneously. As we see in Table 4, bag-of-words features are not illuminating in this classification and yield a low accuracy of

⁸ The volunteer has neither authorship nor a conflict of interest.

information in addition to merely averaging two homogeneous sources. If we build a tree-based convolutional layer directly on the pretrained vector representations, all structural features are “local,” that is, confined in the convolution window. Failing to integrate global information leads to a 2% degradation in performance. (See the first and last rows in Table 5.)

Layers’ dimensions In our experiments, AST nodes’ vector representations are set to be 30-dimensional empirically. We chose this small value because AST nodes have only 44 different symbols. Hence, the dimension needs to be, intuitively, smaller than words’ vector representations, e.g., 300 in Mou et al. (2015). The dimension of convolution, i.e., the number of feature detectors, was chosen by validation (Figure 4). We tried several configurations, among which 600-dimensional convolution resulted in the highest validation accuracy. This analysis also verifies that programs have rich structural information, even though the number of AST symbols is not large. Because rich semantics emerge from different combinations of AST symbols, we need more feature detectors, that is, a larger convolutional layer.

Effect of pooling layer We tried two pooling methods in our TBCNN model, and compared them in Table 5 (the second and last rows). 3-way pooling was proposed in the hope of preserving features from different parts of the tree. However, as indicated by the experimental result, the simple 1-way pooling works just as well (even 0.1% higher on the validation set). This suggests that TBCNN is not sensitive to pooling methods, which mainly serve to pack data of varying sizes and shapes into a fixed-size representation. Further development can be addressed in future work.
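To make the role of pooling concrete, the following is a minimal sketch, not the authors' implementation. It assumes that pooling takes the maximum in each feature dimension and that each tree node already carries a feature vector from the convolutional layer. The function names and the region labels used for the 3-way variant are hypothetical; how a tree is actually divided into three parts is not specified in this section.

```python
from typing import List, Sequence


def one_way_max_pool(node_features: Sequence[Sequence[float]]) -> List[float]:
    """Pool all nodes of a tree into one vector (max per feature dimension)."""
    if not node_features:
        raise ValueError("a tree must contain at least one node")
    dim = len(node_features[0])
    return [max(node[d] for node in node_features) for d in range(dim)]


def three_way_max_pool(
    node_features: Sequence[Sequence[float]],
    regions: Sequence[int],
) -> List[float]:
    """Pool each of three tree regions separately and concatenate the results.

    regions[i] is a hypothetical label in {0, 1, 2} saying which part of the
    tree node i belongs to; an empty region contributes zeros (an assumption).
    """
    if len(node_features) != len(regions):
        raise ValueError("one region label is required per node")
    dim = len(node_features[0])
    pooled: List[float] = []
    for region in range(3):
        members = [f for f, r in zip(node_features, regions) if r == region]
        pooled.extend(one_way_max_pool(members) if members else [0.0] * dim)
    return pooled


# Two trees with different numbers of nodes, two features per node.
small_tree = [[0.1, 0.9], [0.5, 0.2]]
large_tree = [[0.1, 0.9], [0.5, 0.2], [0.7, -1.0], [0.0, 0.3]]

small_vec = one_way_max_pool(small_tree)
large_vec = one_way_max_pool(large_tree)
three_vec = three_way_max_pool(large_tree, regions=[0, 1, 2, 1])
```

Both trees map to a vector of the same length under 1-way pooling regardless of node count, which is the packing property the paragraph above refers to. The 3-way variant triples the output length in exchange for keeping coarse positional information.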
Effect of continuous binary tree The continuous binary tree is introduced to treat nodes with different numbers of children, as well as to capture order information of child nodes. We also implemented the continuous bag-of-words (CBoW) model, where child nodes’ representations are averaged before convolution. Rows 4 and 5 in Table 5 compare our proposed continuous binary tree and the above alternative. The result shows a 1.3% boost from considering child nodes’ order information.

Conclusion

In this paper, we applied deep neural networks to the field of programming language processing. Due to the rich and explicit tree structures of programs, we proposed the novel Tree-Based Convolutional Neural Network (TBCNN). In our model, program vector representations are learned by the coding criterion; structural features are detected by the convolutional layer; the continuous binary tree and dynamic pooling enable our model to cope with trees of varying sizes and shapes. Experimental results show the superiority of our model over baseline methods.

Acknowledgments

We would like to thank anonymous reviewers for insightful comments; we also thank Xiaowei Sun for annotating bubble sort programs, Yuxuan Liu for data processing, and Weiru Liu for discussion on the manuscript. This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant Nos. 61421091, 61232015, 61225007, 91318301, and 61502014.

References

[2006] Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2006. Greedy layer-wise training of deep networks. In NIPS.
[2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.
[1994] Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5(2):157–166.
[2013] Bettenburg, N., and Begel, A. 2013. Deciphering the story of software development through frequent pattern mining. In ICSE, 1197–1200.
[2009] Chilowicz, M.; Duris, E.; and Roussel, G. 2009. Syntax tree fingerprinting for source code similarity detection. In Proc. IEEE Int. Conf. Program Comprehension, 243–247.
[2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
[2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. JMLR 12:2493–2537.
[2010] Dahl, G.; Mohamed, A.; and Hinton, G. 2010. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS.
[2009] Dietz, L.; Dallmeier, V.; Zeller, A.; and Scheffer, T. 2009. Localizing bugs in program executions with graphical models. In NIPS.
[2015] Duvenaud, D.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Gómez-Bombarelli, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. 2015. Convolutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292.
[2012] Ghabi, A., and Egyed, A. 2012. Code patterns for automatically validating requirements-to-code traces. In ASE, 200–209.
[2013] Hao, D.; Lan, T.; Zhang, H.; Guo, C.; and Zhang, L. 2013. Is this a bug or an obsolete test? In Proc. ECOOP, 602–628.
[2014] Hermann, K., and Blunsom, P. 2014. Multilingual models for compositional distributed semantics. In ACL, 58–68.
[2012] Hindle, A.; Barr, E.; Su, Z.; Gabel, M.; and Devanbu, P. 2012. On the naturalness of software. In ICSE, 837–847.
[2006] Hinton, G.; Osindero, S.; and Teh, Y. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554.
[2014] Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. In ACL, 655–665.
[2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
[1995] LeCun, Y.; Jackel, L.; Bottou, L.; Brunot, A.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Muller, U.; and Sackinger, E. 1995. Comparison of learning algorithms for handwritten digit recognition. In Proc. Int. Conf. Artificial Neural Networks.
[2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
[2015] Mou, L.; Peng, H.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2015. Discriminating neural sentence modeling by tree-based convolution. In EMNLP, 2315–2325.
[2001] Pane, J.; Ratanamahatana, C.; and Myers, B. 2001. Studying the language and structure in non-programmers’ solutions to programming problems. Int. J. Human-Computer Studies 54(2):237–264.
[2015] Peng, H.; Mou, L.; Li, G.; Liu, Y.; Zhang, L.; and Jin, Z. 2015. Building program vector representations for deep learning. In Proc. 8th Int. Conf. Knowledge Science, Engineering and Management, 547–553.
[2015] Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015. Learning program embeddings to propagate feedback on student code. In ICML.
[1994] Pinker, S. 1994. The Language Instinct: The New Science of Language and Mind. Penguin Press.
[2011a] Socher, R.; Huang, E.; Pennington, J.; Manning, C.; and Ng, A. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.
[2011b] Socher, R.; Pennington, J.; Huang, E.; Ng, A.; and Manning, C. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, 151–161.
[2013] Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631–1642.
[2014] Socher, R.; Karpathy, A.; Le, Q.; Manning, C.; and Ng, A. Y. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL 2:207–218.
[2013] Steidl, D., and Gode, N. 2013. Feature-based detection of bugs in clones. In 7th Int. Workshop on Software Clones, 76–82.
[2014] Zaremba, W., and Sutskever, I. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.
