Convolutional Neural Networks over Tree Structures for Programming Language Processing
[Figure 1 panel labels. (a) AST node labels shown: IdentifierType, ID, Constant. (b) Pipeline stages: Vector representation and coding, Tree-based convolution, Dynamic pooling, Hidden, Output.]
Figure 1: (a) Illustration of an AST, corresponding to the C code snippet “int a=b+3;”. Note that our model takes as
input the entire AST of a program, which is typically much larger. (b) The architecture of the Tree-Based Convolutional Neural
Network (TBCNN). The main components in our model include vector representation and coding, tree-based convolution and
dynamic pooling; then a fully-connected hidden layer and an output layer (softmax) are added.
Figure 2: (a) Tree-based convolution. Nodes on the left are the feature vectors of AST nodes. They are either pretrained or combined with pretrained and coded vectors. (b) An illustration of 3-way pooling. (c) An analogy to the continuous binary tree model. In the triangle, the color of a pixel is a combination of three primary colors; in the convolution process, the weight for a node is a combination of three weight parameters, namely $W^t_{\text{conv}}$, $W^l_{\text{conv}}$, and $W^r_{\text{conv}}$.

To prevent the pretraining algorithm from learning trivial representations (e.g., 0's will give 0 distance but are meaningless), negative sampling is applied like Collobert et al. (2011). For each pretraining data sample $p, c_1, \cdots, c_n$, we substitute one symbol (either $p$ or one of the $c$'s) with a random symbol. The distance of the negative sample is denoted as $d_c$, which should be at least larger than that of the positive training sample plus a margin $\Delta$ (set to 1 in our experiment). Thus, the pretraining objective is to

$$\underset{W^l_{\text{code}},\, W^r_{\text{code}},\, b_{\text{code}},\, \text{vec}(\cdot)}{\text{minimize}} \;\; \max\{0,\ \Delta + d - d_c\}$$
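As a minimal sketch of the margin objective above, the hinge term $\max\{0, \Delta + d - d_c\}$ and the construction of a negative sample can be written in plain Python. The function names, the list layout `[p, c1, ..., cn]`, and the toy vocabulary are illustrative assumptions rather than the authors' implementation; the distances $d$ and $d_c$ are taken as given inputs.

```python
import random


def corrupt_sample(symbols, vocabulary, rng):
    """Build a negative sample by replacing one symbol with a random one.

    `symbols` is assumed to be [p, c1, ..., cn]; this layout is hypothetical.
    The random draw is not forced to differ from the symbol it replaces.
    """
    corrupted = list(symbols)
    position = rng.randrange(len(corrupted))  # either p or one of the c's
    corrupted[position] = rng.choice(vocabulary)
    return corrupted


def pretraining_hinge_loss(d_pos, d_neg, margin=1.0):
    """Return max{0, margin + d_pos - d_neg}; the margin is 1 in the experiment."""
    return max(0.0, margin + d_pos - d_neg)


rng = random.Random(0)
original = ["BinaryOp", "ID", "Constant"]
negative = corrupt_sample(original, ["For", "If", "Decl"], rng)

loss_violated = pretraining_hinge_loss(d_pos=1.0, d_neg=1.5)   # margin not met
loss_satisfied = pretraining_hinge_loss(d_pos=0.5, d_neg=2.0)  # margin met
```

The loss is zero once the negative sample's distance exceeds the positive sample's distance by at least the margin, so only violating pairs contribute a gradient.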
Coding Layer
Having pretrained the feature vectors for all symbols, we would like to feed them forward to the tree-based convolutional layer for supervised learning. For leaf nodes, they are just the vector representations learned in the pretraining phase. For a non-leaf node $p$, it has two representations: the one learned in the pretraining phase (left-hand side of Equation 1), and the coded one (right-hand side of Equation 1). They are linearly combined before being fed to the convolutional layer. Let $c_1, \cdots, c_n$ be the children of node $p$ and we denote the combined vector as $\boldsymbol{p}$. We have

$$\boldsymbol{p} = W_{\text{comb1}} \cdot \text{vec}(p) + W_{\text{comb2}} \cdot \tanh\left( \sum_i l_i W_{\text{code},i} \cdot \text{vec}(x_i) + b_{\text{code}} \right)$$

where $W_{\text{comb1}}, W_{\text{comb2}} \in \mathbb{R}^{N_f \times N_f}$ are the parameters for combination. They are initialized as diagonal matrices and then fine-tuned during supervised training.

Tree-based Convolutional Layer
Now that each symbol in ASTs is represented as a distributed, real-valued vector $x \in \mathbb{R}^{N_f}$, we apply a set of fixed-depth feature detectors sliding over the entire tree, depicted in Figure 2a. The subtree feature detectors can be viewed as convolution with a set of finite support kernels. We call this tree-based convolution.

Formally, in a fixed-depth window, if there are $n$ nodes with vector representations $x_1, \cdots, x_n$, then the output of the feature detectors is⁴

$$y = \tanh\left( \sum_{i=1}^{n} W_{\text{conv},i} \cdot x_i + b_{\text{conv}} \right)$$

where $y, b_{\text{conv}} \in \mathbb{R}^{N_c}$, $W_{\text{conv},i} \in \mathbb{R}^{N_c \times N_f}$. ($N_c$ is the number of feature detectors.) 0's are padded for nodes at the bottom that do not have as many layers as the feature detectors. In our experiments, the kernel depth is set to 2.

Note that, to deal with varying numbers of children, we also adopt the notion of continuous binary tree. In this scenario, three weight matrices serve as model parameters, namely $W^t_{\text{conv}}$, $W^l_{\text{conv}}$, and $W^r_{\text{conv}}$. $W_{\text{conv},i}$ is a linear combination of these three matrices (explained in detail in the last part of this section).

⁴ We used tanh as the activation function in TBCNN mainly because we hope to encode features to a same semantic space (−1, 1) during coding. We are grateful to an anonymous reviewer for reminding us of using ReLU in convolution, and we are happy to try it in future work.
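The window output $y = \tanh(\sum_i W_{\text{conv},i} \cdot x_i + b_{\text{conv}})$ can be illustrated with a small, dependency-free sketch. It assumes the per-node matrices $W_{\text{conv},i}$ are already available as nested lists; the helper names `matvec` and `conv_window` are hypothetical, and zero padding and the sliding over the tree are left out.

```python
import math


def matvec(W, x):
    """Multiply an (N_c x N_f) matrix, given as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def conv_window(weights, vectors, bias):
    """Compute tanh(sum_i W_i x_i + b) for the n nodes inside one window.

    weights: one (N_c x N_f) matrix per node in the window
    vectors: one N_f-dimensional feature vector per node
    bias:    N_c-dimensional bias vector
    """
    acc = list(bias)
    for W, x in zip(weights, vectors):
        for k, value in enumerate(matvec(W, x)):
            acc[k] += value
    return [math.tanh(a) for a in acc]


# Toy window with two nodes, N_f = 2 input features and N_c = 1 detector.
weights = [[[1.0, 0.0]], [[0.0, 1.0]]]
vectors = [[0.5, 9.0], [9.0, -0.5]]
y_zero = conv_window(weights, vectors, bias=[0.0])  # 0.5 + (-0.5) = 0 before tanh
y_bias = conv_window(weights, vectors, bias=[1.0])  # pre-activation equals 1.0
```

In the toy window the two contributions cancel, so the first output is tanh(0) = 0 and the second depends only on the bias.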
Dynamic Pooling
After convolution, structural features in an AST are extracted, and a new tree is generated. The new tree has exactly the same shape and size as the original one, which varies among different programs. Therefore, the extracted features cannot be fed directly to a fixed-size neural layer. Dynamic pooling (Socher et al. 2011a) is applied to deal with this problem.

The simplest approach, perhaps, is to pool all features to one vector. We call this one-way pooling. Concretely, the maximum value in each dimension is taken from the features that are detected by tree-based convolution. We also propose an alternative, three-way pooling, where features are pooled to 3 parts, TOP, LOWER LEFT, and LOWER RIGHT, according to their positions in the AST (Figure 2b). As we shall see from the experimental results, the simple one-way pooling works just as well as three-way pooling. Therefore we adopt one-way pooling in our experiments.

After pooling, the features are fully connected to a hidden layer and then fed to the output layer (softmax) for supervised classification. With the dynamic pooling process, structural features along the entire AST reach the output layer with short paths. Hence, they can be trained effectively by back-propagation.

The “Continuous Binary Tree” Model
As stated, one problem of coding and convolving is that we cannot determine the number of weight matrices because AST nodes have different numbers of children.

One possible solution is the continuous bag-of-words model (CBoW, Mikolov et al., 2013),⁵ but position information will be lost completely. Such an approach is also used in Hermann and Blunsom (2014). Socher et al. (2014) allocate a different weight matrix as parameters for each position; but this method fails to scale up since there will be a huge number of different positions in ASTs.

In our model, we view any subtree as a “binary” tree, regardless of its size and shape. That is, we have only three weight matrices as parameters for convolution, and two for coding. We call it a continuous binary tree.

Take convolution as an example. The three parameter matrices are $W^t_{\text{conv}}$, $W^l_{\text{conv}}$, and $W^r_{\text{conv}}$. (Superscripts t, l, r refer to “top,” “left,” and “right.”) For node $x_i$ in a window, its weight matrix for convolution $W_{\text{conv},i}$ is a linear combination of $W^t_{\text{conv}}$, $W^l_{\text{conv}}$, and $W^r_{\text{conv}}$, with coefficients $\eta^t_i$, $\eta^l_i$, and $\eta^r_i$, respectively. The coefficients are computed according to the relative position of a node in the sliding window. Figure 2c is an analogy to the continuous binary tree model. The equations for computing the η's are listed as follows.
• $\eta^t_i = \frac{d_i - 1}{d - 1}$ ($d_i$: the depth of the node $i$ in the sliding window; $d$: the depth of the window.)
• $\eta^r_i = (1 - \eta^t_i)\,\frac{p_i - 1}{n - 1}$. ($p_i$: the position of the node; $n$: the total number of p's siblings.)
• $\eta^l_i = (1 - \eta^t_i)(1 - \eta^r_i)$

Likewise, the continuous binary tree for coding has two weight matrices $W^l_{\text{code}}$ and $W^r_{\text{code}}$ as parameters. The details are not repeated here.

To sum up, the entire parameter set for TBCNN is $\Theta = \{W^l_{\text{code}}, W^r_{\text{code}}, W_{\text{comb1}}, W_{\text{comb2}}, W^t_{\text{conv}}, W^l_{\text{conv}}, W^r_{\text{conv}}, W_{\text{hid}}, W_{\text{out}}, b_{\text{code}}, b_{\text{conv}}, b_{\text{hid}}, b_{\text{out}}, \text{vec}(\cdot)\}$, where $W_{\text{hid}}$, $W_{\text{out}}$, $b_{\text{hid}}$, and $b_{\text{out}}$ are the weights and biases for the hidden and output layers. To set up supervised training, $W^l_{\text{code}}$, $W^r_{\text{code}}$, $b_{\text{code}}$, and $\text{vec}(\cdot)$ are derived from the pretraining phase; $W_{\text{comb1}}$ and $W_{\text{comb2}}$ are initialized as diagonal matrices; other parameters are initialized randomly. We apply the cross-entropy loss and use stochastic gradient descent, computed by back-propagation.

Experiments
We first assess the learned vector representations both qualitatively and quantitatively. Then we evaluate TBCNN in two supervised learning tasks, and conduct model analysis.

The dataset of our experiments comes from a pedagogical programming open judge (OJ) system.⁶ There are a large number of programming problems on the OJ system. Students submit their source code as the solution to a certain problem; the OJ system automatically judges the validity of submitted source code by running the program. We downloaded the source code and the corresponding programming problems (represented as IDs) as our dataset.

Unsupervised Program Vector Representations
We applied the coding criterion of pretraining to all C code in the OJ system, and obtained AST nodes' vector representations.

Qualitative analysis. Figure 3a illustrates the hierarchical clustering result based on a subset of AST nodes. As demonstrated, the symbols mainly fall into three categories: (1) BinaryOp, ArrayRef, ID, Constant are grouped together since they are related to data reference/manipulation; (2) For, If, While are similar since they are related to control flow; (3) ArrayDecl, FuncDecl, PtrDecl are similar since they are declarations. The result is quite sensible because it is consistent with human understanding of programs.

Quantitative analysis. We also evaluated pretraining's effect on supervised learning by feeding the learned representations to a program classification task. (See next subsection.) Figure 3b plots the learning curves of both training and validation, which are compared with random initialization. Unsupervised vector representation learning accelerates the supervised training process by nearly 1/3, showing that pretraining does capture underlying features of AST nodes, and that high-level features can emerge from them spontaneously during supervised learning. However, pretraining has a limited effect on the final accuracy. One plausible explanation is that the number of distinct AST symbols is small: pycparser, which we use, distinguishes only 44 symbols. Hence, their representations can be adequately tuned in a supervised fashion.

⁵ In their original paper, they do not deal with varying-length data, but their method extends naturally to this scenario. Their method is also mathematically equivalent to average pooling.
⁶ http://programming.grids.cn
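The η equations of the continuous binary tree model can be sketched as follows. The code transcribes the three formulas as written and then forms $W_{\text{conv},i}$ as the stated linear combination. Two points are assumptions not fixed by the text: `n` is read as the number of nodes at the same sibling level (so `p_i` runs from 1 to `n`), and a node without siblings (`n == 1`) is given a position ratio of 0.5 to avoid division by zero. The function names are hypothetical.

```python
def eta_coefficients(d_i, d, p_i, n):
    """Return (eta_t, eta_l, eta_r) for one node in a sliding window.

    d_i: depth of the node in the window; d: depth of the window (d >= 2).
    p_i: 1-based position of the node among its siblings; n: sibling count.
    """
    eta_t = (d_i - 1) / (d - 1)
    # Assumption (not in the text): a lone node uses ratio 0.5, because the
    # formula (p_i - 1) / (n - 1) is undefined for n == 1.
    ratio = 0.5 if n == 1 else (p_i - 1) / (n - 1)
    eta_r = (1 - eta_t) * ratio
    eta_l = (1 - eta_t) * (1 - eta_r)
    return eta_t, eta_l, eta_r


def combine_weights(W_t, W_l, W_r, etas):
    """Form eta_t * W_t + eta_l * W_l + eta_r * W_r element-wise."""
    eta_t, eta_l, eta_r = etas
    return [
        [eta_t * t + eta_l * l + eta_r * r for t, l, r in zip(row_t, row_l, row_r)]
        for row_t, row_l, row_r in zip(W_t, W_l, W_r)
    ]


# Three siblings at depth d_i = 1 in a window of depth d = 2.
leftmost = eta_coefficients(d_i=1, d=2, p_i=1, n=3)
middle = eta_coefficients(d_i=1, d=2, p_i=2, n=3)
rightmost = eta_coefficients(d_i=1, d=2, p_i=3, n=3)

# 1x1 toy matrices make the blend easy to read: 0.5 * 2.0 + 0.5 * 4.0.
W_middle = combine_weights([[1.0]], [[2.0]], [[4.0]], middle)
```

With these inputs the leftmost sibling uses only the "left" matrix, the rightmost only the "right" matrix, and the middle sibling an equal blend, which is the position-dependent interpolation the model relies on instead of one matrix per child position.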
Figure 3: Analysis of vector representations. (a) Hierarchical clustering based on AST nodes' vector representations. (b) Learning curves with and without pretraining.

Statistics                             Mean    Sample std.
# of code lines                        36.3    19.0
# of AST nodes                         189.6   106.0
Average leaf nodes' depth in an AST    7.6     1.7
Max depth of an AST                    12.3    3.2

Table 1: Statistics of our dataset.

Hyperparameter               Value   How is the value chosen?
Initial learning rate        0.3     By validation
Learning rate decay          None    Empirically
Embedding dimension          30      Empirically
Convolutional layers' dim.   600     By validation
Penultimate layer's dim.     600     Same as conv layers
l2 penalty                   None    Empirically

Table 2: TBCNN's hyperparameters.

Group                 Method           Test Accuracy (%)
Surface features      linear SVM+BoW   52.0
Surface features      RBF SVM+BoW      83.9
Surface features      linear SVM+BoT   72.5
Surface features      RBF SVM+BoT      88.2
NN-based approaches   DNN+BoW          76.0
NN-based approaches   DNN+BoT          89.7
NN-based approaches   Vector avg.      53.2
NN-based approaches   RNN              84.8
Our method            TBCNN            94.0

Table 3: The accuracy of 104-label program classifications.

Nonetheless, we think the pretraining criterion is effective and beneficial for TBCNN, because training deep neural networks is usually time-consuming, especially when tuning hyperparameters. The pretrained vector representations are used throughout the experiments below.

Classifying Programs by Functionalities
Task description In software engineering, classifying programs by functionalities is an important problem for various software development tasks. For example, in a large software repository (e.g., SourceForge), software products are usually organized into categories, a typical criterion for which is by functionalities. With program classification, it becomes feasible to automatically tag a software component newly added into the repository, which is beneficial for software reuse during the development process.

In our experiment, we applied TBCNN to classify source code in the OJ system. The target label of a data sample is one of 104 programming problems (represented as an ID). That is, programs with the same target label have the same functionality. We randomly chose exactly 500 programs in each class, and thus 52,000 samples in total, which were further randomly split by 3:1:1 for training, validation, and testing. Relevant statistics are shown in Table 1.

Hyperparameters TBCNN's hyperparameters are shown in Table 2. Our competing methods include SVM and a deep feed-forward neural network based on hand-crafted features, namely bag-of-words (BoW, the counting of each symbol) or bag-of-tree (BoT, the counting of 2-layer subtrees). We also compare our model with the recursive neural network (RNN, Socher et al. 2011b). Hyperparameters for baselines are listed as follows.

SVM. The linear SVM has one hyperparameter C; RBF SVM has two, C and γ. They are tuned by validation over the set {· · · , 1, 0.3, 0.1, 0.03, · · · } with grid search.

DNN. We applied a 4-layer DNN (including input) empirically. The hidden layers' dimension is 300, chosen from {100, 300, 1000}; learning rates are 0.003 for BoW and 0.03 for BoT, chosen from {0.003, · · · , 0.3} with granularity 3x. The ℓ2 regularization coefficient is $10^{-6}$ for both BoW and BoT, chosen from {$10^{-7}$, · · · , $10^{-4}$} with granularity 10x, and also no regularization.

RNN. Recursive units are 600-dimensional, as in our method. The learning rate is chosen from the set {· · · 1.0, 0.3, 0.1 · · · }, and 0.3 yields the highest validation performance.

Results Table 3 presents the results in the 104-label program classification experiment. Using SVM with surface features does distinguish different programs to some extent: for example, a program about string manipulation is different from, say, matrix operation; also, a difficult programming problem necessitates a more complex program, and thus more lines of code and AST nodes. However, their performance is comparatively low.

We tried deep feed-forward neural networks on these features, and achieved accuracies of 76.0–89.7%, comparable to SVMs. Vector averaging with softmax, another neural network-based competing method applied in NLP (Socher et al. 2013; Kalchbrenner, Grefenstette, and Blunsom 2014), yields an accuracy similar to a linear classifier built on BoW features. This is probably because the number of AST symbols is far smaller than the number of words in natural languages, and thus the vector representations (provided non-singular) can be absorbed into the classifier's weights. Comparing these approaches with our method, we deem that TBCNN's performance boost is not merely caused by using a better classifier (neural networks versus SVM, say), but also by the feature/representation learning nature, which enables automatic structural feature extraction.

We also applied RNN to the program classification task⁷; the RNN's accuracy is lower than shallow methods

⁷ We do not use the pretrained vector representations, which are inimical to RNN: the weight $W_{\text{code}}$ codes children's representation to its candidate parent's; adversely, the high-level nodes in programs (e.g., a function definition) are typically non-informative.
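The dataset arithmetic in the task description (104 classes with 500 programs each, split 3:1:1) can be checked with a short sketch. The text only says the split is random, so the plain shuffle below, the fixed seed, and the function name `split_indices` are assumptions for illustration; sample indices stand in for the actual programs.

```python
import random


def split_indices(num_samples, ratios=(3, 1, 1), seed=0):
    """Randomly split sample indices into train/validation/test by `ratios`."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = num_samples * ratios[0] // total
    n_valid = num_samples * ratios[1] // total
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


num_classes, per_class = 104, 500
train, valid, test = split_indices(num_classes * per_class)
sizes = (len(train), len(valid), len(test))  # (31200, 10400, 10400)
```

Under these assumptions the 52,000 samples yield 31,200 training, 10,400 validation, and 10,400 test programs.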
Classifier      Features       Accuracy (%)
Rand/majority   –              50.0
RBF SVM         Bag-of-words   62.3
RBF SVM         Bag-of-trees   77.1
TBCNN           Learned        89.1

Model Variant                   Validation Acc. (%)
Coding layer → None             92.3
1-way pooling → 3-way           94.3
Continuous binary tree → CBoW   93.1
TBCNN with the best gadgets     94.4

Table 5: Effect of coding, pooling, and the continuous binary tree.

Figure 4: Validation accuracy versus the number of convolution units.

Results We tested our model on the annotated real-world programs. Note that the test samples were written by real-world programmers, and thus the styles and forms of bubble sort snippets may differ from the training set, for example, sorting an integer array versus sorting a user-defined structure, and sorting an array versus sorting two arrays simultaneously. As we see in Table 4, bag-of-words features are not illuminating in this classification and yield a low accuracy of 62.3%. Bag-of-trees features are better, and achieve 77.06%. Our model outperforms these methods by more than 10%. This experiment also suggests that neural networks can learn more robust features than just counting surface statistics.

⁸ The volunteer has neither authorship nor a conflict of interests.

Effect of pooling layer We tried two pooling methods in our TBCNN model, and compare them in Table 5 (the second and last rows). 3-way pooling is proposed in hope of preserving features from different parts of the tree. However, as indicated by the experimental result, the simple 1-way pooling works just as fine (even 0.1% higher on the validation set). This suggests that TBCNN is not sensitive to pooling methods, which mainly serve as a necessity for packing varying sized and shaped data. Further development can be addressed in future work.
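For reference, the 1-way pooling discussed here reduces a variable number of per-node feature vectors to one fixed-size vector by taking the maximum in each dimension. The sketch below shows only that reduction; the function name and the list-of-lists input are assumptions, and 3-way pooling (which also needs each node's position in the tree) is not covered.

```python
def one_way_max_pool(node_features):
    """Take the per-dimension maximum over all nodes' feature vectors.

    node_features: list of equal-length vectors, one per node in the tree
    produced by convolution. The output length does not depend on how many
    nodes the tree has.
    """
    if not node_features:
        raise ValueError("need at least one node to pool over")
    return [max(column) for column in zip(*node_features)]


# Trees of different sizes map to vectors of the same length.
small_tree = [[0.1, -0.5], [0.3, -0.9], [-0.2, 0.4]]
large_tree = small_tree + [[0.7, 0.0], [-0.6, 0.2]]
pooled_small = one_way_max_pool(small_tree)  # [0.3, 0.4]
pooled_large = one_way_max_pool(large_tree)  # [0.7, 0.4]
```

Because the output length equals the feature dimension regardless of tree size, the pooled vector can be fed directly to a fixed-size hidden layer.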
Effect of continuous binary tree The continuous binary tree is introduced to treat nodes with different numbers of children, as well as to capture order information of child nodes. We also implemented the continuous bag-of-words (CBoW) model, where child nodes' representations are averaged before convolution. Rows 4 and 5 in Table 5 compare our proposed continuous binary tree and the above alternative. The result shows a boost of 1.3% in considering child nodes' order information.

Conclusion
In this paper, we applied deep neural networks to the field of programming language processing. Due to the rich and explicit tree structures of programs, we proposed the novel Tree-Based Convolutional Neural Network (TBCNN). In our model, program vector representations are learned by the coding criterion; structural features are detected by the convolutional layer; the continuous binary tree and dynamic pooling enable our model to cope with trees of varying sizes and shapes. Experimental results show the superiority of our model to baseline methods.

Acknowledgments
We would like to thank anonymous reviewers for insightful comments; we also thank Xiaowei Sun for annotating bubble sort programs, Yuxuan Liu for data processing, and Weiru Liu for discussion on the manuscript. This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant Nos. 61421091, 61232015, 61225007, 91318301, and 61502014.

References
Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2006. Greedy layer-wise training of deep networks. In NIPS.
Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.
Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5(2):157–166.
Bettenburg, N., and Begel, A. 2013. Deciphering the story of software development through frequent pattern mining. In ICSE, 1197–1200.
Chilowicz, M.; Duris, E.; and Roussel, G. 2009. Syntax tree fingerprinting for source code similarity detection. In Proc. IEEE Int. Conf. Program Comprehension, 243–247.
Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. JMLR 12:2493–2537.
Dahl, G.; Mohamed, A.; and Hinton, G. 2010. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS.
Dietz, L.; Dallmeier, V.; Zeller, A.; and Scheffer, T. 2009. Localizing bugs in program executions with graphical models. In NIPS.
Duvenaud, D.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Gómez-Bombarelli, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. 2015. Convolutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292.
Ghabi, A., and Egyed, A. 2012. Code patterns for automatically validating requirements-to-code traces. In ASE, 200–209.
Hao, D.; Lan, T.; Zhang, H.; Guo, C.; and Zhang, L. 2013. Is this a bug or an obsolete test? In Proc. ECOOP, 602–628.
Hermann, K., and Blunsom, P. 2014. Multilingual models for compositional distributed semantics. In ACL, 58–68.
Hindle, A.; Barr, E.; Su, Z.; Gabel, M.; and Devanbu, P. 2012. On the naturalness of software. In ICSE, 837–847.
Hinton, G.; Osindero, S.; and Teh, Y. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554.
Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. In ACL, 655–665.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
LeCun, Y.; Jackel, L.; Bottou, L.; Brunot, A.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Muller, U.; and Sackinger, E. 1995. Comparison of learning algorithms for handwritten digit recognition. In Proc. Int. Conf. Artificial Neural Networks.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
Mou, L.; Peng, H.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2015. Discriminating neural sentence modeling by tree-based convolution. In EMNLP, 2315–2325.
Pane, J.; Ratanamahatana, C.; and Myers, B. 2001. Studying the language and structure in non-programmers' solutions to programming problems. Int. J. Human-Computer Studies 54(2):237–264.
Peng, H.; Mou, L.; Li, G.; Liu, Y.; Zhang, L.; and Jin, Z. 2015. Building program vector representations for deep learning. In Proc. 8th Int. Conf. Knowledge Science, Engineering and Management, 547–553.
Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015. Learning program embeddings to propagate feedback on student code. In ICML.
Pinker, S. 1994. The Language Instinct: The New Science of Language and Mind. Penguin Press.
Socher, R.; Huang, E.; Pennington, J.; Manning, C.; and Ng, A. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.
Socher, R.; Pennington, J.; Huang, E.; Ng, A.; and Manning, C. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, 151–161.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631–1642.
Socher, R.; Karpathy, A.; Le, Q.; Manning, C.; and Ng, A. Y. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL 2:207–218.
Steidl, D., and Gode, N. 2013. Feature-based detection of bugs in clones. In 7th Int. Workshop on Software Clones, 76–82.
Zaremba, W., and Sutskever, I. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.