The 2022 Mining Software Repositories Conference
GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses

Wei Ma* — University of Luxembourg, Luxembourg — wei.ma@uni.lu
Mengjie Zhao* — LMU Munich, Germany — mzhao@cis.lmu.de
Ezekiel Soremekun — University of Luxembourg, Luxembourg — ezekiel.soremekun@uni.lu
Qiang Hu — University of Luxembourg, Luxembourg — qiang.hu@uni.lu
Jie M. Zhang — University College London, United Kingdom — jie.zhang@ucl.ac.uk
Mike Papadakis — University of Luxembourg, Luxembourg — michail.papadakis@uni.lu
Maxime Cordy — University of Luxembourg, Luxembourg — maxime.cordy@uni.lu
Xiaofei Xie — Singapore Management University, Singapore — xfxie@smu.edu.sg
Yves Le Traon — University of Luxembourg, Luxembourg — yves.letraon@uni.lu
ACM Reference Format:
Wei Ma*, Mengjie Zhao*, Ezekiel Soremekun, Qiang Hu, Jie M. Zhang, Mike Papadakis, Maxime Cordy, Xiaofei Xie, and Yves Le Traon. 2022. GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses. In 19th International Conference on Mining Software Repositories (MSR '22), May 23–24, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3524842.3528456

ABSTRACT
Code embedding is a keystone in the application of machine learning to several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is generic. To this end,
we propose the first self-supervised pre-training approach (called
GraphCode2Vec) which produces task-agnostic embedding of lexical and program dependence features. GraphCode2Vec achieves
this via a synergistic combination of code analysis and Graph Neural
Networks. GraphCode2Vec is generic: it allows pre-training, and
it is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four (4) tasks (method name
prediction, solution classification, mutation testing and overfitted
patch classification), and compare it with four (4) similarly generic
code embedding baselines (Code2Seq, Code2Vec, CodeBERT, GraphCodeBERT) and seven (7) task-specific, learning-based methods. In
particular, GraphCode2Vec is more effective than both generic and
task-specific learning-based baselines. It is also complementary and
comparable to GraphCodeBERT (a larger and more complex model).
We also demonstrate through a probing and ablation study that
GraphCode2Vec learns lexical and program dependence features
and that self-supervised pre-training improves effectiveness.
1 INTRODUCTION
Applying machine learning to address software engineering (SE)
problems often requires a vector representation of the program
code, especially for deep learning systems. A naïve representation,
used in many SE applications, is one-hot encoding that represents
every feature with a dedicated binary variable (a vector including
binary values) [55]. However, this type of embedding is usually a high-dimensional sparse vector because the vocabulary size is very large in practice, which results in the notorious curse-of-dimensionality problem [4]. Besides, one-hot encoding suffers from the out-of-vocabulary (OOV) problem, which decreases model generalization capability such that it cannot handle new types of data [59].
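For illustration, a minimal sketch of one-hot encoding (with a hypothetical five-token vocabulary, not taken from the paper) shows both problems: the vector dimension grows with the vocabulary, and unseen tokens map to an uninformative all-zero vector.

```python
# Toy one-hot encoder: illustrates sparsity and the OOV problem.
vocab = ["int", "while", "return", "low", "high"]   # toy vocabulary
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    vec = [0] * len(vocab)      # dimension = |vocab|, mostly zeros
    if token in index:
        vec[index[token]] = 1
    return vec                  # OOV tokens yield an all-zero vector

print(one_hot("while"))  # [0, 1, 0, 0, 0]
print(one_hot("mid"))    # OOV: [0, 0, 0, 0, 0]
```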
To deal with these issues, researchers use dense and reasonably concise vectors to encode program features for specific SE
tasks, since they generalise better [30, 64, 66, 74]. More recently,
researchers apply natural language processing (NLP) techniques to
learn the universal code embedding vector for general SE tasks [1–
3, 5–7, 10, 17, 23, 26, 33, 49, 51, 62, 65]. The resulting code embedding
represents a mapping from the “program space” to the “latent space” that captures the semantics of the code, i.e., the semantic similarities between program snippets. The aim is that similar programs should have similar representations in the latent space.
State-of-the-art code embedding approaches focus either on syntactic features (i.e., tokens/AST) or on semantic features (i.e., program dependencies), ignoring the importance of combining both. For example, Code2Vec [3] and CodeBERT [17] focus on syntactic features, while PROGRAML [10] and NCC [5] focus on program semantics. There are few studies using both program semantics and syntax, e.g., GraphCodeBERT [23]. However,
KEYWORDS
code embedding, code representation, code analysis
* Both authors contributed equally.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
MSR ’22, May 23–24, 2022, Pittsburgh, PA, USA
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9303-4/22/05.
https://doi.org/10.1145/3524842.3528456
Figure 1: Motivating example showing (a) an original method (LowerBound), and two behaviorally equivalent clones of the
original method, namely (b) a renamed method (findLowerBound), and (c) a refactored method (getLowerBound).
public static int lowerBound(int[] array,
                             int length, int value) {
    int low = 0;
    int high = length;
    while (low < high) {
        final int mid = (low + high) / 2;
        if (value <= array[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return low;
}
(a) Original Method

public static int findLowerBound(int[] inputs,
                                 int size, int v) {
    int bounder = 0;
    int l = size;
    int mindex = 0;
    while (bounder < l) {
        mindex = (bounder + l) / 2;
        if (v <= inputs[mindex]) {
            l = mindex;
        } else {
            bounder = mindex + 1;
        }
    }
    return bounder;
}
(b) Renamed Method

public static int getLowerBound(int v,
                                int size, int[] inputs) {
    int h = size;
    int mindex = 0;
    int check = 0;
    while (check < h) {
        mindex = (check + h) / 2;
        if (v > inputs[mindex]) {
            check = mindex + 1;
        } else {
            h = mindex;
        }
    }
    return check;
}
(c) Refactored Method
Table 1: Cosine Similarity of three behaviorally/semantically similar program pairs from our motivating example,
using GraphCodeBERT, CodeBERT and GraphCode2Vec
these approaches are not precise: they do not obtain or embed the entire program dependence graph. Instead, they estimate program dependence via string matching (instead of static program analysis), and then augment ASTs with sequential data-flow edges.
To address these challenges, we propose the first approach (called
GraphCode2Vec) to synergistically capture syntactic and semantic program features with Graph Neural Networks (GNNs) via self-supervised pre-training. The key idea of our approach is to use static
program analysis and graph neural networks to effectively represent
programs in the latent space. This is achieved by combining lexical and program dependence analysis embeddings. During lexical
embedding, GraphCode2Vec embeds the syntactic features in the
latent space via tokenization. In addition, it performs dependence embedding to capture program semantics via static program analysis: it derives the program dependence graph (PDG) and represents it in the latent space using Graph Neural Networks (GNNs). It then
concatenates both lexical embedding and dependence embedding
in the program’s vector space. This allows GraphCode2Vec to be
effective and applicable on several downstream tasks.
To demonstrate the importance of semantic embedding, we compare the similarity of three pairs of programs using our approach, a syntax-only embedding approach (CodeBERT), and GraphCodeBERT, which embeds both syntax and semantics, albeit without program dependence analysis. Consider the example
of three program clones in Figure 1. This example includes three
behaviorally or semantically equivalent programs, that have low
syntactic similarity (i.e., different tokens), but with similar semantic
features, i.e., program dependence graphs (PDGs). To measure the
similarity distance in the latent space, in addition to the example
code clones (Figure 1), we randomly select 10 other different code
methods (from GitHub) without any change to establish a baseline
for comparing all approaches. To this end, we compute the average cosine similarity for all 91 program pairs (14×13/2) as a reference, to show that all approaches report similar scores for the 91 randomly selected pairs (Table 1).1 For all three approaches, the
similarity between the “original program” and a direct copy of the
program with only method name renaming to “searchLowerBound”,
Program Pairs                    GraphCodeBERT   CodeBERT   GraphCode2Vec
searchLowerBound & lowerBound    1               0.99       1
findLowerBound & lowerBound      0.70            0.61       0.99
getLowerBound & lowerBound       0.70            0.51       0.99
Average of 91 pairs              -0.05           -0.06      -0.03
is well captured with an almost perfect cosine similarity score for all
approaches (1 or 0.99). Likewise, the cosine similarity of the original
program and the “renamed” program (findLowerBound) is mostly
well captured by all approaches, since they all embed program syntax, albeit with lower cosine similarity scores for CodeBERT (0.61)
and GraphCodeBERT (0.70), in comparison to our approach (0.99).
Meanwhile, CodeBERT fails to capture the semantic similarity between the “original program” and the “refactored program”
(getLowerBound), even though they are behaviorally similar and
share similar program dependence. This is evidenced by the low cosine similarity score (0.51), because it does not account for semantic
information in its embedding, especially the similar program dependence graph shared by both programs. Lastly, GraphCodeBERT
performs slightly better than CodeBERT (0.70 vs. 0.51), but lower
than our approach (0.99). This is due to the lack of actual static program analysis in the embedding of GraphCodeBERT: since it only applies a heuristic (string matching) to estimate program dependence, it is imprecise. This example demonstrates the importance and necessity
of embedding precise dependence information.
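The similarity scores in Table 1 are cosine similarities between embedding vectors. As a minimal illustration (using hypothetical embedding vectors, not the actual model outputs), cosine similarity can be computed as:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|), ranging over [-1, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```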
A key ingredient of GraphCode2Vec is self-supervised pretraining. Even though task-specific learning based approaches (e.g.,
CNNSentence [45]) learn the vector representation of code without
pre-training, they are non-generic and less effective. Applying their
learned vector representation to other (SE) tasks requires re-tuning
model parameters, and the lack of pre-training is reflected in their performance. As an example, our evaluation (RQ1, Section 5) showed
that our self-supervised pretraining approach improves effectiveness when compared to 7 task-specific approaches (i.e., without
pretraining) addressing two (SE) tasks (solution classification and
patch classification). To further demonstrate the importance of
1 The purpose of computing the average cosine similarity of all 91 code pairs is to establish a meaningful reference for comparing embeddings and to serve as a sanity check. We expect the mean cosine similarity of a set of randomly selected pairs of code clones and non-clones to lie around zero for all approaches (range -1 to 1).
self-supervised pretraining, we compare the effectiveness of GraphCode2Vec with and without pretraining using two downstream
tasks. Overall, we demonstrate that our self-supervised pretraining
improves effectiveness by 28% (see RQ3).
To evaluate GraphCode2Vec, we compare it to four generic
code embedding approaches, and seven (7) task-specific learning-based applications. We also investigate the stability and learning
ability of our approach through sensitivity, ablation and probing
analyses. Overall, we make the following contributions:
Task-specific learning-based applications. We introduce the automatic application of GraphCode2Vec to solve specific downstream SE tasks, without extensive human intervention to adapt
model architecture. In comparison to the state-of-the-art task-specific
learning-based approaches (e.g., ODS [72]), our approach does not
require any effort to tune the hyper-parameters to be applicable
to a downstream task (Section 3). Our evaluation on two downstream tasks, solution classification and patch classification, showed
that GraphCode2Vec outperforms the state-of-the-art task-specific
learning-based applications: for all tasks, it outperforms all task-specific applications (RQ1 in Section 5).
Generic Code embedding. We propose a novel and generic code
embedding learning approach (i.e., GraphCode2Vec) that captures
the lexical, control flow and data flow features of programs through
a novel combination of tokenization, static code analysis and graph
neural networks (GNNs). To the best of our knowledge, GraphCode2Vec is the first code embedding approach to precisely capture syntactic and semantic program features with GNNs via selfsupervised pretraining. We demonstrate that GraphCode2Vec is
effective (RQ2 in Section 5): It outperforms all syntax-only generic
code embedding baselines. We provide our pre-trained models and
generic embedding for public use and scrutiny.2
Further Analyses. We extensively evaluate the stability and interpretability of our approach by conducting sensitivity, probing and
ablation analyses. We also investigate the impact of configuration
choices (i.e., pre-training strategies and GNN architectures) on the
effectiveness of our approach on downstream tasks. Our evaluation
results show that GraphCode2Vec effectively learns lexical and program dependence features, and that it is stable and insensitive to the choice of GNN architecture or pre-training strategy (RQ3 in Section 5).3
2 BACKGROUND

2.1 Generic code embedding
We discuss methods that learn general-purpose code representations to support several downstream tasks. These approaches are not designed for a specific task. There are three major types of generic code embedding approaches, namely syntax-based, semantic-based, and combined semantic- and syntactic-based approaches (see Table 2).

Syntax-based Generic Approaches: These approaches encode program snippets by dividing the program into strings, lexicalizing them into tokens, or parsing the program into a parse tree or abstract syntax tree (AST). Syntax-only generic embedding approaches include Code2Vec [3], Code2Seq [2], CodeBERT [17], C-BERT [7], InferCode [6], CC2Vec [26], AST-based NN [73] and ProgHeteroGraph [65] (see Table 2). Notably, these approaches use neural models for representing code (snippets), e.g., via code vectors (e.g., Code2Vec [3]), machine translation (e.g., Code2Seq [2]) or transformers (e.g., CodeBERT [17]). Code2Vec [3] is an AST-based code representation learning model that represents code snippets as a single fixed-length code vector. It decomposes a program into a collection of paths using an AST and learns the atomic representation of each path while simultaneously learning how to aggregate the set of paths. Code2Seq [2] is an alternative code embedding approach that uses sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), to encode code snippets. CodeBERT [17] is a bimodal pre-trained model for programming language (PL) and natural language (NL) tasks, which uses a transformer-based neural architecture to encode code snippets. Besides, CodeBERT [17], C-BERT [7] and Cu-BERT [33] are BERT-inspired approaches; these methods adopt methodologies similar to BERT [12] to learn code representations.

GraphCode2Vec is similar to the aforementioned generic code embedding methods: it is also a general-purpose code embedding approach that captures syntax by lexicalizing the program into tokens (see Table 2). However, all of the aforementioned generic approaches are syntax-based; none of them account for program semantics (i.e., data and control flow). Unlike these approaches, GraphCode2Vec additionally captures program semantics via static analysis. In this paper, we compare our approach (GraphCode2Vec) to the three (3) most popular and recent syntax-based generic code embedding approaches, namely Code2Vec [3], Code2Seq [2] and CodeBERT [17] (see Section 5).

Semantic-based Generic Approaches: This refers to code embedding methods that capture only semantic information, such as control- and data-flow dependencies in the program. Semantic-only generic approaches include NCC [5] and PROGRAML [10]. On one hand, NCC [5] extracts the contextual flow graph of a program by building an LLVM intermediate representation (IR) of the program. It then applies word2vec [43] to learn code representations. On the other hand, PROGRAML [10] is a language-independent, portable representation of whole-program semantics for deep learning, which is designed for data-flow analysis in compiler optimization. It adopts message passing neural networks (MPNNs) [22] to learn LLVM IR representations. In contrast to these approaches, GraphCode2Vec captures both semantics and syntax.

Combined Semantic- and Syntactic-based Approaches: There are generic approaches that capture both syntactic and semantic features, such as IR2Vec [5], OSCAR [49], ProgramGraph [1], ProjectCodeNet [51] and GraphCodeBERT [23]. IR2Vec [5] and OSCAR [49] use the LLVM IR representation of a program to capture program semantics. Meanwhile, ProgramGraph [1] uses a GNN to learn syntactic and semantic representations of code from ASTs augmented with data and control edges. ProgHeteroGraph leverages abstract syntax description language (ASDL) grammar to learn code representations via heterogeneous graphs [65]. Finally, GraphCodeBERT [23] is built upon CodeBERT [17], but in addition to capturing syntactic features it also accounts for semantics by employing data-flow information in the pre-training stage.
2 https://github.com/graphcode2vec/graphcode2vec
3 In the rest of this work, we use the terms “lexical” and “syntactic” interchangeably, as well as “(program) dependence” and “semantic”. Thus, the terms “lexical embedding” and “syntactic embedding” refer to the embedding of program syntax, and the terms “dependence embedding” and “semantic embedding” refer to the embedding of program dependence information.
Table 2: Details of the state-of-the-art code embedding approaches. “Semantic” or “Sem” means program dependence, and “Syntactic” or “Syntax” refers to strings, tokens, parse trees or ASTs. Symbol “✓” means the approach supports a feature, and “×” means it does not.

Type                      Approaches                                    Syntactic  Semantic
Task-specific (Syntax)    CNNSentence [45], OneCNNLayer [50],           ✓          ×
                          SequentialCNN [21]
Task-specific (Both)      SimFeatures [63], Prophet [41],               ✓          ✓
                          PatchSim [69], ODS [72]
Generic (Syntax-only)     CodeBERT [17], Code2Vec [3], Code2Seq [2],    ✓          ×
                          C-BERT [7], InferCode [6], CC2Vec [26],
                          AST-based NN [73], ProgHeteroGraph [65]
Generic (Sem.)            NCC [5], PROGRAML [10]                        ×          ✓
Generic (Both)            IR2Vec [5], OSCAR [49], ProgramGraph [1],     ✓          ✓
                          ProjectCodeNet [51], GraphCodeBERT [23],
                          GraphCode2Vec

Figure 2: Overview of GraphCode2Vec
Similar to these approaches, our approach (GraphCode2Vec) learns both syntactic and semantic features. In this work, we compare GraphCode2Vec to GraphCodeBERT because it is the most recent state-of-the-art and most closely related approach to ours, since it captures both syntax and semantics (see RQ2, Section 5).

2.2 Task-specific learning-based applications
Task-specific learning-based approaches are typically designed for a single SE task, but GraphCode2Vec is amenable to several tasks and learns a generic code representation beyond the specific task at hand, via model pre-training. Researchers have proposed specialised learning-based techniques to tackle specific SE downstream tasks, e.g., patch classification [41, 72] and solution classification [21, 45, 50]. In our experiments, we consider specialised learning approaches for both tasks. This is because these tasks have several software engineering applications, especially during software maintenance and evolution [41, 45, 72]. Table 2 highlights details of our task-specific learning methods.

Solution classification: Let us describe the state-of-the-art learning-based approaches for solution classification. Most of these approaches are syntax-based and adopt convolutional neural networks (CNNs) to classify programming tasks. SequentialCNN [21] applies a CNN to predict the language/tasks from code snippets using lexicalized tokens represented as a matrix of word embeddings. CNNSentence [45] is similar to SequentialCNN since it also uses CNNs, except that it classifies source code without relying on keywords, e.g., variable and function names. It instead considers the structural features of the program in terms of tokens that characterize arithmetic processing, loop processing, and conditional branch processing. Finally, OneCNNLayer [50] also uses a CNN for solution classification. It first pre-processes the program to remove unwanted entities (e.g., comments, spaces, tabs and new lines), then tokenizes the program to generate the code embedding using word2vec. The resulting embedding includes the token connections and their underlying meaning in the vector space.

Patch Classification: These are techniques designed to determine the correctness of patches (i.e., identify correct, wrong or over-fitting patches). These learning-based techniques can be static (e.g., ODS [72]), dynamic (e.g., Prophet [41]), heuristic-based (e.g., PatchSim [69]) or hybrid (e.g., SimFeatures [63]). Table 2 provides details of these approaches. Notably, they all capture both syntactic information (e.g., via AST) and program dependence information (e.g., via execution paths or control-flow information). For instance, PatchSim [69] is a heuristic approach that leverages the behavioral similarity of test case executions, via the complete path spectrum of test executions, to determine patch correctness. Meanwhile, Wang et al. [63] proposed SimFeatures, a hybrid strategy that identifies correct patches by integrating static code features with dynamic features or (test) heuristics. SimFeatures combines a learned static code model with dynamic or heuristic-based information (such as the dependency similarity between a buggy program and a patch) using majority voting. More recently, Ye et al. [72] proposed a supervised learning approach (called ODS) that employs static code features of patched and buggy programs to determine patch correctness, specifically to classify over-fitting patches. It uses supervised learning on static code features extracted at the AST level to learn a probabilistic model for determining patch correctness. ODS also tracks program dependencies by tracking control-flow statements. For this task, we compare GraphCode2Vec to ODS, PatchSim, Prophet and SimFeatures (see Section 5).

In this work, we compare GraphCode2Vec to the aforementioned seven (7) learning-based methods for solution classification and patch classification (see Section 5).
3 APPROACH
3.1 Overview
Figure 2 illustrates the steps and components of our approach.
First, GraphCode2Vec takes as input a Java program (i.e. a set of
class files) that is converted to a Jimple intermediate representation
(IR) [15]. Jimple is typed, based on a three-address code and provides
15 different operations; hence, it is easier to analyse and optimize
than Java bytecode (with over 200 operations). Secondly, GraphCode2Vec employs Soot [60] to obtain the program dependence
graph (PDG) by feeding the class files as input. From the resulting
Jimple representation and PDG, GraphCode2Vec learns two program embeddings, namely a lexical embedding and a dependence
embedding. These two embeddings are ultimately concatenated to
form the final code embedding.
To achieve lexical embedding, our approach first tokenizes the
Jimple instructions obtained from our pre-processing step into subwords. Next, given the subwords, our approach learns subword embedding using word2vec [42]. Then, it learns the instruction
embedding by representing every Jimple instruction as a sequence
of subword embeddings using a bi-directional LSTM (BiLSTM, Section 3.2). The forward and backward hidden states of this BiLSTM allow building the instruction embeddings. GraphCode2Vec employs a BiLSTM since it learns context better: a BiLSTM can learn both past and future information, while an LSTM only learns past information. Finally, it aggregates multiple instruction embeddings
using element-wise addition, in order to obtain the overall lexical
program embedding.
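The aggregation step can be sketched as follows (a minimal sketch with hypothetical two-dimensional instruction embeddings; element-wise addition keeps the output dimension fixed regardless of the number of instructions):

```python
def aggregate(instruction_embeddings):
    # Element-wise addition: the output has the same dimension as a
    # single instruction embedding, however many instructions there are.
    dim = len(instruction_embeddings[0])
    total = [0.0] * dim
    for emb in instruction_embeddings:
        for i, value in enumerate(emb):
            total[i] += value
    return total

print(aggregate([[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]))  # [4.5, 6.5]
```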
To learn the dependence embedding, GraphCode2Vec applies a
Graph Neural Network (GNN) [54] to embed Jimple instructions
and their dependencies. Each node in the graph corresponds to a
Jimple instruction and contains the (dependence) embedding of
this instruction. Node attributes are from lexical embeddings. The
edges of the graph represent the dependencies between instructions.
Our approach considers the following program dependencies: data
flow, control flow and method call graphs. GraphCode2Vec uses
intra-procedural analysis [18] to extract data-flow and control-flow
dependencies by invoking Soot [60]. Then, it builds method call
graphs via class hierarchy analysis [11].
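As a toy illustration (hypothetical instructions and dependencies, not the output of Soot), a method's dependence graph can be represented as nodes with typed edges:

```python
# Hypothetical toy dependence graph over Jimple-like instructions.
nodes = ["low = 0", "high = length", "if low < high", "mid = (low + high) / 2"]
edges = [
    (0, 2, "data"),     # the branch reads `low`, defined in node 0
    (1, 2, "data"),     # the branch reads `high`, defined in node 1
    (2, 3, "control"),  # node 3 executes only on one branch outcome
]

def deps_of(node, edge_list):
    # Collect the types of the dependencies flowing into a node.
    return sorted(t for src, dst, t in edge_list if dst == node)

print(deps_of(2, edges))  # ['data', 'data']
```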
The training of GNNs is an iterative process where, at each
iteration, the embedding of each node 𝑛 is updated based on the
embedding of the neighboring nodes (i.e., nodes connected to 𝑛)
and the type of 𝑛’s edges [70, 77]. The message passing function
determines how to combine the embedding of the neighbors – also
based on the edge types – and how to update the embedding of 𝑛 based
on its current embedding and the combined neighbors’ embedding.
The dependence embedding of an instruction is the embedding of
the corresponding node at the end of the training process.
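A heavily simplified sketch of one such update (ignoring edge types, learnable weights, and the specific GNN architecture) might look like:

```python
def message_passing_step(embeddings, edges):
    """One toy iteration: each node's new embedding is the average of
    its own embedding and its neighbors' (graph treated as undirected).
    Real GNN layers apply learned transforms and use edge-type info."""
    neighbors = {i: [] for i in range(len(embeddings))}
    for src, dst in edges:
        neighbors[dst].append(src)
        neighbors[src].append(dst)
    updated = []
    for i, emb in enumerate(embeddings):
        msgs = [embeddings[j] for j in neighbors[i]] + [emb]
        dim = len(emb)
        updated.append([sum(m[k] for m in msgs) / len(msgs) for k in range(dim)])
    return updated

print(message_passing_step([[1.0], [3.0]], [(0, 1)]))  # [[2.0], [2.0]]
```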
Finally, after obtaining lexical embedding and dependence embedding, our approach concatenates both embeddings to obtain the
overall program representation.
3.2 Lexical embedding
Step 1 - Jimple code tokenization: The first crucial step of GraphCode2Vec is to properly tokenize Jimple code into meaningful “tokens”, in order to learn the vector representations. The traditional way to tokenize code is to split it on whitespace. However, this approach is inappropriate for two reasons. First, whitespace-based tokenization often results in long tokens such as long method names (e.g., “getFunctionalInterfaceMethodSignature”). Long sequences often have a low frequency in a given corpus, which subsequently leads to an embedding of inferior quality. Second, whitespace-based tokenization is not able to process new words that do not occur in the training data – these out-of-vocabulary words are typically replaced by a dedicated “unknown” token. This is an obvious disadvantage for our approach, whose goal is to support practitioners in analyzing diverse programs – which may then include words that did not occur in the programs used to learn the embedding.

To address this challenge, we tokenize the Jimple code into subwords [37, 56, 67], which are units shorter than words, e.g., morphemes. Subwords have been widely adopted in representation learning systems for texts [13, 25, 52, 76] as they solve the problems of overly long tokens and out-of-vocabulary words. New code can be smoothly handled with short token representations, which limit the number of long but distinct tokens. Subwords get rid of the almost-infinite character combinations that are common in much program code. For example, this is the reason why BERT uses WordPiece subwords [67], while XLNet [71] and T5 [52] use SentencePiece subwords. Similarly, GraphCode2Vec uses SentencePiece subwords. When using subwords, the long token “getFunctionalInterfaceMethodSignature” is split into “get”, “Functional”, “Interface”, “Method” and “Signature”. It is worth noting that most of the subwords are in fact words, e.g., “get” [31]. In this step, punctuation (e.g., the semi-colon “;”) is treated as a common character.

Step 2 - Subword embedding with word2vec: Given a subword-tokenized Jimple code corpus C with vocabulary size |C|, our approach learns a subword embedding matrix E ∈ R^{|C|×𝑑}, where 𝑑 is a hyperparameter referring to the embedding dimension (𝑑 is usually set to 100). It uses the popular skip-gram with negative sampling (SGNS) method in word2vec [42] to produce E, which is then utilized as the subword embedding matrix.

Step 3 - Instruction embedding: After forming the subword embeddings, GraphCode2Vec represents every Jimple instruction as a sequence of subword embeddings (w0, w1, ..., w𝑛) and feeds it to a bidirectional LSTM (BiLSTM). The role of the BiLSTM is to learn the embedding of the instruction from the subword sequence of the instruction. Let h⃗𝑡 and h⃖𝑡 be the forward and backward hidden states of the LSTM after feeding the final subword. The instruction embedding is then formed by concatenating the two, denoted x = (h⃗𝑡, h⃖𝑡).

Step 4 - Instruction embedding aggregation: The last step in forming the lexical embedding is the aggregation of the instruction embeddings into the overall program lexical embedding. The reason why we aggregate instruction-level embeddings, as opposed to learning an embedding for the whole program, is that LSTMs work with sequences of limited length. After tokenization, a program can have many subwords; if one directly considered all of a program's subwords, one would need to cut them down to the LSTM's limited sequence length, resulting in information loss. GraphCode2Vec instead truncates only individual instructions into small sequences (not exceeding the maximal length), and uses element-wise addition as the instruction aggregation function. This operation allows for the aggregation of multiple instruction embeddings while keeping a fixed vector length.
3.3 Dependence embedding
Step 1 - Building method graphs: A method graph is a tuple 𝐺 =
(𝑉 , 𝐸, X, K), where 𝑉 is the set of nodes (i.e. Jimple instructions), 𝐸 is
the set of edges (dependence relations between the instructions), X
is the node embedding matrix (which contains the embedding of the
instructions) and K is the edge attribute matrix (which encodes the
dependencies that exist between instructions). For each node 𝑛 there is a column vector x𝑛 in X such that x𝑛 = (h⃗𝑡, h⃖𝑡) (the instruction embedding).
To define 𝐸 and K, our approach extracts data-flow and controlflow dependencies by invoking Soot [18, 60]. Then, GraphCode2Vec
introduces an edge between two nodes if and only if the two corresponding instructions share some dependence.
Step 2 - Building program graphs: A program graph consists of
a pair P = (G, R) where G = {𝐺 0, 𝐺 1, ..., 𝐺𝑚 } is a set of method
graphs and where R ⊆ G² is the call relation between the methods,
that is, (𝐺𝑖 , 𝐺 𝑗 ) ∈ R if and only if the method that 𝐺𝑖 represents
calls the method that 𝐺 𝑗 represents. To represent this relation in
the GNN, GraphCode2Vec introduces an entry node and an exit
node for each method and edges linking those nodes with caller
instructions.
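A minimal sketch of this wiring (hypothetical method and node names) connects each call site to the callee's entry node, and the callee's exit node back to the call site:

```python
def add_call_edges(calls):
    """Each call is (caller_method, call_site_node, callee_method).
    Returns edges linking the call site to the callee's entry node,
    and the callee's exit node back to the call site."""
    edges = []
    for caller, site, callee in calls:
        edges.append(((caller, site), (callee, "entry")))
        edges.append(((callee, "exit"), (caller, site)))
    return edges

edges = add_call_edges([("main", 2, "lowerBound")])
print(edges[0])  # (('main', 2), ('lowerBound', 'entry'))
```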
Step 3 - Message passing function: The exact definition of the
message passing function depends on the used GNN architecture.
We choose widely-used GNN architectures with linear complexity [68] that have been successfully applied in various application domains. GraphCode2Vec employs four GNN architectures,
namely Graph Convolutional Network (GCN; Kipf and Welling
[35]), GraphSAGE [24], Graph Attention Network (GAT; Veličković
et al. [61]), and Graph Isomorphism Network (GIN; Xu et al. [70]).
Step 4 - Learning the dependence embedding: The dependence
embedding of each instruction is obtained by running the message
passing function on all nodes for a pre-defined number of iterations, i.e., the number of GNN layers. Once these instruction embeddings have been produced, GraphCode2Vec aggregates them
using the global attention pool operation [39] in order to produce
the program-level dependence embedding. The attention mechanism allows the program-level dependence embedding to weight important nodes (instructions) more heavily.
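In spirit (not the exact operation from [39], whose attention scores come from a learned gate network; here the scores are simply given), global attention pooling weights nodes by a softmax before summing:

```python
import math

def global_attention_pool(node_embeddings, attention_scores):
    # Softmax over the scores, then a weighted sum of node embeddings,
    # so higher-scored ("more important") instructions contribute more.
    exps = [math.exp(s) for s in attention_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(node_embeddings[0])
    return [sum(w * emb[k] for w, emb in zip(weights, node_embeddings))
            for k in range(dim)]

print(global_attention_pool([[2.0], [4.0]], [0.0, 0.0]))  # [3.0]
```

With uniform scores this reduces to mean pooling; skewed scores pull the result toward the dominant node.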
The dependence embeddings that the GNN produces depend on the learnable parameters of (a) the message passing function and (b) the bidirectional LSTM from Section 3.2. These parameters can be
automatically set to optimize the effectiveness of GraphCode2Vec
either directly on the downstream task or on some pre-training
objectives, as described hereafter.
Finally, to obtain the program embedding vector, our approach uses feature fusion, i.e., a concatenation operator that combines the lexical embedding and the dependence embedding. Specifically, GraphCode2Vec concatenates the tensors of both features into one tensor to reduce information loss. Concatenation has been shown to be an effective way to fuse features without information loss in DNNs [20, 28, 38, 46], e.g., DenseNet [57] and U-Net [58]. Although the dependence embedding inherently encodes the lexical embedding, the importance of the lexical information fades as the semantic representation is learnt. Our ablation study (see RQ3 in Section 5) later reveals the benefits of concatenating an explicit lexical embedding with the dependence embedding.
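The fusion step amounts to plain concatenation. As a one-line sketch (our illustration, with made-up dimensions):

```python
def fuse(lexical_emb, dependence_emb):
    """Feature fusion by concatenation: both views are kept verbatim, so
    no information is lost (unlike summing or averaging)."""
    return list(lexical_emb) + list(dependence_emb)

# A 2-d lexical embedding fused with a 3-d dependence embedding -> 5-d vector.
program_emb = fuse([0.1, 0.2], [0.3, 0.4, 0.5])
```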
3.4 Pre-training
Self-supervised learning has been applied with success for pre-training deep learning models [16, 40, 53]. It allows a model to learn how to perform tasks without human supervision [44, 78] by learning a universal embedding that can be fine-tuned to solve multiple downstream tasks. In this work, we employed three (3) self-supervised learning strategies to pre-train the BiLSTM and GNN in GraphCode2Vec, namely node classification, context prediction [27], and variational graph encoding (VGAE) [36]. Node (or instruction) classification trains the model to infer the type of a Jimple instruction, given its embedding. Context prediction requires the model to predict a masked node representation, given its surrounding context. Variational graph encoding (VGAE) learns to encode and decode the code dependence graph structure. Note that these pre-training procedures do not require any human-labeled datasets: the model learns from the raw datasets without any human supervision.
4 EXPERIMENTAL SETUP
Research Questions: Our research questions (RQs) are designed to evaluate the effectiveness of GraphCode2Vec. In particular, we compare the effectiveness of GraphCode2Vec to the state-of-the-art in task-specific and generic code embedding methods (see RQ1 and RQ2). This is to demonstrate the utility of GraphCode2Vec in solving downstream tasks, in comparison to specialised learning-based approaches tailored towards solving specific SE tasks (RQ1) and other general-purpose code embedding approaches (RQ2). We also examine whether GraphCode2Vec effectively embeds lexical and program dependence features in the latent space, and how this impacts its effectiveness on downstream tasks (see RQ3). The first goal of RQ3 is to demonstrate the validity of our approach, i.e., to analyse via probing analysis that it indeed embeds lexical and dependence features as intended. In addition, we analyse the contribution of the lexical embedding and the dependence embedding to its effectiveness on downstream tasks by conducting an ablation study. We also investigate the sensitivity of our approach to the choices in GraphCode2Vec's framework, e.g., the model pre-training (strategy) and the GNN configuration. These experiments allow us to evaluate the influence of these choices on the effectiveness of GraphCode2Vec.
Specifically, we ask the following research questions (RQs):
RQ1 Task-specific learning-based applications: Is our approach (GraphCode2Vec) effective in comparison to the state-of-the-art task-specific learning-based applications? What is the benefit of capturing semantic features in our code embedding?
RQ2 Generic Code Embedding: How effective is our approach (GraphCode2Vec), in comparison to the state-of-the-art syntax-only generic code embedding approaches? What is the impact of capturing both syntactic and semantic features (i.e., program dependencies) in code embedding? How does GraphCode2Vec compare to GraphCodeBERT, a larger and more complex model?
RQ3 Further Analyses: What is the impact of model pre-training on the effectiveness of GraphCode2Vec? Does our approach effectively capture lexical and program dependence features? What is the contribution of the lexical embedding or the dependence embedding to the effectiveness of our approach on downstream tasks? Is our approach sensitive to the choice of GNN?
Baselines: We compare the effectiveness of GraphCode2Vec to several state-of-the-art code embedding approaches (aka generic baselines), and to specialised, task-specific learning-based applications. On one hand, generic baselines refer to code embedding approaches that are designed to be general-purpose, i.e., they provide
a code embedding that is amenable to address several downstream tasks. On the other hand, task-specific baselines refer to learning-based approaches that address a specific downstream SE task, e.g.,
patch classification. Table 2 provides details about these baselines
for solution classification and patch classification. Specifically, we
evaluated GraphCode2Vec in comparison to four (4) generic code
embedding approaches, namely Code2Seq [2], Code2Vec [3], CodeBERT [17] and GraphCodeBERT [23] (see RQ2 in section 5). We
have selected these generic baselines because they have been evaluated against several well-known state-of-the-art code embedding
methods and demonstrated considerable improvement over them.
Besides, these approaches are recent, popularly used and have been
applied on many downstream (SE) tasks.
For task-specific learning-based approaches, we consider solution classification, and patch classification. These are popular SE
downstream tasks that have been studied using learning-based
approaches. We utilised three (3) specialised learning-based baselines for the solution classification task, namely CNNSentence [45],
OneCNNLayer [50] and SequentialCNN [21]. We also used all four
patch classifiers (Prophet [41], PatchSim [69], SimFeatures [63] and
ODS [72]). These task-specific baselines have been selected because
they have been shown to outperform other proposed learning-based
approaches for these tasks. For instance, SequentialCNN [21] has
been evaluated against five other learning-based approaches and
demonstrated to be more effective. ODS [72] has also been shown to
be more effective and efficient than the three other patch classifiers.
Table 3: Details of Subject Programs

Subject Program   #Progs.   Tasks/Analyses
Java-Small        11        Method Name Prediction and Ablation Studies
Java250           75000     Solution Classification and Ablation Studies
Defects4J         15 & 5    Mutant Prediction and Patch Classification
LeetCode-10       100       Probing Analysis
M-LeetCode        100       Probing Analysis
Concurrency       46        Probing Analysis
Jimple-Graph      1976      Model Pre-training
Subject Programs: In our experiments, we employed eight (8) subject programs written in Java. Table 3 provides details about each of our subject programs and their experimental usage. Notably, we employ four (4) publicly available programs for the downstream tasks, namely Defects4J [32], Java-Small [3], and Java250 [51]. Since we need the bytecode representation of each program, we utilise only programs that can be compiled. These datasets were employed for our comparative evaluation (see RQ1 and RQ2). We chose these datasets because they are popular and have been employed in the evaluation of our downstream tasks in previous studies [2, 51, 72, 74]. Besides, we employed Java-Small and Java250 in our ablation study, where we evaluate the contribution of lexical and dependence embedding to the effectiveness of GraphCode2Vec (RQ3). We chose these two datasets for this task because they correspond to tasks that require lexical and semantic information to be effectively addressed. To further analyze GraphCode2Vec (see RQ3), we employed the Concurrency dataset [14, 19] and collected two (2) subject programs (named LeetCode-10 and M-LeetCode) from LeetCode⁴. We use these programs to investigate the difference between capturing lexical and dependence information. In particular, the Concurrency dataset contains different concurrent code types, which have similar syntactic/lexical features but different structural information. We mutated LeetCode-10 to create the M-LeetCode dataset. Our mutation preserves lexical features but modifies semantic or program dependence features, such that LeetCode-10 and M-LeetCode have the same lexical features but different semantics. For example, a simple dependence mutant involves switching outer and inner loops. We utilize LeetCode-10, M-LeetCode and Concurrency for the probing analysis of our approach (GraphCode2Vec).
⁴ https://leetcode.com/
Downstream Tasks: In our evaluation, we considered four (4) major software engineering tasks, namely mutant prediction, patch classification, method name prediction, and solution classification. These are popular downstream SE tasks that have been investigated in the community for decades. For these four tasks, we evaluated GraphCode2Vec in comparison to four generic baselines, namely Code2Seq [2], Code2Vec [3], CodeBERT [17] and GraphCodeBERT [23]. Table 3 provides details on the subject programs employed for each downstream task. In the following, we provide further details about the experimental setup for each task evaluated in this paper.
Method Name Prediction: This refers to the task of predicting the method name of a function in a program, given a set of method names and the body of the function as inputs [6]. This task is useful for automatic code completion during programming. In our experiment, all four generic baselines were evaluated for this task. We evaluated this task using the Java-Small dataset, since it was designed for this task in previous studies [3] (see Table 3).
Solution Classification: This refers to the classification of source code into a predefined number of classes, e.g., based on the task it solves [50], or the programming language [21]. This is useful to assist or assess programming tasks and to manage code warehouses. We evaluated all four generic baselines on this task, as well as three specialised learning-based approaches for this task, namely CNNSentence [45], OneCNNLayer [50] and SequentialCNN [21] (Table 2). We evaluated this task using the Java250 dataset, which was designed for this task in previous studies [51] (see Table 3).
Patch Classification: For this task, the aim is to identify the correctness of patches, i.e., whether a patch is (in)correct, wrong or overfitting [69, 72]. In our experiment, we compare the performance of GraphCode2Vec to the four generic baselines, as well as the current state-of-the-art learning-based approach for patch classification, i.e., ODS [72]. We employed the Defects4J [32] dataset (see Table 3), which has also been used by previous studies for this task [69, 72]. The goal of this task is to identify over-fitting APR patches. We used five (5) programs and 890 APR patches⁵, containing 643 over-fitting patches and 247 correct patches.
⁵ We exempted 12 patches out of the 902 patched programs used by ODS, since they deleted complete functions, and there is no code representation for deleted functions.
Mutant Prediction: The goal of this task is to predict different types of mutants employed during mutation testing. Mutation testing is an important SE task that is typically deployed to determine the adequacy of a test suite to expose injected faults in a program [47]. In this work, we predict if a mutant is killable or live [8]. To this end, we employ the Defects4J [32] dataset (see Table 3), which has been popularly employed for several SE tasks, including mutation testing [48]. We curated a mutant prediction dataset containing 15 Java programs and 16,216 mutants.
Pre-training Setup: For model pre-training, we curated the Jimple-Graph dataset from the Maven repository⁶; it contains 1,976 Java libraries with about 3.5 million methods in total. We randomly sample around 10% of the data for pre-training. These Java libraries come from 42 application domains, including math and image processing libraries, which ensures a reasonable program diversity. For the BiLSTM component (Section 3.2), we use one layer with hidden dimension size 150. We pre-train sub-tokens on the Jimple text of each program; the sub-token embedding dimension is set to 100 (see Section 3). We fine-tune the downstream tasks using the pre-trained weights obtained after one epoch. All GNNs use five (5) layers with dropout ratio 0.2. We use the Adam [34] optimizer with a 0.001 learning rate. In our experiment, we evaluated all three (3) pre-training strategies (Section 3.4).
GraphCode2Vec's Configuration: We employ three (3) pre-training strategies, namely node classification, context prediction and VGAE. Our approach supports four (4) GNN architectures for dependence embedding (see Section 3), namely GCN [35], GraphSAGE [24], GAT [61] and GIN [70]. In total, we have 12 possible configurations; the default configuration is context prediction for pre-training and dependence embedding with the GAT architecture. In our experiments, we evaluate the effect of each configuration on the effectiveness of our approach (see Section 5). For method name prediction, we used the default data split of the public Java-Small dataset. For the patch classification task, we used the 10-fold cross validation described in ODS [72]. For all other experiments, we keep 10-20% of the dataset as test data.
Implementation Details and Platform: GraphCode2Vec was implemented in about 4.8 KLOC of Python code, using the PyTorch ML framework. Our data processing and evaluation code is about 3 KLOC of Java code. We use Soot [60] to extract the program dependence graph (PDG). We reuse the code from the public repository of each baseline in our experiments⁸. However, we adapt each baseline to our downstream tasks, e.g., by replacing the classifier but using the same performance metrics. All experiments were conducted on a Tesla V100 GPU server with 40 CPUs (2.20 GHz) and 256 GB of main memory. The implementation of GraphCode2Vec is available online⁹.
Metrics and Measures: For all tasks, we report F1-score, precision and recall. We discuss most of our results using F1-score, since it is the harmonic mean of precision and recall. Besides, it is a better metric than accuracy, especially when the dataset is imbalanced (e.g., Java-Small). Hence, we do not report accuracy for imbalanced datasets; e.g., the mutant data is imbalanced, with about 30% live mutants and 70% killable mutants. We provide the code details in the GitHub repository⁷.
Probing Analysis: The goal of our probing analysis is to ensure that lexical and dependence features are indeed learned by GraphCode2Vec's code embedding. Probing is a widely used technique to examine an embedding for desired properties [9, 53, 75]. To this end, we trained diagnostic classifiers to probe GraphCode2Vec's code embedding for our desired properties (i.e., lexical and/or program dependence features). Concretely, we train a simple classifier with one MLP layer, fed with the learned code embedding (e.g., lexical), to examine if our code embedding encodes the desired property. To achieve this, we curated dedicated datasets for training and evaluating our probing classifiers. Specifically, we employ three probing datasets, namely LeetCode-10, M-LeetCode and Concurrency (Table 3). We have employed these datasets because they require lexical or dependence embedding to address their corresponding tasks.
Probing Task Design: We design four probing tasks. The first three (Task-1, Task-2 and Task-3) use LeetCode-10 and M-LeetCode, and the last one (Task-4) uses Concurrency. Task-1 classifies which problem a solution code solves on LeetCode-10. LeetCode-10 shares lexical token similarities within one problem group, and some solutions from different problem groups may have the same semantic structure, e.g., using one for-loop. Therefore, we hypothesize that the lexical embedding is more informative than the semantic embedding for Task-1. Task-2 mixes LeetCode-10 and M-LeetCode, and then judges which dataset the input code is from (binary classification). LeetCode-10 and M-LeetCode share many similar lexical tokens, but their code semantic structures are different. Hence, the semantic embedding should be more informative than the lexical (syntactic) embedding. Task-3 also mixes the two datasets but uses all 20 labels instead of a binary classification. Task-3 integrates Task-1 and Task-2, requiring both lexical and semantic information. Task-4 is a concurrency bug classification task: code with the same label can have high lexical similarity, but the code semantic structure should be different.
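To make the probing protocol concrete, the following is a minimal toy sketch of our own, not the paper's classifier: a logistic-regression probe stands in for the one-MLP-layer classifier, and the "frozen embeddings" are hand-made 2-d vectors whose first dimension encodes the probed property.

```python
import math

def train_probe(embs, labels, dim, lr=0.5, epochs=200):
    """Train only the probe; the embedding function stays frozen. High
    probe accuracy suggests the embedding encodes the probed property."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, embs, labels):
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in embs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Frozen toy embeddings: dimension 0 carries the property being probed.
embs = [[1.0, 0.3], [0.9, -0.2], [-1.0, 0.4], [-0.8, 0.1]]
labels = [1, 1, 0, 0]
w, b = train_probe(embs, labels, dim=2)
acc = probe_accuracy(w, b, embs, labels)
```

If the property is linearly recoverable from the frozen embedding, the probe reaches high accuracy; if not, accuracy stays near chance.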
5 EXPERIMENTAL RESULTS
RQ1 Task-specific learning-based applications: This experiment examines how GraphCode2Vec compares to seven (7) state-of-the-art task-specific learning-based techniques for solution classification and patch classification. We selected these two tasks for this experiment due to their popularity, the availability of ML-based baselines and their application to vital SE tasks, e.g., automated program repair, patch validation, code evolution, and software warehousing. We evaluated against three solution classifiers, namely CNNSentence [45], OneCNNLayer [50] and SequentialCNN [21]. We also compare GraphCode2Vec to four patch classifiers: Prophet [41], PatchSim [69], SimFeatures [63] and ODS [72].
Our evaluation results show that GraphCode2Vec outperforms the state-of-the-art task-specific learning-based approaches for the tested tasks, i.e., patch classification and solution classification. Table 5 highlights the effectiveness of GraphCode2Vec in comparison to learning-based approaches for patch classification and solution classification, respectively. In particular, GraphCode2Vec outperforms all seven task-specific baselines in our evaluation. GraphCode2Vec outperforms all three baselines for solution classification; it is almost twice as effective as SequentialCNN and OneCNNLayer, and 40% more effective than the best baseline, CNNSentence (see Table 5). In addition, GraphCode2Vec outperforms all four state-of-the-art patch classifiers, i.e., ODS [72], Prophet [41], PatchSim [69] and SimFeatures [63]. It is at least twice as effective as PatchSim (in terms of recall) and slightly (up to 2%) more effective than the best baseline, i.e., ODS (see Table 5). This result demonstrates the utility of our approach in addressing both downstream tasks. Furthermore, it highlights the effectiveness of generic code embedding in comparison to specialised learning-based approaches. This superior performance can be attributed to the fact that GraphCode2Vec is generic, and it employs self-supervised model pre-training.

Table 5: Effectiveness of GraphCode2Vec (aka "Graph.") vs. Task-Specific learning-based approaches for two SE tasks. The best results are in bold text, the results for the second best-performing approach are in italics. The improvement in effectiveness between GraphCode2Vec and the best-performing baseline is reported in "Graph. (% Improv.)".

Solution Classification:
            CNN Sen.   One CNN.   Seq. CNN   Graph. (% Improv.)
F1-Score    0.690      0.540      0.470      0.970 (40.6%)
Recall      0.690      0.540      0.470      0.970 (40.6%)
Precision   0.700      0.550      0.480      0.970 (38.6%)

Patch Classification:
            SimFeatures   Prophet   PatchSim   ODS     Graph. (% Improv.)
F1-Score    0.881         0.892     0.881      0.900   0.915 (1.7%)
Recall      0.895         0.891     0.389      0.950   0.960 (2.1%)
Precision   0.870         0.889     0.830      0.924   0.936 (1.3%)

GraphCode2Vec is up to two times (2x) more effective than the seven (7) state-of-the-art task-specific approaches, for both tasks.

RQ2 Generic Code embedding: In this experiment, we demonstrate how GraphCode2Vec compares to the state-of-the-art generic code embedding approaches. We thus compare the effectiveness of GraphCode2Vec with three (3) syntax-only generic baselines, namely CodeBERT, Code2Seq and Code2Vec. Additionally, we compare the effectiveness of our approach to a larger and more complex state-of-the-art generic approach that captures both syntax and semantics, namely GraphCodeBERT. We used four (4) downstream SE tasks: method name prediction, solution classification, mutant prediction and patch classification.
Syntax-only Generic Embedding: In our evaluation, we found that our approach (GraphCode2Vec) outperforms all syntax-based generic baselines for all tasks. Table 4 highlights the effectiveness of GraphCode2Vec in comparison to the baselines (i.e., Code2Seq, Code2Vec and CodeBERT). As an example, consider method name prediction: GraphCode2Vec is twice as effective as some baselines, e.g., Code2Vec. For all (four) tasks, GraphCode2Vec clearly outperforms all baselines across all metrics. It is up to 12% and 18% more effective than the best baselines, CodeBERT and Code2Seq, respectively. We observed that CodeBERT is the best baseline on three tasks. We attribute the performance of CodeBERT on these tasks to its much higher complexity (i.e., a huge number of trainable parameters, more than 124M) and the size of its pre-training dataset (8.5M) [29]. Overall, our results demonstrate that including semantic program features improves the performance of code representation across these downstream tasks, thus emphasizing the importance of semantic features in addressing SE tasks, especially the need to capture program dependencies in code representation.

Table 4: Effectiveness of GraphCode2Vec vs. Syntax-only Generic Code Embedding approaches. The best results are in bold text, the results for the best-performing baseline are in italics. We report the improvement in effectiveness between GraphCode2Vec and the best-performing baseline in "% Improvement"; improvements above five percent (>5%) are in bold text.

                 Method Name Prediction   Solution Classification   Mutant Prediction        Patch Classification
                 F1      Prec.   Recall   F1      Prec.   Recall    F1      Prec.   Recall   F1      Prec.   Recall
Code2Seq         0.4920  0.5963  0.4187   0.7542  0.7678  0.7536    0.5911  0.6423  0.5881   0.8901  0.8355  0.9541
Code2Vec         0.3309  0.3779  0.2943   0.8034  0.8081  0.8028    0.6398  0.6632  0.6320   0.8787  0.8806  0.8782
CodeBERT         0.3963  0.3295  0.4969   0.8783  0.8747  0.8878    0.7106  0.7305  0.6995   0.9275  0.9099  0.9473
GraphCode2Vec    0.5807  0.6150  0.5502   0.9746  0.9753  0.9746    0.7542  0.7569  0.7524   0.9359  0.9145  0.9602
% Improvement    18.03%  3.14%   10.73%   10.96%  11.50%  9.78%     6.14%   3.61%   7.56%    0.91%   0.51%   0.64%

For all (four) tasks, GraphCode2Vec is (up to 18%) more effective than (the best) syntax-only baselines.

Complementarity with GraphCodeBERT: We also observe that despite the lower complexity of our approach (GraphCode2Vec), it is comparable and complementary to GraphCodeBERT across the tested tasks. GraphCodeBERT captures both syntactic and semantic program features, but it is significantly larger and more complex than GraphCode2Vec. Table 6 highlights the complexity and effectiveness of GraphCodeBERT in comparison to GraphCode2Vec. For instance, GraphCodeBERT has at least 50 times (50x) as many trainable parameters as GraphCode2Vec (124 million versus 2.8 million parameters), and seven times (7x) as much pre-training data (2.3M versus 314K methods). Despite the difference in size and complexity, GraphCodeBERT has a comparable performance to GraphCode2Vec. Specifically, GraphCode2Vec outperforms GraphCodeBERT on two tasks (method name prediction and patch classification), and it is comparable on the other two tasks (solution classification and mutant prediction). Notably, GraphCodeBERT has a negligible improvement over GraphCode2Vec for these two tasks (about 1%). These results demonstrate that although simpler and trained on 7 times less data, GraphCode2Vec is complementary to GraphCodeBERT. This disparity in size and complexity implies that precise program dependence information is important. Nevertheless, our results show that both GraphCode2Vec and GraphCodeBERT are more effective than syntax-only approaches, e.g., CodeBERT (cf. Table 5 and Table 6).

GraphCode2Vec is complementary to GraphCodeBERT despite being simpler and trained on seven times (7x) less data. It is more effective on two tasks, and comparable on the other two tasks.

⁶ https://mvnrepository.com/
⁷ https://github.com/graphcode2vec/graphcode2vec
⁸ https://github.com/tech-srl/code2vec, https://github.com/tech-srl/code2seq, https://github.com/microsoft/CodeBERT, https://github.com/hukuda222/code2seq
⁹ https://github.com/graphcode2vec/graphcode2vec
RQ3 Further Analyses: The goal of this research question is to
examine the impact of model pre-training on improving GraphCode2Vec’s effectiveness on downstream tasks. We also investigate
if GraphCode2Vec effectively captures lexical and/or semantic
Table 6: Effectiveness of GraphCode2Vec vs. GraphCodeBERT. Lower complexity, the best results and higher improvements (above five percent (>5%)) are in bold text.

Generic Code     Model  Pretrain  Method Name Prediction    Solution Classification    Mutant Prediction         Patch Classification
Embedding        Size   Data      F1      Prec.    Recall   F1       Prec.    Recall   F1      Prec.    Recall   F1      Prec.   Recall
GraphCodeBERT    124M   2.3M      0.5761  0.7261   0.4775   0.9850   0.9868   0.9843   0.7649  0.768    0.7623   0.9317  0.9108  0.9557
GraphCode2Vec    2.8M   314K      0.5807  0.6150   0.5502   0.9746   0.9753   0.9746   0.7542  0.7569   0.7524   0.9359  0.9145  0.9602
% Improvement    50X    7X        7.99%   -15.30%  15.23%   -1.07%   -1.17%   -0.18%   -1.40%  -1.45%   -1.30%   0.45%   0.41%   0.47%
Table 7: Probing Analysis results showing the accuracy for all pre-training strategies and GNN configurations. Best results for each sub-category are in bold, and the better result between the syntactic (lexical) embedding and the semantic embedding is in italics. "syn+sem" refers to GraphCode2Vec's models capturing both syntactic and semantic features.

Task-1 (syntax-only):
Pre-training   Captured     GCN    GIN    GSAGE  GAT
Context        semantic     0.822  0.674  0.842  0.886
Context        syntactic    0.934  0.938  0.942  0.928
Context        syn+sem      0.918  0.928  0.95   0.942
Node           semantic     0.758  0.820  0.802  0.840
Node           syntactic    0.904  0.884  0.876  0.916
Node           syn+sem      0.872  0.9    0.876  0.902
VGAE           semantic     0.856  0.812  0.868  0.866
VGAE           syntactic    0.916  0.932  0.928  0.950
VGAE           syn+sem      0.92   0.926  0.928  0.938
Best Config.: Syntactic = 8/12

Task-2 (semantic-only):
Pre-training   Captured     GCN    GIN    GSAGE  GAT
Context        semantic     0.684  0.614  0.704  0.741
Context        syntactic    0.615  0.602  0.617  0.602
Context        syn+sem      0.641  0.641  0.688  0.797
Node           semantic     0.651  0.667  0.741  0.686
Node           syntactic    0.584  0.587  0.606  0.593
Node           syn+sem      0.624  0.618  0.691  0.67
VGAE           semantic     0.594  0.653  0.583  0.617
VGAE           syntactic    0.591  0.572  0.594  0.599
VGAE           syn+sem      0.59   0.63   0.591  0.596
Best Config.: Semantic = 9/12

Task-3 (syntax and semantic):
Pre-training   Captured     GCN    GIN    GSAGE  GAT
Context        semantic     0.513  0.381  0.543  0.612
Context        syntactic    0.529  0.527  0.528  0.527
Context        syn+sem      0.559  0.546  0.587  0.6
Node           semantic     0.426  0.514  0.625  0.563
Node           syntactic    0.516  0.504  0.490  0.513
Node           syn+sem      0.522  0.508  0.572  0.545
VGAE           semantic     0.403  0.532  0.407  0.477
VGAE           syntactic    0.485  0.494  0.492  0.495
VGAE           syn+sem      0.498  0.548  0.508  0.492
Best Config.: Syntactic + Semantic = 7/12

Task-4 (semantic-only):
Pre-training   Captured     GCN    GIN    GSAGE  GAT
Context        semantic     0.654  0.666  0.657  0.594
Context        syntactic    0.580  0.525  0.524  0.449
Context        syn+sem      0.605  0.592  0.608  0.592
Node           semantic     0.647  0.664  0.659  0.670
Node           syntactic    0.484  0.476  0.420  0.550
Node           syn+sem      0.519  0.522  0.451  0.57
VGAE           semantic     0.673  0.680  0.674  0.656
VGAE           syntactic    0.523  0.617  0.584  0.591
VGAE           syn+sem      0.627  0.658  0.531  0.586
Best Config.: Semantic = 12/12
Table 8: Effectiveness (F1-Score) of GraphCode2Vec on all GNN configurations and pre-training strategies, for all downstream tasks. For each subcategory, the best results for each category are in bold text.

Method Name Prediction:
GNN         No Pretraining  Context  Node    VGAE    Average
GCN         0.4494          0.5018   0.4859  0.5337  0.4930
GIN         0.4347          0.4684   0.4037  0.5266  0.4584
GraphSage   0.3998          0.5006   0.4531  0.5412  0.4736
GAT         0.4246          0.5807   0.6194  0.5890  0.5534
Average     0.4271          0.5129   0.4905  0.5476
Variance    0.0003          0.0017   0.0064  0.0006
SD          0.0180          0.0413   0.0800  0.0244

Solution Classification:
GNN         No Pretraining  Context  Node    VGAE    Average
GCN         0.9679          0.9710   0.9710  0.9751  0.9712
GIN         0.9645          0.9711   0.9700  0.9710  0.9692
GraphSage   0.9675          0.9712   0.9721  0.9727  0.9709
GAT         0.9647          0.9746   0.9703  0.9735  0.9708
Average     0.9662          0.9720   0.9718  0.9731
Variance    2.2e-6          2.2e-6   7.1e-7  2.3e-6
SD          0.0015          0.0015   0.0008  0.0015

Table 9: Ablation Study results showing the F1-Score of GraphCode2Vec. Best results are in bold.

Method Name Prediction:
Pre-training   Captured    GCN     GIN     GSAGE   GAT
Context        semantic    0.5454  0.4674  0.5038  0.6082
Context        syntactic   0.4575  0.4500  0.4644  0.4381
Node           semantic    0.4843  0.4136  0.4404  0.5888
Node           syntactic   0.3800  0.3845  0.3660  0.3560
VGAE           semantic    0.5988  0.4786  0.3675  0.5464
VGAE           syntactic   0.3922  0.4053  0.3936  0.4058
Best config.: Semantic = 11/12

Solution Classification:
Pre-training   Captured    GCN     GIN     GSAGE   GAT
Context        semantic    0.9698  0.9649  0.9682  0.9740
Context        syntactic   0.9614  0.9560  0.9588  0.9610
Node           semantic    0.9738  0.9711  0.9696  0.9704
Node           syntactic   0.9563  0.9562  0.9572  0.9595
VGAE           semantic    0.9725  0.9663  0.9671  0.9711
VGAE           syntactic   0.9711  0.9659  0.9626  0.9705
Best config.: Semantic = 12/12

Model Pre-training: We examine if the three pre-training strategies improve the effectiveness of GraphCode2Vec on downstream tasks, using two downstream tasks and all three pre-training strategies (node, context and VGAE) (see Table 8). We found that model pre-training improves the effectiveness of GraphCode2Vec across all tasks. Pre-training improves its effectiveness by up to 28%, on average. For instance, consider model pre-training with the VGAE strategy for method name prediction (see Table 8). This result implies that model pre-training improves the effectiveness of GraphCode2Vec on downstream SE tasks.

Model pre-training improves the effectiveness of GraphCode2Vec (by up to 28%, on average) across all tasks.

program feature(s). We employ probing analysis to analyze if pre-trained GraphCode2Vec models learn the lexical and semantic features required for feature-specific tasks, i.e., tasks that require capturing either or both features to be well-addressed. For instance, Task-4 is the concurrency classification task requiring semantic features. In addition, we conduct an ablation study to investigate how the syntactic and semantic information captured by GraphCode2Vec influence its effectiveness on downstream tasks. Finally, we evaluate the sensitivity of our approach to the selected GNN.
Probing Analysis: Let us examine if our pre-trained code embedding indeed encodes the desired lexical and semantic program features. To achieve this, we use the lexical embedding and the semantic embedding from GraphCode2Vec's pre-training as inputs for probing. In this probing analysis, only the classifier is trainable; GraphCode2Vec is frozen and non-trainable. We use a one-MLP-layer classifier to evaluate these models on four tasks. Task-1 requires only lexical/syntactic information, whereas Task-2 and Task-4 require only semantic information (program dependence). Finally, Task-3 subsumes Tasks 1 and 2, such that it requires both syntactic and semantic information.
Our evaluation results show that GraphCode2Vec's pre-trained code embedding mostly captures the desired lexical and semantic program features for all tested tasks, regardless of the pre-training strategy or GNN configuration. Table 7 highlights the effectiveness of each frozen pre-trained model for each task, configuration and pre-training strategy. Notably, the frozen pre-trained model performed best for the desired embedding for each task in three-quarters (36/48 = 75%) of all tested configurations. As an example, for tasks requiring semantic information (Task-2 and Task-4), our pre-trained model encoding only semantic information performed best for 88% of all configurations (21/24 cases). This result demonstrates that GraphCode2Vec effectively encodes either or both syntactic and semantic features, as evidenced by the effectiveness of models encoding the desired feature(s) for feature-specific tasks.
GraphCode2Vec effectively encodes the syntactic and/or semantic features; feature-specific models performed best in 75% of cases.

instance, there is a threat that GraphCode2Vec does not generalize to other (SE) tasks and other Java programs. To mitigate this threat, we have evaluated GraphCode2Vec using mature Java programs with varying sizes and complexity (see Table 3), as well as downstream tasks with varying complexities and requirements.
Internal Validity: This threat refers to the correctness of our implementation, i.e., whether we have correctly represented lexical and semantic features in our code embedding. We mitigate this threat by evaluating the validity of our implementation with probing analysis and ablation studies (see Section 5). We have also compared GraphCode2Vec to 7 baselines using four (4) major downstream tasks. In addition, we have conducted further analysis to test our implementation using different pre-training strategies and GNN configurations. We also provide our implementation, (pre-trained) models and experimental data for scrutiny, replication and reuse.
Construct Validity: This is the threat posed by our design/implementation choices and their implications on our findings. Notably, our choice of intermediate code representation (i.e., Jimple) instead of source code implies that our approach lacks natural language text (such as code comments) in the (pre-)training dataset. Indeed, GraphCode2Vec would not capture this information as is. However, it is possible to extend GraphCode2Vec to also capture natural language text. This can be achieved by performing lexical and program dependence analysis at the source code level.
Ablation Study: We investigate the impact of syntactic/lexical
embedding and semantic/dependence embedding on addressing
downstream tasks. Using method name prediction and solution
classification, we examine how removing lexical embedding or dependence embedding during the fine-tuning of GraphCode2Vec’s
pre-trained model impacts the effectiveness of the approach.
Our results show that GraphCode2Vec’s dependence embedding
is important to effectively address our downstream SE tasks. Table 9
presents the ablation study results. In particular, results show that
models fine-tuned with only semantic information outperformed
those fine-tuned with syntactic features in almost all (23/24 = 96%
of) cases. This result demonstrates the effectiveness of dependence
embedding in addressing downstream SE tasks.
7 CONCLUSION
In this paper, we have proposed GraphCode2Vec, a novel and
generic code embedding approach that captures both syntactic and
semantic program features. We have evaluated it in comparison to
the state-of-the-art generic code embedding approaches, as well as
specialised, task-specific learning based applications. Using seven
(7) baselines and four (4) major downstream SE tasks, we show
that GraphCode2Vec is stable and effectively applicable to several
downstream SE tasks, e.g., patch classification and solution classification. Moreover, we show that it indeed captures both lexical
and dependency features, and we demonstrate the importance of
generically embedding both features to solve downstream SE tasks.
In the future, we plan to address certain limitations of GraphCode2Vec. GraphCode2Vec does not capture dynamic information (e.g., program execution) and textual source code details (e.g.,
comments), both of which are important for some SE tasks. To
address the lack of dynamic code embedding, we plan to investigate
how GraphCode2Vec can effectively capture run-time information (such as program traces or code coverage) so it is applicable to
SE tasks that are dependent on dynamic information (e.g., testing,
debugging and program repair). Additionally, we plan to extend
GraphCode2Vec to account for textual source code information
such that it is amenable to tasks requiring such features (e.g., program search and code completion).
To encourage replication and reuse, we provide our prototypical
implementation of GraphCode2Vec and our experimental data:
Results show that dependence/semantic embedding is vital to the
effectiveness of GraphCode2Vec on downstream SE tasks.
GNN Sensitivity: This experiment evaluates the sensitivity of our
approach to the choice of GNN. Table 8 provides details of the GNN
sensitivity analysis, tasks and GNN configurations. To evaluate
this, we compute the variance and standard deviation (SD) of the
effectiveness of GraphCode2Vec when employing different GNNs.
Our evaluation results show that GraphCode2Vec is stable, it is
not highly sensitive to the choice of GNN. Table 8 shows the details of
the SD and variance of our approach for each GNN configuration.
Across all tasks, the variance and SD of the GraphCode2Vec is
mostly low, it is maximum 0.0064 and 0.0413, respectively.
GraphCode2Vec is stable across GNN configurations, the variance
and SD of its effectiveness are very low for all configurations.
https://github.com/graphcode2vec/graphcode2vec
6
THREATS TO VALIDITY
ACKNOWLEDGMENTS
External Validity: This refers to the generalizability of our approach
and results, especially beyond our data sets, tasks and models. For
This work is supported by the Luxembourg National Research Funds
(FNR) through the CORE project grant C17/IS/11686509/CODEMATES.