[Figure 1 (fragment): the "Code Finetuning" branch of the taxonomy. Adapted LM: Codex (Chen et al., 2021), PaLM Coder (Chowdhery et al., 2022), Minerva (Lewkowycz et al., 2022), PaLM 2* (Anil et al., 2023), Code LLaMA (Rozière et al., 2023). Instruction Finetuning: WizardCoder (Luo et al., 2023), PanGu-Coder2 (Shen et al., 2023), OctoCoder (Muennighoff et al., 2023), MFTCoder (Liu et al., 2023). Reinforcement Learning: CompCoder (Wang et al., 2022), CodeRL (Le et al., 2022), PPOCoder (Shojaee et al., 2023), RLTF (Liu et al., 2023).]
ing, including instruction tuning (Honovich et al., 2023; Xu et al., 2023; Luo et al., 2023), infilling objectives (Tay et al., 2023; Li et al., 2023; Rozière et al., 2023), recontemplation of scaling laws (Hoffmann et al., 2022; Gunasekar et al., 2023; Li et al., 2023), architectural improvements (Shazeer, 2019; Su et al., 2021; Dao et al., 2022), and autonomous agents (Qian et al., 2023; Hong et al., 2023), while in return SE requirements are providing real-world testbeds for these technologies and driving the development of LLMs forward into production. We believe a systematic review of these advancements would benefit both communities.

The rest of this work is organized following the taxonomy presented in Figure 1. In §2 we first provide the preliminaries of language modeling and Transformer models, and then in §3 we contextualize the evaluation of language models for code, highlighting the historical transition from various code understanding tasks to more practical text-to-code generation tasks. In §4 we discuss the plethora of LLMs that have demonstrated coding ability, and then in §5 we review the specialized and often smaller models by their architecture, with special attention on the recent application of infilling objectives, instruction tuning, reinforcement learning, and engineering improvements. Then, in §6, we discuss unique features of code that are not available to natural languages but have been utilized to aid code processing. In §7, we review the most recent integration between LLMs and software development, before finally concluding this work in §8 and highlighting the current challenges in code processing.

2 Background

In this section, we briefly review the preliminaries of Transformer-based language modeling, including common objectives for unidirectional and bidirectional models, as well as some popular models and designs in NLP.

2.1 Causal Language Modeling

Unidirectional language models (also known as causal language models²) factor the probability of a sentence into the product of each token's conditional probability with the chain rule. A piece of input text x = [x_1, x_2, ..., x_n] consisting of n tokens is modeled as

P(x) = ∏_{i=1}^{n} p_θ(x_i | x_{1:i−1}),    (1)

where x_{1:i−1} is a shorthand for the tokens before x_i in the input, and θ denotes the parameters of the model. With Transformer decoders such as GPT (Radford et al., 2018; Radford et al., 2019; Brown et al., 2020) and LLaMA (Touvron et al., 2023; Touvron et al., 2023), the conditional probability in (1) is modeled by adding an attention mask to the attention matrix of each Transformer block, ensuring that x_i can only attend to previous tokens. During training, the cross-entropy loss on all tokens in the input is calculated in parallel, while at inference time new tokens are generated autoregressively. For further details about the Transformer architecture we refer to Vaswani et al. (2017).

² The training objective of such language models is Causal Language Modeling (CLM), also referred to as Next Token Prediction.
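As a minimal sketch of this objective (not taken from any particular model's code), the following assumes a decoder that already produces per-position logits; it builds the causal mask described above and computes the cross-entropy loss on all positions in parallel:

```python
# Minimal sketch of CLM training, assuming a model that outputs next-token logits.
import torch
import torch.nn.functional as F

def causal_mask(n: int) -> torch.Tensor:
    # True above the diagonal marks positions a token is NOT allowed to attend to.
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

def clm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over all positions in parallel: logits[:, i] predicts tokens[:, i+1]."""
    vocab = logits.size(-1)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),  # predictions for positions 1..n-1
        tokens[:, 1:].reshape(-1),          # gold next tokens
    )

tokens = torch.randint(0, 100, (2, 8))  # batch of 2 sequences, 8 tokens each
logits = torch.randn(2, 8, 100)         # stand-in for a Transformer decoder's output
print(causal_mask(4))
print(clm_loss(logits, tokens))
```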
2.2 Masked Language Modeling

Unlike causal language models, bidirectional language models are trained to acquire a better contextual representation of text rather than to generate text autoregressively. In the vanilla Transformer, the encoder part is allowed to attend to a token's left as well as right context for this purpose. BERT (Devlin et al., 2019) takes one step further and trains only a Transformer encoder. A set M of randomly chosen tokens in the input is replaced by a special token [MASK] to obtain a noisy input x̂, for example [[CLS], x_1, [MASK], x_3, [MASK], x_5, [EOS]]³, and the model is trained to recover the original tokens by maximizing

∏_{m∈M} p_θ(m | x̂).    (2)

While this objective requires the model to have a deep understanding of the input text to reconstruct it, it suffers from low training efficiency, since only a small set of tokens (usually 15%) are masked (and thus "trained on"). To address this issue, Clark et al. (2020) proposed ELECTRA, which is instead trained to discriminate whether or not each token in the input has been replaced by a BERT-like model.

³ Both [CLS] and [EOS] are artificial tokens added to the input text. [CLS] is added at the beginning and its representation is used for sentence classification, while [EOS] indicates end of sentence. The original BERT also uses another special token [SEP], which is no longer in common use, and we refer to Devlin et al. (2019) for details.
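A toy sketch of the corruption step, assuming a generic [MASK] id and ignoring BERT's additional 80/10/10 replacement heuristic, illustrates why only the masked positions contribute to the loss:

```python
# Sketch of MLM input corruption: mask ~15% of tokens, train only on masked positions.
import torch
import torch.nn.functional as F

MASK_ID = 103   # hypothetical id of the [MASK] token
VOCAB = 1000

def mask_tokens(tokens: torch.Tensor, p: float = 0.15):
    chosen = torch.rand(tokens.shape) < p   # the randomly chosen set M
    noisy = tokens.clone()
    noisy[chosen] = MASK_ID                 # replace chosen tokens with [MASK]
    labels = tokens.clone()
    labels[~chosen] = -100                  # ignored by cross_entropy below
    return noisy, labels

tokens = torch.randint(0, VOCAB, (2, 16))
noisy, labels = mask_tokens(tokens)
logits = torch.randn(2, 16, VOCAB)          # stand-in for the encoder's output
loss = F.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=-100)
print(loss)
```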
2.3 Denoising Objectives

GPT-style causal LMs and BERT-style bidirectional LMs each have their own strengths and weaknesses. While GPT can be used for autoregressive generation, it lacks a bidirectional representation of the input text, and is thus unsuitable for sequence-to-sequence (seq2seq) generation tasks such as translation and summarization. BERT, on the other hand, can produce bidirectional representations, but is pretrained only for mask filling, not generation.

The vanilla Transformer encoder-decoder architecture combines the respective advantages of GPT and BERT. T5 (Raffel et al., 2020) is such a model pretrained with span corruption, which can be regarded as a variant of MLM. During pretraining, spans of text in the input are replaced with sentinel tokens, which play the same role as [MASK] in BERT. The noisy input is first processed by the encoder with bidirectional attention, and the masked spans are then generated autoregressively by the decoder. Formally, if k spans are sampled for corruption in input x, the noisy input x̂ is constructed by replacing each span with a special token <extra_id_i>, for i = 1, 2, ..., k, and the target y is constructed by concatenating all spans prepended with their corresponding sentinels: [<extra_id_1>, span_1, ..., <extra_id_k>, span_k]. The model is then trained with a standard seq2seq objective, by maximizing

p_θ(y | x̂) = ∏_{i=1}^{n_y} p_θ(y_i | x̂, y_{1:i−1}).    (3)

Lester et al. (2021) show that models pretrained with such objectives can be adapted for autoregressive language modeling with extra pretraining using the prefix language modeling objective, i.e. splitting the text into two parts, processing the first part with the encoder, and generating the second part with the decoder.

Tay et al. (2023) argue that span corruption is also closely related to CLM, since one can mask out the whole input text as a single span and train the decoder to generate it autoregressively. Inspired by this relation, they propose UL2, a combination of many span corruption objectives that differ in corruption rate and span length. Applying it to both encoder-decoder models and decoder-only models, they find that encoder-decoder models perform better under the same computation budget constraint. Other studies have also found that such encoder-decoder models generally perform better than causal decoder-only models (Wang et al., 2022; Soltan et al., 2022).
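A toy sketch of the sentinel construction described above (the span positions are chosen here by hand; T5 samples them randomly):

```python
# T5-style span corruption: spans are replaced by sentinels in the input, and the
# target concatenates each sentinel with the original span it replaced.
def corrupt_spans(tokens, spans):
    """spans: list of non-overlapping, sorted (start, end) index pairs."""
    noisy, target, prev = [], [], 0
    for i, (start, end) in enumerate(spans, start=1):
        sentinel = f"<extra_id_{i}>"
        noisy += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    noisy += tokens[prev:]
    return noisy, target

tokens = "def add ( a , b ) : return a + b".split()
noisy, target = corrupt_spans(tokens, [(1, 2), (8, 11)])
print(noisy)   # ['def', '<extra_id_1>', '(', 'a', ',', 'b', ')', ':', '<extra_id_2>', 'b']
print(target)  # ['<extra_id_1>', 'add', '<extra_id_2>', 'return', 'a', '+']
```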
2.4 Auxiliary Objectives

Language modeling objectives, such as the previously discussed CLM and MLM, mainly train the model to capture token-level information and are ineffective at modeling document structures. Thus, auxiliary objectives are often added to help the models learn such global information. BERT is pretrained with next sentence prediction (NSP) along with MLM, which is formulated as a binary classification task to predict whether two segments in the input are neighboring in the original corpus. Lan et al. (2020) propose a more challenging sentence-order prediction (SOP) task, where the negative samples are constructed by swapping the order of two neighboring sentences instead of sampling a random sentence from other documents.

Relatedly, Raffel et al. (2020) mix supervised downstream samples such as GLUE (Wang et al., 2018) into T5's pretraining dataset to conduct multi-task pretraining. However, it is worth noting that since they unify all tasks into a text-to-text format, the training objective is the same for their self-supervised pretraining and supervised downstream tasks.
2.5 Implementation Design

While most research on pretraining language models has focused on designing training objectives, the low-level implementation of the Transformer architecture itself has also been continuously improved over the years in pursuit of stability, performance, and efficiency.

The original Transformer block proposed by Vaswani et al. (2017) is formulated as

h = LN(Attention(x) + x),    (4)
y = LN(FFN(h) + h),    (5)

where x is the layer's input, y is the layer's output, "Attention" is the self-attention sublayer, "FFN" is the feed-forward sublayer, and "LN" is layer normalization (Ba et al., 2016).

GPT-2 (Radford et al., 2019) moves layer normalization to the input of each Transformer sub-block to stabilize training:

h = Attention(LN(x)) + x,    (6)
y = FFN(LN(h)) + h,    (7)

and such pre-norm has since become a standard practice in Transformer decoders.

GPT-J (Wang and Komatsuzaki, 2021) modifies the Transformer block to compute the FFN sub-layer and the self-attention sub-layer in parallel to increase computation throughput:

y = x + FFN(LN(x)) + Attention(LN(x)),    (8)

and Chowdhery et al. (2022) observe limited performance degradation when applying this design to larger models.
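The three block layouts in Eqs. (4)-(8) can be contrasted directly; the sketch below uses generic attention and feed-forward sub-layers (the projection-free attention call is a simplification, not any model's actual implementation):

```python
# Post-norm (Eqs. 4-5), pre-norm (Eqs. 6-7), and parallel (Eq. 8) Transformer blocks.
import torch
from torch import nn

d = 64
attn = lambda x: nn.functional.scaled_dot_product_attention(x, x, x, is_causal=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

def post_norm_block(x):          # original Transformer
    h = ln1(attn(x) + x)
    return ln2(ffn(h) + h)

def pre_norm_block(x):           # GPT-2 style
    h = x + attn(ln1(x))
    return h + ffn(ln2(h))

def parallel_block(x):           # GPT-J / PaLM style, sharing one LN as in Eq. (8)
    return x + ffn(ln1(x)) + attn(ln1(x))

x = torch.randn(2, 10, d)        # (batch, sequence, hidden)
for block in (post_norm_block, pre_norm_block, parallel_block):
    print(block.__name__, block(x).shape)
```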
PaLM (Chowdhery et al., 2022) introduces Rotary Position Embedding (RoPE) and Multi-Query Attention (MQA) into LLMs. RoPE (Su et al., 2021) multiplies the keys and queries of each self-attention layer by a position-dependent rotation matrix to inject position information, and is later shown to enable position interpolation for processing longer sequences (Chen et al., 2023; Rozière et al., 2023). As an alternative to RoPE, Press et al. (2022) propose ALiBi, which directly attenuates the attention scores according to the relative position between key and query. This position embedding scheme is later adopted by BLOOM (Scao et al., 2022).

Apart from position embeddings, another issue in the Transformer that has long troubled researchers is the fact that the complexity of self-attention scales quadratically with the input sequence length. Some works such as Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021), and cosFormer (Qin et al., 2022) use approximate attention to reduce this complexity, but they mostly come at the cost of degraded performance. Other works tackle this issue from an engineering point of view. MQA (Shazeer, 2019) shares the same set of keys and values across all attention heads to optimize the memory-to-arithmetic ratio, and significantly improves inference speed at small costs of model performance. Its variant Grouped-Query Attention (GQA, Ainslie et al., 2023) takes a middle-ground approach by dividing attention heads into groups and sharing the same set of keys/values within each group. Orthogonally, Dao et al. (2022) introduce FlashAttention, an exact but improved implementation of self-attention that optimizes IO operations on the accelerating device via tiling to improve memory efficiency.
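The key/value sharing behind MQA and GQA can be sketched as follows (shapes are assumptions for illustration, not any model's configuration); with one key/value head it reduces to MQA, and with equal head counts it recovers standard multi-head attention:

```python
# Grouped-query attention: a few key/value heads are shared across many query heads.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    assert n_heads % n_kv_heads == 0
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each k/v head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

b, s, hd = 2, 16, 64
q = torch.randn(b, 8, s, hd)
k = torch.randn(b, 2, s, hd)   # only 2 kv heads need to be kept in the KV cache
v = torch.randn(b, 2, s, hd)
print(grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=2).shape)  # (2, 8, 16, 64)
```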
[Figure: HumanEval pass@1 of recent code models across sizes, including WizardCoder (16B and 34B), PanGu-Coder 2, CodeFuse 34B, and GPT-4.]
[Figure 3 appears here: a taxonomy of code evaluation tasks, listing representative works for each. The tasks covered are code synthesis (with representative benchmarks such as CONCODE, APPS, HumanEval, MBPP, BIG-Bench, AixBench, MultiPL-E, DS-1000, CoderEval, HumanEval-X, HumanEval+, and CodeApex), text-to-SQL, math programming, code search, code completion, code translation, code repair, cloze test, code infilling, obfuscation/deobfuscation, unit test generation, assertion generation, mutant generation, fuzzing, and type prediction, with the cited works color-coded by method type as described in the caption.]

Figure 3: Evaluation tasks for code processing, to be continued in Figure 4. Black: non-neural methods. Red: non-Transformer neural methods (such as LSTM). Orange: Transformer encoder based methods (such as BERT). Violet: Transformer based seq2seq methods (such as T5). Blue: Transformer decoder based methods (such as GPT). Gray: other Transformer-based methods (such as UniLM). For code synthesis we only list several representative benchmarks here, and refer to §4 and §5 for more details. We note that here "method" differs from "target". For example, Pearce et al. (2022) examine the code generated by GitHub Copilot for vulnerabilities, but the method they use is non-neural.
by directly generating the source code in the autoregressive language modeling style, even without task-specific finetuning (Chen et al., 2021). We discuss this task in more detail in §3.3.

- Text-to-SQL is a special (and arguably easier) case of code synthesis, where the model is tasked to generate SQL commands from natural language queries. It has been a topic of special interest due to SQL's structured nature (when compared with general-purpose languages such as Python and C) and its wide application in data management. We refer to Kumar et al. (2022) and Deng et al. (2022) for surveys on this topic.

- Math programming is also a special case of code synthesis, where a language model is required to solve mathematical reasoning problems by generating code that is then executed by an external interpreter. This task abstracts the reasoning process away from numerical calculations, and is thus of special interest in evaluating LLMs.
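As a toy illustration of this setting (the problem and program are made up, not drawn from any of the cited benchmarks), the model emits a program instead of a final number, and the harness executes it:

```python
# Math programming: the LM's output is a program; an external interpreter runs it.
model_output = '''
def solution():
    apples = 23          # reasoning steps are written as code
    eaten_per_day = 4
    days = 3
    return apples - eaten_per_day * days
'''

namespace = {}
exec(model_output, namespace)      # the "external interpreter" step
print(namespace["solution"]())     # 11
```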
3.1.2 Code-to-Code

Code-to-code tasks take code as input, and output code.

- Code search is a task similar to code retrieval, and differs from the latter only in that the input is an existing code snippet, often in a different programming language from the target.

- Code completion aims to complete a piece of code given its prefix. This is essentially language modeling applied to code, and related technologies have been progressively introduced: n-gram, RNN, and Transformer. However, due to the structured nature of programming languages, many early works found grammar-aided statistical models to perform better (Bielik et al., 2016; Hellendoorn and Devanbu, 2017), and neural models only became dominant after 2018 (see Figure 3 for an intuitive overview).

- Code translation aims to translate a piece of code (usually a function or method) into another programming language. The relation between code translation and cross-lingual code search is similar to the one between code synthesis and text-to-code retrieval, and SMT/NMT models have also been widely applied to this task. Unlike code synthesis, which is useful for aiding programmers in writing snippets of code, code translation is an important technique for migrating old projects written in obsolete languages. However, we are yet to witness such applications, as the context window of even the most powerful language models is quite limited in the face of such projects.

- Code repair, also known as bug fixing, aims to fix a piece of buggy code. Like code translation, it is a traditional sequence-to-sequence generation task, and surveys are abundant on this topic (Gazzola et al., 2018; Monperrus, 2018; Zhong et al., 2022; Zhang et al., 2023; Huang et al., 2023).

- Cloze test is a recently proposed task for code processing, following the rise of BERT-style pretraining. Due to the unique semantics of programming languages, several keywords are often selected for this test, such as min and max (Feng et al., 2020).

- Code infilling is another recently proposed task, after fill-in-the-middle pretraining (Bavarian et al., 2022) became popular. It is a generalization of code completion, where not only the left context but also the right context is given. It differs from the cloze test in that the target of a cloze test is a single token, while the target of code infilling can be an entire line or even multiple lines, which requires a decoder to generate autoregressively (a sketch of the input format follows).
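The sketch below assumes prefix-suffix-middle style sentinel strings for illustration; actual models define their own special tokens for this rearrangement:

```python
# Fill-in-the-middle prompting: prefix and suffix are given, the middle is generated.
def to_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prefix = "def clamp(x, lo, hi):\n    "
suffix = "\n    return x"
prompt = to_fim_prompt(prefix, suffix)
# A decoder-only model trained with FIM continues this prompt with the missing middle,
# e.g. "if x < lo:\n        return lo\n    if x > hi:\n        return hi", followed by
# an end-of-middle token.
print(prompt)
```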
- Obfuscation refers to the process of renaming identifiers (e.g. variables, methods, and classes), for example to generic names like var_1, var_2 or x, y. It is an important technique in virus detection, intellectual property protection, and code size reduction (Collberg and Thomborson, 2002; Murad et al., 2010; Vasilescu et al., 2017). Deobfuscation refers to the reverse process, where meaningful identifier names are recovered from obfuscated programs. Obfuscation has seen few applications of language models, as it can be easily achieved statically, but deobfuscation has been a subject of more interest in recent years, and has been adopted as a pretraining objective for code language models (Lachaux et al., 2021; Ding et al., 2022).

- Unit test generation aims to generate unit tests for a given program. Prior to the rise of Codex and other code LLMs, almost all works in this area employed non-neural methods (see Figure 3). In the age of LLMs, however, this task is ever more important, as research has shown that the current unit tests for evaluating LLMs' program synthesis capability may be insufficient (Liu et al., 2023).

- Assertion generation is a task closely related to unit testing. Given a program and a set of unit tests, this task aims to generate assertions (also known as oracles in software engineering) to evaluate the program using the unit tests. This task has generally gone unnoticed by the NLP community, as the program synthesis task used for evaluating LLMs often concerns standalone, competition-style methods, for which a simple assertion of equality between program output and expected answer suffices.

- Mutant generation aims to generate mutants of a given program for the purpose of mutation testing, and relates closely to unit test generation and assertion generation. A mutant that is not detected by a given set of unit tests and assertions indicates that either additional test cases or better assertions are required (Fraser and Arcuri, 2011). Recently, masking out tokens in the source code and sampling replacements from the output of a masked language model has become a common method for this task (a minimal sketch follows this paragraph). Papadakis et al. (2019) provides a survey on this topic.
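A hedged sketch of that MLM-based approach; the checkpoint name is an assumption, and any fill-mask model trained on code could be substituted:

```python
# Mutant generation by masking a token and sampling replacements from a masked code LM.
from transformers import pipeline

fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")  # assumed checkpoint
code = f"if x {fill.tokenizer.mask_token} 0:\n    return -x"
for candidate in fill(code, top_k=3):
    print(candidate["token_str"], "->", candidate["sequence"])
# Replacements such as "<", "<=", or ">" yield syntactically valid mutants; mutants that
# survive the test suite point to missing test cases or weak assertions.
```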
- Fuzzing aims to mutate a given set of unit tests to generate new test cases, and is another task related to testing software.
[Figure 4 appears here: it continues the taxonomy with code-to-text tasks (code summarization, code review, identifier prediction, and others such as pseudo code generation, API mining, and logic-to-text), code-to-patterns tasks (defect detection, clone detection, code classification, code reasoning, and others such as author identification and markdown ordering), and text-to-text tasks (document translation, log parsing), listing representative works for each, color-coded by method type as described in the caption.]

Figure 4: Evaluation tasks for code processing, continued from Figure 3. Black: non-neural methods. Red: non-Transformer neural methods. Orange: Transformer encoder based methods. Violet: Transformer based seq2seq methods. Blue: Transformer decoder based methods. Gray: other Transformer-based methods. We note that here "method" differs from "target". For example, Pearce et al. (2022) examine the code generated by GitHub Copilot for vulnerabilities, but the method they use is non-neural.
While many recent works on fuzzing target deep learning libraries, few have utilized language models to conduct this process (see Figure 3).

- Type prediction aims to predict types for dynamically typed programming languages such as Python and JavaScript. It has been used as a pretraining objective for code language models (Wang et al., 2022), where it is often simplified as a binary tagging task to predict which tokens in the code are identifiers (Wang et al., 2021; Wang et al., 2021).

3.1.3 Code-to-Text

Code-to-text tasks take code as input, and output text.

- Code summarization, also referred to as docstring generation, aims to generate a natural language description for a given piece of code (often a function or method). This is the opposite of code synthesis, and SMT/NMT techniques have likewise been applied. Zhang et al. (2022) provides a survey on this topic.

- Code review aims to automate the process of peer code review, and may come in many forms. Many early works formulated it as a binary classification task to accept or reject changes at commit time, while others utilized information retrieval technologies to recommend comments from a pool of existing reviews. As generative models became more capable, however, researchers have also studied directly generating review comments as a sequence-to-sequence learning task.

- Identifier prediction is the task of predicting identifier names in the code. As these names are deemed to contain important semantic information, this task has been utilized for code summarization (Allamanis et al., 2016), as well as for pretraining code models (Wang et al., 2021; Niu et al., 2022).

3.1.4 Code-to-Patterns

Code-to-patterns tasks conduct classification on code.

- Defect detection predicts whether the input code is buggy or not, and is a standard single-sentence classification task.

- Clone detection predicts whether or not two pieces of code are clones of each other. In software engineering there exist four types of code clones, and the most challenging type to identify is semantic clones, i.e. syntactically dissimilar code that has the same functionality. As this task can be viewed as a two-sentence classification task, BERT-style language models have been widely applied to it.
- Code classification, popularized by Mou et al. (2016), aims to predict the functionality of a piece of code within a predefined set of labels. A very similar task is author identification, which predicts the author of the input code. Both tasks are standard single-sentence classification tasks.

- Code reasoning is a recently introduced task for evaluating LLMs, and often comes as a subset of general evaluation benchmarks such as MMLU (Hendrycks et al., 2021). This task requires the model to reason about code or algorithms and answer related questions, which are written in multiple-choice form and may range from conceptual understanding to numerical calculation and complexity analysis.

3.1.5 Text-to-Text

Text-to-text tasks take text as input, and output text.

- Document translation is the automatic translation of code-related documents. Since models, datasets, and prompting strategies for machine translation are abundant in NLP (Vaswani et al., 2017; Goyal et al., 2022; He et al., 2023), we do not go into detail about this task.

- Log parsing aims to analyze the system logs produced by software products, for example parsing logs into structured templates or finding anomalies in raw logs. Zhu et al. (2019) provide a survey on traditional methods for this task up to 2018, while Zhang et al. (2023) also cover more recent methods.

3.1.6 NLP Point-of-View

Among the previously listed tasks, code synthesis, code translation, code repair, deobfuscation, unit test generation, assertion generation, mutant generation, fuzzing, code summarization, code review, and identifier prediction are sequence-to-sequence generation tasks. Formally, each instance of these tasks has a source sequence x (e.g. a piece of source code) and a target sequence y (e.g. its corresponding summarization), and the language model is tasked to maximize the conditional probability given by (3), where θ can be either a decoder-only model or an encoder-decoder model. In the former case, x and y are concatenated. In the latter case, x is processed by the encoder and y is processed by the decoder.

Code completion and code infilling are also generation tasks, and correspond exactly to the two pretraining objectives given in (1) and (3), except that for code infilling only one span in the input is masked. Similarly, cloze test is an understanding task in the same form as (2).

Defect detection, clone detection, code classification, and type prediction are sequence classification tasks. In these tasks, a set of labels Y is defined over the input, and each instance is assigned a label y ∈ Y (e.g. for defect detection Y = {0, 1}, while for type prediction a possible Y is {int, float, string, bool, others}). The model is then tasked to maximize

p_θ(y | x).    (9)

The last two tasks - code retrieval and code search - also belong to understanding tasks. In these tasks, each source sequence x is paired with a positive target sequence y and a set of negative targets ȳ ∈ {y_1, ..., y_k}. The model's task is to find a similarity metric s such that s(x, y) is larger than s(x, ȳ).

3.2 Evaluation Metrics

Of the tasks mentioned in §3.1, the understanding tasks are similar in form to natural language understanding tasks (Wang et al., 2018; Wang et al., 2019) and are evaluated likewise by metrics such as accuracy, F1, and Mean Reciprocal Rank (MRR), while short generation tasks such as identifier prediction are also evaluated by the accuracy of exact matches. Code-to-text tasks are evaluated with common metrics for text generation such as BLEU (Papineni et al., 2002).

Evaluation of tasks involving code generation, on the other hand, is more complicated. Most early works evaluated syntactical correctness, i.e. the percentage of generations that can be successfully parsed. Chen et al. (2018) argued against such metrics and suggested reference match instead, which is the percentage of generations that are exactly the same as the references. Ren et al. (2020) proposed CodeBLEU, a variant of BLEU that takes code syntax and semantics into account by evaluating the overlap of abstract syntax trees (AST) and data flow. As code generation models became more capable over the years, however, these metrics based on content overlap have been found to be inadequate (Rozière et al., 2020; Hendrycks et al., 2021; Austin et al., 2021), since functionally equivalent snippets of code can differ dramatically in their lexical forms. Consequently, researchers have turned their attention to functional correctness. One popular example of such metrics is pass@k, proposed by Kulal et al. (2019) and refined by Chen et al. (2021), which is an unbiased estimator of the model's chance of passing all unit tests of a program with any of k generated samples. This metric can be generalized to passn@k (Li et al., 2022), which limits the number of model submissions to n but allows filtering by the unit tests given in the input from k samples.
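The estimator follows the numerically stable form given by Chen et al. (2021): with n samples per problem of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k):

```python
# Unbiased pass@k estimator for one problem with n samples, c of which pass the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                      # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))      # 0.05
print(pass_at_k(n=200, c=10, k=100))
```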
3.3 Program Synthesis

As code models advanced over the years, researchers have gradually turned their attention to the practical task of program synthesis. CONCODE (Iyer et al., 2018) is one of the early datasets in this area, which includes more than 100K Java methods and is incorporated as a subset of the CodeXGLUE benchmark (Lu et al., 2021). Since 2021, the community has witnessed an abundance of datasets for this task. Most of them, including APPS (Hendrycks et al., 2021), HumanEval (Chen et al., 2021), and MBPP (Austin et al., 2021), focus on Python, but recent works have also extended HumanEval to other programming languages (Cassano et al., 2023; Zheng et al., 2023; Muennighoff et al., 2023). DS-1000 is a more realistic Python dataset that focuses on data science libraries such as NumPy and SciPy, while several math reasoning benchmarks have also been converted into programming tasks, including MathQA-Python (Amini et al., 2019; Austin et al., 2021) and GSM8K-Python (Cobbe et al., 2021; Chowdhery et al., 2022; Wang et al., 2023).

3.4 Repository-Level Evaluation

Most evaluation tasks discussed in §3.1 and Figure 3 are limited to a single file or even a single function, as cross-file code modeling poses challenges that are beyond the capability of most existing language models. Recently, however, position interpolation techniques (Chen et al., 2023; Rozière et al., 2023; Peng et al., 2023) have extended the context window of LLMs to hundreds of thousands of tokens, making it possible to contextualize the evaluation of code modeling within entire repositories. Several works (Shrivastava et al., 2023; Ding et al., 2022; Zhang et al., 2023; Shrivastava et al., 2023) have studied code completion leveraging repository-level context, and Liu et al. (2023) propose RepoBench to evaluate such systems. More recently, Bairi et al. (2023) investigate the more challenging tasks of repository-level API migration and temporal editing, and Jimenez et al. (2023) introduce a corresponding benchmark, SWE-bench.

4 General Language Models for Code

Since language models scaled to hundreds of billions of parameters (Brown et al., 2020; Chowdhery et al., 2022), many of them have demonstrated non-trivial coding capability, even if they are not specifically designed or trained for code. Pioneered by Codex, researchers have also found continual pretraining on code to significantly benefit language models' performance on code⁴.

⁴ While some works refer to this process as "finetuning on code", it is still self-supervised in nature. Thus we choose to adopt the term "extra/additional/continual pretraining" in this work to avoid confusion with supervised in-task finetuning or instruction finetuning.

4.1 Off-the-Shelf Language Models

Large language models are often pretrained on trillions of tokens following the scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022), and such an amount of text data is often a diverse composite with a non-negligible part of code. The Pile (Gao et al., 2021), for example, includes 95GB of code crawled from GitHub out of its 800GB raw dataset, while the multilingual pretraining dataset ROOTS (Laurençon et al., 2022) also contains 163GB of code spanning 13 programming languages in its 1.6TB compound. As two of the largest open-source pretraining datasets, they have supported many language models with coding ability. GPT-J (Wang and Komatsuzaki, 2021), for example, is reported by Chen et al. (2021) to demonstrate non-trivial performance on HumanEval, while Scao et al. (2022) report similar results for GPT-NeoX (Black et al., 2022) and BLOOM. LLaMA (Touvron et al., 2023), whose pretraining dataset includes 328GB of code from GitHub, achieves 23.7 pass@1 on HumanEval, and its successor LLaMA 2 (Touvron et al., 2023) achieves an even higher score of 29.9.

Closed-source models, on the other hand, perform generally better. LaMDA (Thoppilan et al., 2022) and PaLM (Chowdhery et al., 2022), whose pretraining datasets contain 12.5% and 5% code respectively, achieve 14.0 and 26.2 pass@1 on HumanEval, while GPT-4 (OpenAI, 2023) set a staggering record of 67.0 (and an early version is reported by Bubeck et al. (2023) to be 82) that until recently remained higher than any specialized model pretrained or instruction-finetuned for code.
More recently, the general trend has been to train smaller models on larger datasets, following the revised scaling law (Hoffmann et al., 2022). Baichuan 2 (Yang et al., 2023), for example, is a 13B model trained on 2.6T tokens, while Qwen (Bai et al., 2023) is a 14B model trained on 3T tokens. They achieve 17.1 and 32.3 pass@1 on HumanEval, respectively. Li et al. (2023), however, demonstrate that models as small as 1.3B can acquire coding capability that is comparable to much larger models while also maintaining reasonable performance on general text processing and even manifesting some emergent abilities (Wei et al., 2022) such as chain-of-thought reasoning (Wei et al., 2022). Their model, Phi-1.5, is trained on 21B tokens of textbook data generated by ChatGPT and 100B tokens of filtered web data from Stack Overflow and Refined Web (Penedo et al., 2023), and attains 41.4 pass@1 on HumanEval. The exact performance of these models is presented in Table 1.

                    HumanEval (0-shot)      MBPP (3-shot)
Model               k=1        k=100        k=1      k=80
GPT-J^a             11.6       27.7         -        -
LaMDA^b,c           14.0       47.3         14.8     62.4
PaLM^b              26.2       76.2         36.8     75.0
GPT-NeoX^d          15.4       41.2         -        -
BLOOM^d             15.5       55.5         -        -
LLaMA^e             23.7       79.3         37.7     76.8
GPT-4               67.0^f/82^g  -          -        -
LLaMA 2^h           29.9       89.0         45.0     81.5
Phi-1.5^i           41.4       -            43.5     -
Baichuan 2^j        17.1       -            30.2     -
Qwen^k              32.3       -            40.8     -
Codex^a             28.8       72.3         -        -
PaLM-Coder^b        36.0       88.4         47.0     80.8
PaLM 2-S*^l         37.6       88.4         50.0     86.8
Code LLaMA^m        53.7       94.7         56.2     -
CodeFuse^n          74.4       -            61.0     -

Table 1: Pass@k performance of raw language models (top) and language models with extra training on code (bottom) on HumanEval (0-shot) and MBPP (3-shot), ordered chronologically. For Phi-1.5 we consider the Phi-1.5-web version, and for Code LLaMA we consider its Python version. ^a Chen et al. (2021); ^b Chowdhery et al. (2022); ^c Austin et al. (2021); ^d Scao et al. (2022); ^e Touvron et al. (2023); ^f OpenAI (2023); ^g Bubeck et al. (2023); ^h Touvron et al. (2023); ^i Li et al. (2023); ^j Yang et al. (2023); ^k Bai et al. (2023); ^l Anil et al. (2023); ^m Rozière et al. (2023); ^n Liu et al. (2023).

4.2 Language Models with Additional Pretraining on Code

Along with the seminal benchmark HumanEval, Chen et al. (2021) kick-started the age of LLMs for code with Codex, a set of GPT-3 checkpoints pretrained on 100B additional code tokens and one of the earliest multi-billion-parameter models for code. Following their work, other researchers have also specialized their LLMs on code with additional pretraining. Chowdhery et al. (2022) train PaLM on 7.8B additional code tokens to obtain PaLM-Coder, setting new state-of-the-art results on HumanEval and MBPP (Table 1) that were only broken later by its successor PaLM 2-S*, the smallest version of PaLM 2 (Anil et al., 2023) further trained on an undisclosed amount of code. Similarly, Lewkowycz et al. (2022) train PaLM on 38.5B tokens of arXiv papers and mathematical content, while Rozière et al. (2023) train LLaMA 2 (Touvron et al., 2023) on more than 500B code tokens to acquire Code LLaMA, whose performance on HumanEval surpasses all previous LMs except GPT-4 (Table 1). Liu et al. (2023) further train Code LLaMA with multi-task finetuning (MFT) to introduce CodeFuse-CodeLLaMA, achieving 74.4 pass@1 on HumanEval and surpassing even the performance of GPT-4 published in OpenAI (2023).

While almost all of these models are Transformer decoders pretrained with CLM, several architectural modifications have been introduced along the way, as we noted in §2.5. All these models use pre-norm, and GPT-J introduces parallel attention, which is later adopted by PaLM, GPT-NeoX, and Phi-1.5. PaLM introduces MQA and RoPE into LLMs, and RoPE is now employed by most language models, including GPT-NeoX, two generations of LLaMA, Qwen, and the 7B version of Baichuan 2. BLOOM and the 13B version of Baichuan 2, however, use ALiBi for position embeddings, while LLaMA 2 and Code LLaMA adopt GQA instead of MHA or MQA. In §5, we show that specialized models pretrained exclusively on code have also followed these advancements closely.

5 Specialized Language Models for Code

As pretrained Transformers such as GPT and BERT achieved remarkable success in natural language processing, such model architectures, learning paradigms, and training objectives were soon adopted by the software engineering community to produce specialized models for code understanding and generation.
In this section, we first review common datasets used for pretraining code language models (§5.1), and then dive into the complex family of code LMs by their model architecture: encoder-only models (§5.2), encoder-decoder models (§5.3), decoder-only models (§5.4), UniLM (§5.5), and diffusion models (§5.6). Lastly, in §5.7 we also illustrate the current trend of applying more recent techniques in NLP, such as instruction tuning (Wei et al., 2022; Sanh et al., 2022; Chung et al., 2022) and reinforcement learning (Ouyang et al., 2022), to code processing. An overview of these pretrained models is provided in Table 3.
5.1 Training Dataset for Code

While text data for pretraining language models is often crawled from the web and must undergo meticulous and often aggressive preprocessing (Raffel et al., 2020), code data comes naturally as whole documents from public GitHub repositories. Even better, it comes with readily available quality indicators such as the count of stars or forks (although Allal et al. (2023) suggest that star count correlates poorly with downstream performance). As a result, many large-scale code pretraining datasets have been introduced, including CodeSearchNet (Husain et al., 2019), CodeParrot (Tunstall et al., 2022), and the Stack (Kocetkov et al., 2022), totaling 20GB, 50GB, and 3TB of code documents respectively (Table 2).

Dataset            Size (GB)   Files (M)   # PL
CodeSearchNet^a    20          6.5         6
The Pile^b,c       95          19          -
CodeParrot^d       1K          115         30
The Stack^e        3136        317         30
ROOTS^f            163         15          13

Table 2: Statistics of several pretraining datasets for code models: size in bytes, number of files, and number of programming languages. In CodeSearchNet each file is a function. For the Pile and ROOTS we only consider their code composite. ^a Husain et al. (2019); ^b Gao et al. (2021); ^c Biderman et al. (2022); ^d https://huggingface.co/datasets/codeparrot/github-code; ^e Kocetkov et al. (2022); ^f Laurençon et al. (2022).
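As a hedged sketch of working with such corpora, the following streams a slice of the Stack through the Hugging Face datasets library; the dataset id, the per-language directory layout, and the "content" field name are assumptions based on the Stack's public release and may require authentication or adjustment:

```python
# Streaming a code pretraining corpus instead of downloading the multi-TB archive.
from itertools import islice
from datasets import load_dataset

stack_python = load_dataset(
    "bigcode/the-stack-dedup",   # assumed dataset id (gated on the Hugging Face Hub)
    data_dir="data/python",      # assumed per-language directory layout
    split="train",
    streaming=True,
)
for example in islice(stack_python, 3):
    print(example["content"][:80])   # "content" is assumed to hold the raw file text
```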
While these datasets are meant for training code models, it should be noted that code is ultimately a special form of natural language, as the vocabulary of most programming languages is a small subset of English. Besides, high-quality code is often interleaved with natural language comments or documentation, which also enables models to acquire certain knowledge of general text representation. In fact, of the 6.5M functions in CodeSearchNet, 2.3M are paired with natural language documentation, allowing models to train explicitly on such bimodal data.

Compared with natural language, another byproduct of scraping code from GitHub is commit histories, which consist of code before the commit, code after the commit, and a short message describing the commit, which can loosely serve as an instruction for language models. Muennighoff et al. (2023) utilize this feature and construct a 2GB dataset, CommitPackFT, containing 742K samples of instruction data for code, obviating the need for the extensive human labor that is required to construct natural language instructions (Sanh et al., 2022; Wang et al., 2022).

Apart from bimodal training and instruction finetuning, another recent trend in constructing code datasets is synthesizing data with powerful models such as ChatGPT. While this method was originally proposed for generating instruction data in natural language (Wang et al., 2023; Honovich et al., 2023), Gunasekar et al. (2023) go one step further and synthesize 1B tokens of Python textbooks and coding exercises to pretrain a 1.3B model, achieving state-of-the-art results on HumanEval that are comparable to much larger models trained on significantly larger datasets.

5.2 Encoders

Pretrained Transformer encoders such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020) have attained impressive results on natural language understanding tasks, and these methods were soon introduced into code processing after their advent. Kanade et al. (2020) replicate the training procedure of BERT on a code corpus to produce CuBERT, showcasing its superior performance over LSTM (Hochreiter and Schmidhuber, 1997) and non-pretrained Transformers. Feng et al. (2020), on the other hand, train CodeBERT with MLM and ELECTRA's RTD on CodeSearchNet.
Date Model Arch. Size Vocab Context PE Atten. Type Parallel Atten. Pre-Norm Flash Atten. Init. from Objectives Dataset Training PL Inst.
2019-12 CuBERT BERT 350M 50K 1024 absolute MHA - MLM + NSP 9.3B 93B 1 Google
2020-02 CodeBERT RoBERTa 125M 50K 512 absolute MHA RoBERTa MLM + RTD 20GB 105B 6 Microsoft
2020-09 GraphCodeBERT RoBERTa 125M 50K 640 absolute MHA CodeBERT MLM + Edge Prediction + Node Alignment 20GB 131B 6 Microsoft
2021-08 SynCoBERT RoBERTa 125M 50K 512 absolute MHA CodeBERT MLM + IP + AST Edge Prediction + CL 20GB 7B 6 Huawei
2021-10 DISCO BERT 100M 20K 512 absolute MHA - MLM + Node Type MLM + CL 1.8GB 2 Columbia & IBM
2022-05 Code-MVP RoBERTa 125M 50K 512 absolute MHA GraphCodeBERT MLM + Type Inference + CL 2GB 39B 1 Huawei
2020-05 GPT-C GPT-2 374M 60K 1024 absolute MHA ! - CLM 11B 270B 4 Microsoft
2021-02 CodeGPT GPT-2 124M 50K 1024 absolute MHA ! GPT-2 CLM 2GB 1 Microsoft
2022-02 PolyCoder GPT-2 160M-2.7B 50K 2048 absolute MHA ! - CLM 254GB 39B 12 CMU
2022-03 CodeGen-Multi(Mono) GPT-3 350M-16.1B 50K 2048 RoPE MHA ! ! - CLM 1.6TB(1.8TB) / 1T(1.2T) 506B(577B) 6(1) Salesforce
2022-04 InCoder GPT-3 6.7B 50K 2048 Cosine MHA ! - Causal Masking 204GB 52B 28 Meta
2022-06 PyCodeGPT GPT-Neo 110M 32K 1024 absolute MHA ! - CLM 96GB 100B 1 Microsoft
2022-07 PanGu-Coder PanGu-α 317M-2.6B 42K 1024 absolute MHA ! - CLM 147GB 230B 1 Huawei
2023-01 SantaCoder GPT-2 1.1B 49K 2048 absolute MQA ! - FIM 268GB 236B 3 BigCode
2023-03 CodeGeeX PanGu-α 13B 52K 2048 absolute MHA ! - CLM 158B 850B 23 Tsinghua
2023-05 StarCoder GPT-2 15.5B 49K 8192 absolute MQA ! ! - FIM 815GB 1T 86 BigCode
2023-06 Phi-1 GPT-J 1.3B 51K 2048 RoPE MHA ! ! ! - CLM 7B 53B 1 Microsoft
2023-10 CodeFuse GPT-J 350M-13B 101K 4096 RoPE MHA ! ! ! - CLM 1.6TB / 1T 40+ Ant Group
2023-10 CodeShell GPT-2 7B 70K 8192 RoPE GQA ! - CLM 500B Peking U.
2020-10 PyMT5 GPT-2 374M 50K 1024+1024 absolute MHA ! - SC 27GB 1 Microsoft
2021-02 Mastropaolo et al. T5 60M 32k 512+512 T5 MHA ! - SC 1GB 1 Università della Svizzera italiana
2021-02 DOBF 250M 50K 512+512 absolute MHA - MLM + Deobfuscation 45GB 2 Meta
2021-03 PLBART BART 140M 50K 1024+1024 absolute MHA - DAE 655GB / 71B 210B 2 UCLA & Columbia
2021-09 CodeT5 T5 60M-220M 32K 512+256 T5 MHA ! - SC + IP + Masked IP + Text2Code + Code2Text ∼25GB 8 Salesforce
2022-01 SPT-Code BART 262M 80K 512+512 absolute MHA - NSP + SC + Method Name Prediction 20GB 6 Nanjing U.
2022-02 AlphaCode 300M-41B 8K 1536+768 MQA - MLM + CLM 715GB 967B 13 DeepMind
2022-06 NatGen T5 220M 32K 512+256 T5 MHA ! CodeT5 Naturalization ∼26GB 14B 8 Columbia & UC Davis
2023-05 CodeT5+ T5/GPT-3 220M-16B 50K 2048+2048 absolute MHA ! ! CodeGen-mono SC + CLM + CL + Text2Code + Code2Text 52B 9 Salesforce
2020-12 CugLM BERT 51M 50K 128 absolute MHA - MLM + NSP + CLM 8M 1.2B 2 Peking U.
2022-03 UniXcoder RoBERTa 125M 51K 1024 absolute MHA - MLM + CLM + SC + CL + Code2Text 20GB+ 839B 6 Microsoft
Table 3: An overview of pretrained code language models’ architecture and training details: their base architecture,
model size, vocabulary, context length, position embedding, attention type (Multi-Head Attention (Vaswani
et al., 2017), Multi-Query Attention (Shazeer, 2019), or Grouped-Query Attention (Ainslie et al., 2023)), layer
normalization type (post-norm or pre-norm), usage of FlashAttention (Dao et al., 2022), training initialization,
objectives, dataset size (either in disk size, measured by GB/TB, or in number of tokens, measured by B/T),
tokens seen during training, supported number of programming languages, and institute. We note that the number
of training tokens does not count the training tokens of the model used for initialization, if any. The common
training objectives are: MLM (Masked Language Modeling), NSP (Next Sentence Prediction), RTD (Replaced
Token Detection), IP (Identifier Prediction), CL (Contrastive Learning), SC (Span Corruption), DAE (Denoising
Auto-Encoding). Missing information (such as AlphaCode’s position embedding type) is left as blank.
They also utilize the explicit text-code pairs in CodeSearchNet, using them respectively as the first and second segment in BERT's input. When using CodeBERT to initialize the encoder part of a vanilla Transformer for sequence-to-sequence generation tasks such as code summarization, they observe a moderate performance gain over non-pretrained baselines.

Apart from these standard training objectives, many auxiliary objectives specifically designed for code have also been introduced. GraphCodeBERT (Guo et al., 2021) and SynCoBERT (Wang et al., 2021) both extract graphs from the source code (data flow graphs and abstract syntax trees, respectively) and train the models to predict the topological relations between the nodes, while SynCoBERT and Code-MVP (Wang et al., 2022) also add type inference to their pretraining stage in the form of tagging. Another common objective is contrastive learning: SynCoBERT and Code-MVP contrast different views of the input (such as code, comment, AST, and transformed code), while DISCO (Ding et al., 2022) constructs positive sample pairs by semantic-preserving transformations such as obfuscation, and negative pairs by injecting artificial bugs.
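A generic InfoNCE-style sketch of such contrastive objectives, where two views of the same snippet (e.g. the code and its transformed or obfuscated version) are pulled together and other snippets in the batch act as negatives; the encoders and views here are placeholders, not any particular model's recipe:

```python
# InfoNCE over paired views of code snippets within a batch.
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.07):
    a = F.normalize(view_a, dim=-1)      # (batch, hidden) embeddings of view 1
    b = F.normalize(view_b, dim=-1)      # (batch, hidden) embeddings of view 2
    logits = a @ b.t() / temperature     # similarity of every pair in the batch
    labels = torch.arange(a.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

emb_code = torch.randn(8, 256)           # stand-in for encoder output of original code
emb_transformed = torch.randn(8, 256)    # stand-in for encoder output of the other view
print(info_nce(emb_code, emb_transformed))
```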
et al., 2020) have also left a notable mark in the ilar to deobfuscation: semantically equivalent but
past few years’ advancement in language modeling. unnatural code is generated by predefined opera-
T5, for example, unifies all textual tasks into a se- tions such as loop transformation, dead code injec-
quence to sequence format and sets new records on tion, and variable renaming, and the model is pre-
GLUE (Wang et al., 2018) and SuperGLUE (Wang trained to translate these unnatural code back to its
et al., 2019). Compared with encoder-only models, original form. We note that some of these models
encoder-decoders are naturally more powerful as are built on previous works. For example, NatGen
they can be used for conditional text generation, is initialized with CodeT5, while the largest ver-
while their encoder part can always be taken out to sion of CodeT5+ is initialized from a decoder-only
perform tasks that require an encoder-only archi- model, CodeGen (Nijkamp et al., 2023).
tecture, such as regression (Tay et al., 2023). Apart from these general pretraining objec-
Inspired by these advantages of encoder-decoder tives, several works have also trained Transformer
architecture, many such models have been pro- encoder-decoders with a focus on code translation,
posed for code processing. PyMT5 (Clement et al., which is a natural application of Transformer mod-
2020) and Mastropaolo et al. (2021) replicate the els in code as the Transformer architecture was
pretraining and multi-task finetuning process of T5 originally proposed by Vaswani et al. (2017) for
on code corpus, while Ahmad et al. (2021) intro- machine translation (MT). However, unlike natu-
duce PLBART, a BART pretrained on 655GB com- ral languages, where parallel corpus across two or
bined data of Java, Pyhton, and natural language. more human languages exist in abundance, there
Lachaux et al. (2021) argue that MLM could be too is little parallel data for code. To tackle this issue,
easy a task for programming languages as identi- Rozière et al. (2020) propose Transcoder, which
fier names often occur multiple times in a single first pretrains an encoder with XLM (Conneau and
context window, and propose a deobfuscation pre- Lample, 2019), and then initializes a vanilla Trans-
training objective, where the model is trained to former with this encoder and continue to pretrain it
convert obfuscated code back to its original form. with Denoising Auto-Encoding (DAE, Lewis et al.,
Related to this method, we note that meaningful 2020) and back translation (Sennrich et al., 2016),
variable names have also been found to have a posi- while its follow-up work (Szafraniec et al., 2023)
tive impact on the code generation process of large also utilize language-independent intermediate rep-
language models (Chen et al., 2022). resentations to enhance this process, which we dis-
Building on these early works, Wang et al. (2021) propose CodeT5, which is pretrained alternately with 1) T5's original span corruption; 2) identifier tagging (where each token in the code input is tagged as either identifier or non-identifier); 3) masked identifier prediction (a special form of span corruption where all identifiers are masked); and 4) text-to-code & code-to-text generation. Its successor, CodeT5+ (Wang et al., 2023), takes inspiration from UL2 (Tay et al., 2023) and introduces causal language modeling (CLM) into pretraining, along with additional contrastive objectives based on text-code matching.
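As a concrete illustration of masked identifier prediction, the sketch below (our simplification, which borrows T5's sentinel-token spelling purely as an assumption) replaces every unique identifier in a snippet with a sentinel and builds the corresponding recovery target.

```python
import ast, re

def mask_identifiers(source: str):
    """Toy masked-identifier-prediction example: every unique identifier is
    replaced by one sentinel, and the target lists the sentinel-name pairs the
    model has to recover. The <extra_id_*> spelling is only for illustration."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.FunctionDef):
            names.add(node.name)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    sentinels = {name: f"<extra_id_{i}>" for i, name in enumerate(sorted(names))}
    masked = source
    for name, sentinel in sentinels.items():
        masked = re.sub(rf"\b{re.escape(name)}\b", sentinel, masked)
    target = " ".join(f"{s} {n}" for n, s in sentinels.items())
    return masked, target

masked, target = mask_identifiers("def area(width, height):\n    return width * height")
print(masked)  # the source with every identifier replaced by its sentinel
print(target)  # "<extra_id_0> area <extra_id_1> height <extra_id_2> width"
```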
AlphaCode (Li et al., 2022) is also trained with multiple objectives, where the encoder is trained with MLM and the decoder is trained with CLM, with architecture modifications such as shallow-encoder & deep-decoder, multi-query attention (Shazeer, 2019), and being much larger than CodeT5 (up to 41B parameters). NatGen (Chakraborty et al., 2022), on the other hand, is pretrained with a "naturalization" objective similar to deobfuscation: semantically equivalent but unnatural code is generated by predefined operations such as loop transformation, dead code injection, and variable renaming, and the model is pretrained to translate this unnatural code back to its original form. We note that some of these models are built on previous works. For example, NatGen is initialized with CodeT5, while the largest version of CodeT5+ is initialized from a decoder-only model, CodeGen (Nijkamp et al., 2023).

Apart from these general pretraining objectives, several works have also trained Transformer encoder-decoders with a focus on code translation, which is a natural application of Transformer models in code, as the Transformer architecture was originally proposed by Vaswani et al. (2017) for machine translation (MT). However, unlike natural languages, where parallel corpora across two or more human languages exist in abundance, there is little parallel data for code. To tackle this issue, Rozière et al. (2020) propose Transcoder, which first pretrains an encoder with XLM (Conneau and Lample, 2019), and then initializes a vanilla Transformer with this encoder and continues to pretrain it with Denoising Auto-Encoding (DAE, Lewis et al., 2020) and back translation (Sennrich et al., 2016), while its follow-up work (Szafraniec et al., 2023) also utilizes language-independent intermediate representations to enhance this process, which we discuss in more detail in §6.

Apart from training data and objectives, these models mostly keep to the original architectures proposed by the NLP community, as shown in Table 3. Models based on BART, for example, use post-normalization and learnable absolute position embeddings, while those based on T5 use its simplified relative position embeddings and pre-normalization.

5.4 Decoders

After the monumental debut of GPT-3 (Brown et al., 2020) and the discovery of in-context learning, decoder-only Transformer models have become dominant in language modeling (Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Scao et al., 2022; Touvron et al., 2023; Touvron et al., 2023, inter alia). Many models similarly pretrained with CLM have also emerged in code processing, such as GPT-C (Svyatkovskiy et al., 2020), CodeGPT (Lu et al., 2021), PolyCoder (Xu et al., 2022), CodeGen
(Nijkamp et al., 2023), PyCodeGPT (Zan et al., 2022), Pangu-Coder (Christopoulou et al., 2022), CodeGeeX (Zheng et al., 2023), Phi-1 (Gunasekar et al., 2023), CodeFuse (Di et al., 2023), CodeShell5, and DeepSeek Coder6. Of these models, several alternative training objectives have been experimented with, such as MLM and Masked CLM7 in Pangu-Coder, but are found to underperform compared with CLM-only training. Zan et al. (2022) also propose continual training on sketches, where the model learns to first generate a sketch of a program and then the actual code. Notably, Gunasekar et al. (2023) present Phi-1, a 1.3B small model trained on a dataset of only 7B tokens, consisting of 6B tokens from StackOverflow and 1B tokens of synthetic data generated by ChatGPT, but achieving 50.6 pass@1 on HumanEval and 55.5 pass@1 on MBPP, comparable to much larger (in both model size and training data size) models such as Code LLaMA or PaLM 2.
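The pass@1 scores quoted here (and in Table 4) are computed with the unbiased pass@k estimator introduced alongside HumanEval by Chen et al. (2021); a direct transcription of that formula is shown below.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples are drawn for a
    problem, c of them pass all unit tests, and k is the evaluation budget.
    Returns the estimated probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 30 of which pass the hidden tests:
print(round(pass_at_k(200, 30, 1), 3))   # 0.15 (for k = 1 this equals c / n)
print(round(pass_at_k(200, 30, 10), 3))  # roughly 0.81
```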
Although Christopoulou et al. (2022) report denoising objectives to underperform in decoder-only models, there have been other works that successfully combine denoising or multi-task pretraining with the decoder architecture. InCoder (Fried et al., 2023), SantaCoder (Allal et al., 2023), and StarCoder (Li et al., 2023) are all trained with the fill-in-the-middle (FIM) objective, also referred to as causal masking by Fried et al. (2023), which is essentially span corruption (Raffel et al., 2020) adapted to the decoder-only architecture. One of the visible advantages of these infilling objectives is that they inject the models with the ability to fill in blanks in the middle of input code at inference time, while CLM allows only for autoregressive generation. As Table 4 shows, however, these objectives also lead to higher performance on downstream tasks when compared with CLM-only models such as CodeGen.
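To illustrate how FIM turns an ordinary autoregressive training document into an infilling example, the sketch below rearranges a randomly chosen middle span into prefix-suffix-middle order; the sentinel token spellings are placeholders of our own, as each model defines its own special tokens.

```python
import random

# Placeholder sentinel spellings; InCoder, SantaCoder, and StarCoder each use
# their own special tokens for the same roles.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_example(code: str, rng: random.Random) -> str:
    """Rearrange a plain CLM training document into a fill-in-the-middle example
    (prefix-suffix-middle order), so that an autoregressive decoder learns to
    generate the missing middle conditioned on both surrounding contexts."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
doc = "def square(x):\n    return x * x\n"
print(to_fim_example(doc, rng))
```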
Observing Table 3, it is clear that decoder-only models for code have generally followed the practices in NLP more closely, when compared with other model architectures. All these models use pre-normalization, while MQA, RoPE, and parallel attention have also been adopted by several models. Notably, the three most recent models - StarCoder, Phi-1, and CodeFuse - also employ FlashAttention to improve model throughput.

5 https://github.com/WisdomShell/codeshell
6 https://github.com/deepseek-ai/DeepSeek-Coder
7 In their paper, MLM is conducted by replacing tokens in the input with <mask> and predicting them from only the left context, while Masked CLM is performed by adding a <mask> in the input and predicting the next token from it. Neither task changes the attention mask patterns of the model.

Model              Size    HumanEval   MBPP
PolyCoder          2.7B    5.6         -
CodeGen-Mono       16.1B   29.3        35.3
InCoder            6.7B    15.2        19.4
PyCodeGPT          110M    8.3         -
Pangu-Coder        2.6B    23.8        23.0
SantaCoder         1.1B    14.0        35.0
CodeGeeX           13B     22.9        24.4
StarCoder          15.5B   33.6        52.7
CodeT5+            16B     30.9        -
Phi-1              1.3B    50.6        55.5
CodeFuse           13B     24.8        -
InstructCodeT5+    16B     35.0        -
WizardCoder        15.5B   57.3        51.8
Pangu-Coder 2      15.5B   61.6        -
OctoCoder          15.5B   46.2        -
CodeFuse-SFT       13B     37.1        -
GPT-4              -       67.0/82     -
PaLM 2-S*          -       37.6        50.0
Code LLaMA         34B     53.7        56.2
Phi-1.5            1.3B    41.4        43.5

Table 4: Pass@1 performance of pretrained code models (top) and instruction finetuned code models (middle), in comparison with some of the best general language models (bottom), with models in each category ordered chronologically. The sources of these figures can be found in §5.3, §5.4, and Table 1.

5.5 UniLMs

Following UniLM (Dong et al., 2019) in NLP, several works in code processing have also pretrained this fourth family of Transformer models on code. CugLM (Liu et al., 2020) is trained with both CLM and MLM + NSP via alternating attention masks, while UniXcoder is trained with CLM, MLM, and Span Corruption (in Prefix LM style) along with auxiliary objectives including contrastive learning and text-code mutual generation. Both models, however, are relatively small in size, and whether or not this architecture is suitable for code processing is yet to be explored.

5.6 Diffusion Models

Currently the Transformer architecture dominates text generation, but several works (Li et al., 2022; Lin et al., 2023) have also adopted Diffusion Models (Ho et al., 2020) from computer vision for text generation.
Recently, CodeFusion (Singh et al., 2023) also introduces diffusion models into code modeling, and demonstrates that a 75M diffusion model can outperform StarCoder, CodeT5+, and GPT-3 on 3 code synthesis datasets.

5.7 Instruction Finetuning and Reinforcement Learning for Code

In natural language processing, training models on a diverse set of tasks with instruction prefixes, known as instruction finetuning, has been shown to unlock the ability of cross-task generalization (Ouyang et al., 2022; Chung et al., 2022; Iyer et al., 2022). At first, these instruction data samples were manually compiled or crowd-sourced (Wei et al., 2022; Sanh et al., 2022), but later research finds LLM-generated instructions to be sufficient (Wang et al., 2023; Honovich et al., 2023).

Following these works in natural language, researchers from the code community have applied instruction tuning to their models as well. Wang et al. (2023) finetune CodeT5+ with 20K instruction data generated by InstructGPT (Ouyang et al., 2022) to obtain InstructCodeT5+. WizardCoder (Luo et al., 2023) follows the methods of WizardLM (Xu et al., 2023) to evolve 20K Code Alpaca (Taori et al., 2023) samples into a 78K dataset and uses it to finetune StarCoder. Pangu-Coder 2 (Shen et al., 2023) also uses WizardLM's Evol-Instruct to generate 68K instruction samples from 20K Code Alpaca, but also introduces reinforcement learning via Rank Responses to align Test & Teacher Feedback (RRTF). OctoCoder (Muennighoff et al., 2023), on the other hand, takes a different path and uses Git commit histories as instruction data to finetune StarCoder and CodeGeeX2. More recently, CodeFuse (Di et al., 2023) also employs multi-task finetuning and explicitly introduces multiple downstream tasks into their instruction data. The performance of these instruction finetuned code models can also be found in Table 4.
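For concreteness, a typical instruction sample for code finetuning looks like the Alpaca-style record below; the field names and prompt template follow common convention and are not tied to any specific dataset cited above.

```python
import json

# A minimal Alpaca-style instruction sample for code, of the kind that can be
# evolved by Evol-Instruct or derived from Git commits; field names follow the
# common instruction-tuning convention rather than any single cited dataset.
sample = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": (
        "def is_palindrome(s: str) -> bool:\n"
        "    s = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
        "    return s == s[::-1]\n"
    ),
}

# During finetuning the instruction (plus optional input) is rendered into a
# prompt, and the loss is usually computed only on the tokens of the output.
prompt = f"### Instruction:\n{sample['instruction']}\n\n### Response:\n"
print(json.dumps(sample, indent=2))
print(prompt + sample["output"])
```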
In NLP, another technology closely related to instruction finetuning is reinforcement learning from human feedback (RLHF), which has played a significant role in aligning LLMs with human values (Ouyang et al., 2022; Bai et al., 2022). The merit of reinforcement learning is that it can incorporate non-differentiable reward signals into training, such as BLEU (Bahdanau et al., 2017) and human preference (Christiano et al., 2017), but the human feedback required in aligning LLMs often involves extensive labor on annotation. In comparison, applying reinforcement learning to code models has a natural advantage, as compilers can be used for automatically generating feedback for code samples produced by language models.

CodeRL (Le et al., 2022) is one such model, which defines four levels of rewards for each generated program (viz. compile error, runtime error, unit test failure, pass) as well as a fine-grained token-level reward estimated by a critic model. The actor model, which is an extension of CodeT5, is then trained with the REINFORCE algorithm (Williams, 1992). Similarly, CompCoder (Wang et al., 2022) and PPOCoder (Shojaee et al., 2023) train CodeGPT and CodeT5 respectively with proximal policy optimization (Schulman et al., 2017), while RLTF (Liu et al., 2023) proposes fine-grained feedback based on the error information and location provided by the compiler, as well as adaptive feedback that takes the ratio of passed test cases into account.
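As a rough sketch of how such compiler and unit-test feedback can be turned into scalar rewards, the snippet below maps the four outcome levels described above to example values and adds a pass-ratio term in the spirit of RLTF's adaptive feedback; the concrete numbers are our own placeholders rather than any paper's settings.

```python
from enum import Enum

class Outcome(Enum):
    COMPILE_ERROR = "compile_error"
    RUNTIME_ERROR = "runtime_error"
    TEST_FAILURE = "test_failure"
    PASS = "pass"

# Illustrative reward levels mirroring CodeRL's four feedback categories; the
# specific values here are placeholders chosen for this sketch.
REWARD = {
    Outcome.COMPILE_ERROR: -1.0,
    Outcome.RUNTIME_ERROR: -0.6,
    Outcome.TEST_FAILURE: -0.3,
    Outcome.PASS: 1.0,
}

def reward_for(outcome: Outcome, passed: int = 0, total: int = 1) -> float:
    """Return a scalar reward for one generated program; the pass-ratio bonus
    mimics adaptive feedback based on how many unit tests pass."""
    base = REWARD[outcome]
    if outcome in (Outcome.TEST_FAILURE, Outcome.PASS) and total > 0:
        base += 0.5 * passed / total
    return base

print(reward_for(Outcome.TEST_FAILURE, passed=3, total=10))  # -0.15
```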
6 Code Features for Language Models

A major difference between programming languages and natural languages is that the former are artificially defined to be precise and unambiguous, and need to be compiled (or interpreted) without error before execution. This allows for much larger flexibility in designing pretraining objectives on code, besides lexical manipulations such as CLM, MLM, and Span Corruption. A similar trend can be observed in the last years before neural networks were introduced into mainstream NLP literature (Sutskever et al., 2014; Bahdanau et al., 2015), when researchers in the MT community utilized alternative views of text such as syntactic features to improve the performance of SMT systems (Galley et al., 2006; Chiang, 2007). These features, however, are not universally applicable or even agreed upon, and often result in highly complicated systems (for example, the size of English part-of-speech tagging's label set may range from dozens to hundreds).

Programming languages, however, fare much better in these aspects. Each mainstream programming language, such as C, Python, and Java, comes with readily available compiler toolkits that allow for easy and accurate extraction of semantic information such as Abstract Syntax Tree (AST), language-independent Intermediate Representation
(IR), and auxiliary information such as the type of each token and control/data flow graphs (CFG/DFG). Thus, in the context of Transformer-based language modeling for code, many works have incorporated these features into their training procedure.

6.1 Abstract Syntax Tree and Intermediate Representation

AST is one of the most common intermediate results of the compiling process, where a program is parsed into a tree of operations and their operands. Before the popularization of Transformers in the code processing community, there had been works such as InferCode (Bui et al., 2021) that process these representations with special network architectures like Tree-Based CNN and conduct self-supervised pretraining by predicting subtrees.

TreeBERT (Jiang et al., 2021) is one of the first attempts to take AST into the Transformer-based pretraining-finetuning framework. It is a Transformer encoder-decoder pretrained with Tree MLM and Node Order Prediction, where the encoder takes a set of constituent paths in the AST as input (with each token being a path, which is the concatenation of its nodes' representations) while the decoder takes the code as input. Tree MLM is then performed by masking certain nodes in a path representation and their corresponding code tokens in the decoder input, while Node Order Prediction is accomplished by swapping nodes in a path and predicting whether the order has been changed with a [CLS] token, similar to BERT.

The method used by TreeBERT, however, is complicated and does not scale well. Later works mostly opt to first process AST into a text sequence and treat it like a normal part of the input. Wang et al. (2021), for example, process AST with depth-first traversal and concatenate it with code and comment, and then train SynCoBERT (which, unlike TreeBERT, is actually a BERT-like encoder-only model) with four objectives: 1) MLM; 2) identifier tagging; 3) AST edge prediction (predicting whether there exists an edge between two AST nodes from the dot product of these nodes' representations); and 4) contrastive learning over i) code and AST pairs, as well as ii) text and code-AST pairs. Similarly, SPT-Code (Niu et al., 2022), a Transformer encoder-decoder, takes the concatenation of code, sequentialized AST, and text as input, and is pretrained with 1) span corruption; 2) code-AST prediction (NSP with one segment being code and one segment being AST); and 3) method name generation, a special form of span corruption where a method name is masked. Different from other works, however, they do not take the docstrings as the text segment in their input, but instead concatenate all method names appearing in the code as a succinct natural language description. Likewise, UniXcoder (Guo et al., 2022) takes flattened AST instead of source code as its input during training.
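As a minimal illustration of such sequentialization, the sketch below flattens a Python AST by depth-first traversal using only the standard library; the exact serialization (which node labels and values to keep) differs from work to work.

```python
import ast

def flatten_ast(source: str) -> list[str]:
    """Linearize a Python AST by depth-first traversal into a token sequence,
    roughly in the spirit of models that feed a sequentialized AST alongside
    the source code; the serialization format here is our own choice."""
    def walk(node):
        label = type(node).__name__
        # keep identifiers and constants so the sequence stays informative
        if isinstance(node, ast.Name):
            label += f":{node.id}"
        elif isinstance(node, ast.Constant):
            label += f":{node.value!r}"
        elif isinstance(node, ast.FunctionDef):
            label += f":{node.name}"
        yield label
        for child in ast.iter_child_nodes(node):
            yield from walk(child)

    return list(walk(ast.parse(source)))

print(flatten_ast("def double(x):\n    return x + 2"))
# ['Module', 'FunctionDef:double', 'arguments', 'arg', 'Return', 'BinOp', 'Name:x', 'Load', 'Add', 'Constant:2']
```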
In the compiling pipeline, AST is usually followed by language-independent intermediate representations, such as LLVM IR (Lattner and Adve, 2004). Such features' independence from specific programming languages makes them suitable candidates for translation pivots, as is English in machine translation of low-resource natural languages (Leng et al., 2019). Szafraniec et al. (2023) take advantage of this characteristic and extend Transcoder (Rozière et al., 2020) with translation language modeling (Conneau and Lample, 2019) over code and IR, as well as IR generation from code. They also investigate other objectives such as IR decompilation (i.e. generating code from IR) and IR pivot (i.e. directly generating code in one language from the IR of another language), both showing promising results.

6.2 Control Flow and Data Flow

While AST and IR have proved to be useful information in certain tasks such as code translation, they are static by nature, just like the source code, and may fail to capture semantic properties of code that are only revealed at runtime (Wang and Su, 2020). Such semantics, however, are contained in dynamic features such as control flow and data flow. Similar to AST, specialized networks were used to process such information before the rise of pretrained Transformers, such as the Message Passing Neural Network used by ProGraML (Cummins et al., 2021). Unlike AST, however, even after pretrained Transformers became dominant, few works have looked in this direction.

GraphCodeBERT (Guo et al., 2021) is one such work, which creates special tokens and position embeddings for variables in the flow graph, and concatenates the variable sequence after text and source code to construct model input, with tailored attention masks on the code and variable segments: tokens from the code segment and variable segment can attend to each other if and only if the variable is identified from the code token, and for tokens within the variable segment, v_i is allowed
to attend to v_j if there is a direct edge from v_j to v_i in the dataflow. The model is then pretrained with MLM in combination with edge prediction and node alignment, both of which are accomplished by binary classification from the dot product of two tokens' representations (one from the code segment and one from the variable segment for node alignment, and both from the variable segment for edge prediction).
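The sketch below illustrates this masking scheme on a toy scale, building a boolean attention mask over concatenated code and variable tokens; it is a didactic simplification under our own assumptions, not GraphCodeBERT's implementation.

```python
import numpy as np

def graph_attention_mask(n_code: int, var_to_code: dict[int, int],
                         dataflow_edges: list[tuple[int, int]]) -> np.ndarray:
    """Build a simplified GraphCodeBERT-style attention mask over
    [code tokens | variable tokens]: code tokens attend to each other freely; a
    variable and a code token may attend to each other only if the variable was
    identified from that code token; variable v_i may attend to v_j only if the
    data flow contains an edge from v_j to v_i."""
    n_var = len(var_to_code)
    n = n_code + n_var
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_code, :n_code] = True                      # code <-> code
    for v, c in var_to_code.items():                   # variable <-> its source code token
        mask[n_code + v, c] = mask[c, n_code + v] = True
    for src, dst in dataflow_edges:                    # edge v_src -> v_dst
        mask[n_code + dst, n_code + src] = True        # v_dst may attend to v_src
    np.fill_diagonal(mask, True)
    return mask

# three code tokens, two variables; variable 1's value comes from variable 0
print(graph_attention_mask(3, {0: 1, 1: 2}, [(0, 1)]).astype(int))
```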
6.3 Type

Apart from AST, IR, and data flow, type information has also been used to aid language models in processing code. CugLM (Liu et al., 2020), for example, uses type information during finetuning to aid in the prediction of tokens for unidirectional MLM (i.e. MLM with a unidirectional attention mask): the type of a masked token is first predicted from the final Transformer layer's representation, and then the token itself is predicted based on both the hidden representation and the predicted type. In contrast, both CodeT5 (Wang et al., 2021) and SynCoBERT (Wang et al., 2021) include identifier tagging in their pretraining objectives, which can be viewed as coarse-grained type prediction.

Notably, Wang et al. (2022) integrate many of the aforementioned features into Code-MVP: source code, docstrings, AST, CFG, and transformed source code via identifier renaming, loop exchange, and dead code insertion. The model, initialized from GraphCodeBERT, is then trained with MLM, fine-grained type prediction, and contrastive learning across different views, such as text vs. code, code vs. AST, and code vs. CFG.

7 LLMs in Software Development

As language models set new records on software engineering benchmarks, software engineering technologies are also expanding the boundaries of language models in return, and have subsequently led them into real-world development cycles.

7.1 LLMs Extended with Coding Tools

Research in the NLP community has shown that LLMs can learn to use external tools such as calculators, MT systems, and search engines (Thoppilan et al., 2022; Schick et al., 2023). As such, interpreters have been used to augment LLMs in complex reasoning tasks. PAL (Gao et al., 2023) and PoT (Chen et al., 2022) both extend Codex with Python interpreters for numerical calculations, while ViperGPT (Surís et al., 2023) extends it further by calling vision APIs to extract information from visual input and answer related questions.

Apart from alleviating the burden of numerical calculation in abstract reasoning tasks, interpreters also provide feedback on the process of code generation itself, together with unit tests. CodeT (Bareiß et al., 2022) and TiCoder (Chen et al., 2023) use Codex to generate unit tests, which are run against generated code samples to improve the model's performance on code synthesis. Similarly, TransCoder-ST (Rozière et al., 2022) augments TransCoder and DOBF with external unit tests for code translation. In §5.7, we have also shown that the execution results on unit tests serve as natural supervision signals for reinforcement learning on code.
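A minimal harness for this kind of execution feedback is sketched below: each generated sample is run against a unit-test snippet in a subprocess and the pass rate over samples is computed; sandboxing and resource limits, which a real evaluation needs, are deliberately omitted.

```python
import os, subprocess, sys, tempfile, textwrap

def passes_tests(candidate: str, test_code: str, timeout: float = 5.0) -> bool:
    """Run one generated solution against a unit-test snippet in a fresh
    subprocess and report whether every assertion passes."""
    program = candidate + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

tests = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")
samples = ["def add(a, b):\n    return a + b\n", "def add(a, b):\n    return a - b\n"]
pass_rate = sum(passes_tests(s, tests) for s in samples) / len(samples)
print(pass_rate)  # 0.5: only the first sample passes both assertions
```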
Notably, in March 2023 OpenAI also released an interpreter plugin for ChatGPT8, which can accept file inputs from users, generate code according to user instructions, and provide feedback via real-time execution. Zhou et al. (2023) show that this feature allows GPT-4 to self-debug.

8 https://openai.com/blog/chatgpt-plugins#code-interpreter

A topic closely related to tool use in LLM research is planning as intelligent agents, which has been shown to enhance LLMs' capability both theoretically and empirically (Feng et al., 2023). Ruan et al. (2023) find that LLMs can plan to solve complex tasks using external SQL generators and Python generators, while CodePlan (Bairi et al., 2023) demonstrates they can perform repository-level coding via adaptive planning.

Another stream of works uses LLMs to create multi-agent systems for code generation, such as self-collaboration (Dong et al., 2023), ChatDev (Qian et al., 2023), and MetaGPT (Hong et al., 2023). In these frameworks, multiple LLMs are prompted to play distinct roles such as programmer, reviewer, and manager. These roles interact with each other, break down code generation into different phases (e.g. designing, coding, testing, and documenting), and collaborate to complete complex tasks.

7.2 LLMs Integrated into Software Development

With the increase in LLMs' interactive coding capability, researchers have also started to integrate them into each and every process of software
development.

Auto code completion is one of the earliest applications of language models in software development, as they require only the ability to predict the next token. Even before language models scaled to billions of parameters, there had been integration of completion systems such as Pythia (Svyatkovskiy et al., 2019) and IntelliCode (Svyatkovskiy et al., 2020) into popular IDEs.
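At its core, such completion reduces to sampling a continuation from a causal language model; a minimal sketch with the Hugging Face transformers API is shown below, where the checkpoint name is a placeholder for whichever code model is deployed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifier: substitute any open causal code model checkpoint here.
checkpoint = "your-code-model-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The editor sends the code to the left of the cursor as the prompt; the model
# completes it token by token.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```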
Recently, however, the application of code language models has transcended simple code completion. GitHub Copilot is arguably one of the most popular AI code assistants, with diverse features including code generation, vulnerability detection, and license management9, while CodeFuse (Di et al., 2023) also integrates code generation, code translation, code commenting, and test case generation into a single IDE extension. As code language models become larger, however, their client-side deployment and real-time performance also raise new challenges.

As LLMs continue to advance, building applications on top of them is also evolving into a consequential task itself. Many open-source frameworks for such applications have been released, including LangChain10, AutoGPT11, and WorkGPT12. These frameworks provide abstractions over language models for developers, and are actively revolutionizing the entire process of software development even as this survey is being finalized.

9 https://github.com/features/copilot
10 https://www.langchain.com/
11 https://github.com/Significant-Gravitas/AutoGPT
12 https://github.com/team-openpm/workgpt

8 Conclusion and Challenges

In this work, we systematically reviewed the history of code processing with pretrained Transformer language models, and highlighted their relations and comparisons to models pretrained on general domains. The advancement in code modeling generally follows the historical course of NLP, evolving from SMT models, to NMT models, then to finetuning pretrained Transformers, and lastly to few-shot application of LLMs and even autonomous agents in real-world production. Unlike natural languages, the nature of code makes it easy to extract auxiliary information from alternative views, and to utilize interpreters and unit tests for automatic feedback.

With these in mind, we identify several challenges in the current development of code modeling.

- Comprehensive benchmarks to push code LLMs to the next stage. The widely used HumanEval benchmark plays a key role in the evolution of Code LLMs. However, it is relatively small and its scoreboard has been manipulated to near perfect, which does not exactly reflect real-world behaviors. Many other benchmarks for Code LLMs have been proposed, but they are still not comprehensive enough to reflect production-level requirements. The community is eager for a new standard benchmark after HumanEval to further boost the progress of Code LLMs to the next stage.

- Acquisition of high-quality data. With Gunasekar et al. (2023) achieving SOTA performance with a 1.3B model trained on textbook data, we believe the selection of training data and the utilization of synthetic data will be ever more prominent in the near future, for both self-supervised pretraining and supervised finetuning.

- Integration of code features into language models. As we noted in §6.2, CFG and DFG are yet to be employed at scale in code language modeling. The few works that do employ data flow make changes to the models' attention masks, which severely limits their cross-task generalization and scaling ability. We believe the seamless integration of such features into textual input is worth researching in the future.

- Application of LLMs in more code downstream tasks. As we have pointed out in §3, current evaluation of LLMs' coding capability is focused on program synthesis, and Figure 3 clearly shows that tasks related to software testing (viz. unit test generation, assertion generation, mutant generation, and fuzzing) and deobfuscation have seen few applications of LLMs. Besides, since the context window of LLMs is currently quite limited, generation tasks such as program synthesis and code translation are yet to be applied beyond method level. In §3.4, we have listed several works on repository-level code completion and temporal editing, and we believe the application of LLMs in more repository-level tasks will become a hot research topic in the future.

- Alternative model architectures and training objectives. In Table 3, we have shown that many code language models are pretrained with auxiliary objectives specific to code, but these models all
belong to the encoder-only or encoder-decoder family, while decoder-only models are yet to be augmented with alternative objectives. Also, as pioneered by Singh et al. (2023), we believe diffusion models will find their ground in code modeling in the future.

- Building a code LLM ecosystem for the full life-cycle of software development. While academia has witnessed an abundance of code models, most have been deployed in the coding stage as IDE plugins, while neglecting other stages in the life-cycle of software development. In §7.2 we mentioned several inspiring examples, and we are hoping to see more applications of code LMs throughout the full life-cycle of software development, from requirement analysis to DevOps, eventually leading to full-scale ecosystems like those around PyTorch (Paszke et al., 2019) and Hugging Face13.

- Safety and ethics issues related to code LLMs. As language models grow in might, they also raise safety concerns including but not limited to data contamination, toxic or biased generation, personal information leaks, and hallucinations. In software development, these models should be deployed with extra caution, as their generated code may contain security risks leading to catastrophic results. Pretraining data is also becoming a sensitive topic of ethics, and Kocetkov et al. (2022) take a meaningful step towards this issue by allowing developers to remove their code from the Stack. As synthetic training data becomes widespread, researchers should also proceed with caution about such practice, as the consequence of training AI models with AI-generated data is yet to be investigated at scale.

With the presentation of this survey, we hope to provide a global view of language models' application in software engineering and connect the research from the two communities. We believe the current surge of LLMs will ultimately be transformed into real-world applications, and lead humanity into a brighter future.

13 https://huggingface.co/

References

[1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4998–5007. Association for Computational Linguistics.

[2] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2655–2668. Association for Computational Linguistics.

[3] Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. 2023. AVATAR: A parallel corpus for java-python program translation. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 2268–2281. Association for Computational Linguistics.

[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: training generalized multi-query transformer models from multi-head checkpoints. CoRR, abs/2305.13245.

[5] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Muñoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy-Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, and Leandro von Werra. 2023. Santacoder: don't reach for the stars! CoRR, abs/2301.03988.

[6] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014, pages 281–293. ACM.

[7] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, pages 38–49. ACM.

[8] Miltiadis Allamanis, Earl T. Barr, Soline Ducousso, and Zheng Gao. 2020. Typilus: neural type hints. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, pages 91–105. ACM.
[9] Miltiadis Allamanis, Earl T. Barr, René Just, and [18] Uri Alon, Meital Zilberstein, Omer Levy, and Eran
Charles Sutton. 2016. Tailored mutants fit bugs better. Yahav. 2019. code2vec: learning distributed rep-
CoRR, abs/1611.02516. resentations of code. Proc. ACM Program. Lang.,
3(POPL):40:1–40:29.
[10] Miltiadis Allamanis, Marc Brockschmidt, and Mah-
moud Khademi. 2018. Learning to represent pro- [19] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik
grams with graphs. In 6th International Conference Koncel-Kedziorski, Yejin Choi, and Hannaneh Ha-
on Learning Representations, ICLR 2018, Vancouver, jishirzi. 2019. Mathqa: Towards interpretable math
BC, Canada, April 30 - May 3, 2018, Conference word problem solving with operation-based for-
Track Proceedings. OpenReview.net. malisms. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for
[11] Miltiadis Allamanis, Hao Peng, and Charles Sutton. Computational Linguistics: Human Language Tech-
2016. A convolutional attention network for extreme nologies, NAACL-HLT 2019, Minneapolis, MN, USA,
summarization of source code. In Proceedings of the June 2-7, 2019, Volume 1 (Long and Short Papers),
33nd International Conference on Machine Learning, pages 2357–2367. Association for Computational
ICML 2016, New York City, NY, USA, June 19-24, Linguistics.
2016, volume 48 of JMLR Workshop and Conference [20] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin
Proceedings, pages 2091–2100. JMLR.org. Johnson, Dmitry Lepikhin, Alexandre Passos, Sia-
mak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng
[12] Miltiadis Allamanis and Charles Sutton. 2013. Min- Chen, Eric Chu, Jonathan H. Clark, Laurent El
ing source code repositories at massive scale using Shafey, Yanping Huang, Kathy Meier-Hellstern, Gau-
language modeling. In Proceedings of the 10th Work- rav Mishra, Erica Moreira, Mark Omernick, Kevin
ing Conference on Mining Software Repositories, Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao,
MSR ’13, San Francisco, CA, USA, May 18-19, 2013, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández
pages 207–216. IEEE Computer Society. Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham,
Jan A. Botha, James Bradbury, Siddhartha Brahma,
[13] Miltiadis Allamanis and Charles Sutton. 2014. Min- Kevin Brooks, Michele Catasta, Yong Cheng, Colin
ing idioms from source code. In Proceedings of Cherry, Christopher A. Choquette-Choo, Aakanksha
the 22nd ACM SIGSOFT International Symposium Chowdhery, Clément Crepy, Shachi Dave, Mostafa
on Foundations of Software Engineering, (FSE-22), Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz,
Hong Kong, China, November 16 - 22, 2014, pages Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxi-
472–483. ACM. aoyu Feng, Vlad Fienber, Markus Freitag, Xavier
Garcia, Sebastian Gehrmann, Lucas Gonzalez, and
[14] Mohammad Moein Almasi, Hadi Hemmati, Gordon et al. 2023. Palm 2 technical report. CoRR,
Fraser, Andrea Arcuri, and Janis Benefelds. 2017. abs/2305.10403.
An industrial evaluation of unit test generation: Find-
ing real faults in a financial application. In 39th [21] Samuel Arcadinho, David Aparício, Hugo Veiga,
IEEE/ACM International Conference on Software En- and António Alegria. 2022. T5QL: taming language
gineering: Software Engineering in Practice Track, models for SQL generation. CoRR, abs/2209.10254.
ICSE-SEIP 2017, Buenos Aires, Argentina, May 20-
28, 2017, pages 263–272. IEEE Computer Society. [22] Ellen Arteca, Sebastian Harner, Michael Pradel,
and Frank Tip. 2022. Nessie: Automatically test-
[15] Uri Alon, Shaked Brody, Omer Levy, and Eran Ya- ing javascript apis with asynchronous callbacks. In
hav. 2019. code2seq: Generating sequences from 44th IEEE/ACM 44th International Conference on
structured representations of code. In 7th Inter- Software Engineering, ICSE 2022, Pittsburgh, PA,
national Conference on Learning Representations, USA, May 25-27, 2022, pages 1494–1505. ACM.
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net. [23] Jacob Austin, Augustus Odena, Maxwell I. Nye,
Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le,
[16] Uri Alon, Roy Sadaka, Omer Levy, and Eran Ya-
and Charles Sutton. 2021. Program synthesis with
hav. 2020. Structural language models of code. In
large language models. CoRR, abs/2108.07732.
Proceedings of the 37th International Conference on
Machine Learning, ICML 2020, 13-18 July 2020, Vir- [24] Nathaniel Ayewah, William W. Pugh, J. David Mor-
tual Event, volume 119 of Proceedings of Machine genthaler, John Penix, and YuQian Zhou. 2007. Eval-
Learning Research, pages 245–256. PMLR. uating static analysis defect warnings on production
software. In Proceedings of the 7th ACM SIGPLAN-
[17] Uri Alon, Meital Zilberstein, Omer Levy, and Eran SIGSOFT Workshop on Program Analysis for Soft-
Yahav. 2018. A general path-based representation for ware Tools and Engineering, PASTE’07, San Diego,
predicting program properties. In Proceedings of the California, USA, June 13-14, 2007, pages 1–8. ACM.
39th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI 2018, [25] Lei Jimmy Ba, Jamie Ryan Kiros, and Geof-
Philadelphia, PA, USA, June 18-22, 2018, pages 404– frey E. Hinton. 2016. Layer normalization. CoRR,
419. ACM. abs/1607.06450.
[26] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, [33] Mike Barnett, Christian Bird, João Brunet, and Shu-
Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. vendu K. Lahiri. 2015. Helping developers help
Courville, and Yoshua Bengio. 2017. An actor-critic themselves: Automatic decomposition of code re-
algorithm for sequence prediction. In 5th Inter- view changesets. In 37th IEEE/ACM International
national Conference on Learning Representations, Conference on Software Engineering, ICSE 2015,
ICLR 2017, Toulon, France, April 24-26, 2017, Con- Florence, Italy, May 16-24, 2015, Volume 1, pages
ference Track Proceedings. OpenReview.net. 134–144. IEEE Computer Society.
[27] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua [34] Antonio Valerio Miceli Barone and Rico Sennrich.
Bengio. 2015. Neural machine translation by jointly 2017. A parallel corpus of python functions and
learning to align and translate. In 3rd International documentation strings for automated code documen-
Conference on Learning Representations, ICLR 2015, tation and code generation. In Proceedings of the
San Diego, CA, USA, May 7-9, 2015, Conference Eighth International Joint Conference on Natural
Track Proceedings. Language Processing, IJCNLP 2017, Taipei, Taiwan,
November 27 - December 1, 2017, Volume 2: Short
[28] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Papers, pages 314–319. Asian Federation of Natural
Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Language Processing.
Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei [35] Ezio Bartocci, Leonardo Mariani, Dejan Nickovic,
Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao and Drishti Yadav. 2023. Property-based mutation
Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui testing. In IEEE Conference on Software Testing, Ver-
Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, ification and Validation, ICST 2023, Dublin, Ireland,
Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, April 16-20, 2023, pages 222–233. IEEE.
Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu,
An Yang, Hao Yang, Jian Yang, Shusheng Yang, [36] Mohammad Bavarian, Heewoo Jun, Nikolas Tezak,
Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, John Schulman, Christine McLeavey, Jerry Tworek,
Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, and Mark Chen. 2022. Efficient training of language
Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan models to fill in the middle. CoRR, abs/2207.14255.
Zhou, and Tianhang Zhu. 2023. Qwen technical re-
port. CoRR, abs/2309.16609. [37] Berkay Berabi, Jingxuan He, Veselin Raychev, and
Martin T. Vechev. 2021. Tfix: Learning to fix cod-
[29] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda ing errors with a text-to-text transformer. In Pro-
Askell, Anna Chen, Nova DasSarma, Dawn Drain, ceedings of the 38th International Conference on
Stanislav Fort, Deep Ganguli, Tom Henighan, Machine Learning, ICML 2021, 18-24 July 2021, Vir-
Nicholas Joseph, Saurav Kadavath, Jackson Kernion, tual Event, volume 139 of Proceedings of Machine
Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Learning Research, pages 780–791. PMLR.
Hatfield-Dodds, Danny Hernandez, Tristan Hume,
Scott Johnston, Shauna Kravec, Liane Lovitt, Neel [38] Guru Prasad Bhandari, Amara Naseer, and Leon
Nanda, Catherine Olsson, Dario Amodei, Tom B. Moonen. 2021. Cvefixes: automated collection of
Brown, Jack Clark, Sam McCandlish, Chris Olah, vulnerabilities and their fixes from open-source soft-
Benjamin Mann, and Jared Kaplan. 2022. Train- ware. In PROMISE ’21: 17th International Con-
ing a helpful and harmless assistant with rein- ference on Predictive Models and Data Analytics in
forcement learning from human feedback. CoRR, Software Engineering, Athens Greece, August 19-20,
abs/2204.05862. 2021, pages 30–39. ACM.
[30] Ramakrishna Bairi, Atharv Sonwane, Aditya [39] Sahil Bhatia, Pushmeet Kohli, and Rishabh Singh.
Kanade, Vageesh D. C, Arun Iyer, Suresh 2018. Neuro-symbolic program corrector for intro-
Parthasarathy, Sriram K. Rajamani, Balasubra- ductory programming assignments. In Proceedings
manyan Ashok, and Shashank Shet. 2023. Code- of the 40th International Conference on Software En-
plan: Repository-level coding using llms and plan- gineering, ICSE 2018, Gothenburg, Sweden, May 27
ning. CoRR, abs/2309.12499. - June 03, 2018, pages 60–70. ACM.
[152] Shilin He, Jieming Zhu, Pinjia He, and Michael R. [161] Geert Heyman and Tom Van Cutsem. 2020. Neu-
Lyu. 2020. Loghub: A large collection of system ral code search revisited: Enhancing code snippet
log datasets towards automated log analytics. CoRR, retrieval through natural language intent. CoRR,
abs/2008.06448. abs/2008.12193.
[153] Zhiwei He, Tian Liang, Wenxiang Jiao, Zhu- [162] Abram Hindle, Earl T. Barr, Zhendong Su, Mark
osheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Gabel, and Premkumar T. Devanbu. 2012. On the
Tu, Shuming Shi, and Xing Wang. 2023. Exploring naturalness of software. In 34th International Con-
human-like translation strategy with large language ference on Software Engineering, ICSE 2012, June
models. CoRR, abs/2305.04118. 2-9, 2012, Zurich, Switzerland, pages 837–847. IEEE
Computer Society.
[154] Vincent J. Hellendoorn, Christian Bird, Earl T.
Barr, and Miltiadis Allamanis. 2018. Deep learning [163] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020.
type inference. In Proceedings of the 2018 ACM Denoising diffusion probabilistic models. In Ad-
Joint Meeting on European Software Engineering vances in Neural Information Processing Systems 33:
Annual Conference on Neural Information Process- transferred API knowledge. In Proceedings of the
ing Systems 2020, NeurIPS 2020, December 6-12, Twenty-Seventh International Joint Conference on
2020, virtual. Artificial Intelligence, IJCAI 2018, July 13-19, 2018,
Stockholm, Sweden, pages 2269–2275. ijcai.org.
[164] Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Comput., [173] Yang Hu, Umair Z. Ahmed, Sergey Mechtaev,
9(8):1735–1780. Ben Leong, and Abhik Roychoudhury. 2019. Re-
factoring based program repair applied to program-
[165] Jordan Hoffmann, Sebastian Borgeaud, Arthur ming assignments. In 34th IEEE/ACM Interna-
Mensch, Elena Buchatskaya, Trevor Cai, Eliza tional Conference on Automated Software Engineer-
Rutherford, Diego de Las Casas, Lisa Anne Hen- ing, ASE 2019, San Diego, CA, USA, November 11-
dricks, Johannes Welbl, Aidan Clark, Tom Henni- 15, 2019, pages 388–398. IEEE.
gan, Eric Noland, Katie Millican, George van den
Driessche, Bogdan Damoc, Aurelia Guy, Simon Osin- [174] Junjie Huang, Duyu Tang, Linjun Shou, Ming
dero, Karen Simonyan, Erich Elsen, Jack W. Rae, Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan
Oriol Vinyals, and Laurent Sifre. 2022. Training Duan. 2021. Cosqa: 20, 000+ web queries for code
compute-optimal large language models. CoRR, search and question answering. In Proceedings of
abs/2203.15556. the 59th Annual Meeting of the Association for Com-
putational Linguistics and the 11th International
[166] Christian Holler, Kim Herzig, and Andreas Zeller. Joint Conference on Natural Language Processing,
2012. Fuzzing with code fragments. In Proceed- ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual
ings of the 21th USENIX Security Symposium, Belle- Event, August 1-6, 2021, pages 5690–5700. Associa-
vue, WA, USA, August 8-10, 2012, pages 445–458. tion for Computational Linguistics.
USENIX Association.
[175] Kai Huang, Zhengzi Xu, Su Yang, Hongyu Sun,
[167] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Xuejun Li, Zheng Yan, and Yuqing Zhang. 2023.
Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven A survey on automated program repair techniques.
Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, CoRR, abs/2303.18184.
Lingfeng Xiao, and Chenglin Wu. 2023. Metagpt:
Meta programming for multi-agent collaborative [176] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei
framework. CoRR, abs/2308.00352. Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,
Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu,
[168] Yang Hong, Chakkrit Tantithamthavorn, Patana- Maosong Sun, and Junxian He. 2023. C-eval: A
mon Thongtanunam, and Aldeida Aleti. 2022. Com- multi-level multi-discipline chinese evaluation suite
mentfinder: a simpler, faster, more accurate code re- for foundation models. CoRR, abs/2305.08322.
view comments recommendation. In Proceedings of
the 30th ACM Joint European Software Engineering [177] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Mil-
Conference and Symposium on the Foundations of tiadis Allamanis, and Marc Brockschmidt. 2019.
Software Engineering, ESEC/FSE 2022, Singapore, Codesearchnet challenge: Evaluating the state of se-
Singapore, November 14-18, 2022, pages 507–519. mantic code search. CoRR, abs/1909.09436.
ACM.
[178] Wonseok Hwang, Jinyeung Yim, Seunghyun Park,
[169] Or Honovich, Thomas Scialom, Omer Levy, and and Minjoon Seo. 2019. A comprehensive explo-
Timo Schick. 2023. Unnatural instructions: Tuning ration on wikisql with table-aware word contextual-
language models with (almost) no human labor. In ization. CoRR, abs/1902.01069.
Proceedings of the 61st Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: [179] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung,
Long Papers), ACL 2023, Toronto, Canada, July 9-14, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017.
2023, pages 14409–14428. Association for Computa- Learning a neural semantic parser from user feed-
tional Linguistics. back. In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics, ACL
[170] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, 2017, Vancouver, Canada, July 30 - August 4, Vol-
Kailong Wang, Li Li, Xiapu Luo, David Lo, John C. ume 1: Long Papers, pages 963–973. Association for
Grundy, and Haoyu Wang. 2023. Large language Computational Linguistics.
models for software engineering: A systematic litera-
ture review. CoRR, abs/2308.10620. [180] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung,
and Luke Zettlemoyer. 2016. Summarizing source
[171] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. code using a neural attention model. In Proceedings
2018. Deep code comment generation. In Proceed- of the 54th Annual Meeting of the Association for
ings of the 26th Conference on Program Comprehen- Computational Linguistics, ACL 2016, August 7-12,
sion, ICPC 2018, Gothenburg, Sweden, May 27-28, 2016, Berlin, Germany, Volume 1: Long Papers. The
2018, pages 200–210. ACM. Association for Computer Linguistics.
[172] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, [181] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung,
and Zhi Jin. 2018. Summarizing source code with and Luke Zettlemoyer. 2018. Mapping language to
code in programmatic context. In Proceedings of the [191] Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie
2018 Conference on Empirical Methods in Natural Qiu, Xiaodong Gu, and Beijun Shen. 2023. On the
Language Processing, Brussels, Belgium, October 31 evaluation of neural code translation: Taxonomy and
- November 4, 2018, pages 1643–1652. Association benchmark. CoRR, abs/2308.08961.
for Computational Linguistics.
[192] Carlos E. Jimenez, John Yang, Alexander Wet-
[182] Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pa- tig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik
sunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Narasimhan. 2023. Swe-bench: Can language
Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, models resolve real-world github issues? CoRR,
Xian Li, Brian O’Horo, Gabriel Pereyra, Jeff Wang, abs/2310.06770.
Christopher Dewan, Asli Celikyilmaz, Luke Zettle-
moyer, and Ves Stoyanov. 2022. OPT-IML: scaling [193] Harshit Joshi, José Pablo Cambronero Sánchez,
language model instruction meta learning through Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan
the lens of generalization. CoRR, abs/2212.12017. Radicek. 2023. Repair is nearly generation: Multilin-
gual program repair with llms. In Thirty-Seventh
[183] Abhinav Jangda and Gaurav Anand. 2019. Predict- AAAI Conference on Artificial Intelligence, AAAI
ing variable types in dynamically typed programming 2023, Thirty-Fifth Conference on Innovative Applica-
languages. CoRR, abs/1901.05138. tions of Artificial Intelligence, IAAI 2023, Thirteenth
[184] Kevin Jesse and Premkumar T. Devanbu. 2022. Symposium on Educational Advances in Artificial In-
Manytypes4typescript: A comprehensive typescript telligence, EAAI 2023, Washington, DC, USA, Febru-
dataset for sequence-based type inference. In 19th ary 7-14, 2023, pages 5131–5140. AAAI Press.
IEEE/ACM International Conference on Mining Soft-
ware Repositories, MSR 2022, Pittsburgh, PA, USA, [194] René Just. 2014. The major mutation framework:
May 23-24, 2022, pages 294–298. ACM. efficient and scalable mutation analysis for java. In
International Symposium on Software Testing and
[185] Kevin Jesse, Premkumar T. Devanbu, and Toufique Analysis, ISSTA ’14, San Jose, CA, USA - July 21 -
Ahmed. 2021. Learning type annotation: is big data 26, 2014, pages 433–436. ACM.
enough? In ESEC/FSE ’21: 29th ACM Joint Eu-
ropean Software Engineering Conference and Sym- [195] René Just, Darioush Jalali, and Michael D. Ernst.
posium on the Foundations of Software Engineering, 2014. Defects4j: a database of existing faults to
Athens, Greece, August 23-28, 2021, pages 1483– enable controlled testing studies for java programs.
1486. ACM. In International Symposium on Software Testing and
Analysis, ISSTA ’14, San Jose, CA, USA - July 21 -
[186] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, 26, 2014, pages 437–440. ACM.
and Stéphane Glondu. 2007. DECKARD: scalable
and accurate tree-based detection of code clones. In [196] Aditya Kanade, Petros Maniatis, Gogul Balakrish-
29th International Conference on Software Engineer- nan, and Kensen Shi. 2020. Learning and evaluating
ing (ICSE 2007), Minneapolis, MN, USA, May 20-26, contextual embedding of source code. In Proceed-
2007, pages 96–105. IEEE Computer Society. ings of the 37th International Conference on Ma-
chine Learning, ICML 2020, 13-18 July 2020, Vir-
[187] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. tual Event, volume 119 of Proceedings of Machine
CURE: code-aware neural machine translation for Learning Research, pages 5110–5121. PMLR.
automatic program repair. In 43rd IEEE/ACM Inter-
national Conference on Software Engineering, ICSE [197] Jared Kaplan, Sam McCandlish, Tom Henighan,
2021, Madrid, Spain, 22-30 May 2021, pages 1161– Tom B. Brown, Benjamin Chess, Rewon Child, Scott
1173. IEEE. Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.
[188] Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, 2020. Scaling laws for neural language models.
and Lei Lyu. 2021. Treebert: A tree-based pre- CoRR, abs/2001.08361.
trained model for programming language. In Pro- [198] Svetoslav Karaivanov, Veselin Raychev, and Mar-
ceedings of the Thirty-Seventh Conference on Un- tin T. Vechev. 2014. Phrase-based statistical transla-
certainty in Artificial Intelligence, UAI 2021, Virtual tion of programming languages. In Onward! 2014,
Event, 27-30 July 2021, volume 161 of Proceedings Proceedings of the 2014 ACM International Sympo-
of Machine Learning Research, pages 54–63. AUAI sium on New Ideas, New Paradigms, and Reflections
Press. on Programming & Software, part of SPLASH ’14,
[189] Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Portland, OR, USA, October 20-24, 2014, pages 173–
Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, 184. ACM.
Jiazhen Gu, and Michael R. Lyu. 2023. Llm-
parser: A llm-based log parsing framework. CoRR, [199] Rafael-Michael Karampatsis, Hlib Babii, Romain
abs/2310.01796. Robbes, Charles Sutton, and Andrea Janes. 2020.
Big code != big vocabulary: open-vocabulary models
[190] Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen for source code. In ICSE ’20: 42nd International
Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jiem- Conference on Software Engineering, Seoul, South
ing Zhu, and Michael R. Lyu. 2023. A large-scale Korea, 27 June - 19 July, 2020, pages 1073–1085.
benchmark for log parsing. CoRR, abs/2308.10828. ACM.
[200] Rafael-Michael Karampatsis and Charles Sutton. 2021, December 6-14, 2021, virtual, pages 14967–
2020. How often do single-statement bugs occur?: 14979.
The manysstubs4j dataset. In MSR ’20: 17th Interna-
tional Conference on Mining Software Repositories, [210] Jeremy Lacomis, Pengcheng Yin, Edward J.
Seoul, Republic of Korea, 29-30 June, 2020, pages Schwartz, Miltiadis Allamanis, Claire Le Goues, Gra-
573–577. ACM. ham Neubig, and Bogdan Vasilescu. 2019. DIRE: A
neural approach to decompiled identifier naming. In
[201] Amol Kelkar, Rohan Relan, Vaishali Bhardwaj, 34th IEEE/ACM International Conference on Auto-
Saurabh Vaichal, and Peter Relan. 2020. Bertrand- mated Software Engineering, ASE 2019, San Diego,
dr: Improving text-to-sql using a discriminative re- CA, USA, November 11-15, 2019, pages 628–639.
ranker. CoRR, abs/2002.00557. IEEE.
[202] Ahmed Khanfir, Renzo Degiovanni, Mike Pa- [211] Shuvendu K. Lahiri, Aaditya Naik, Georgios
padakis, and Yves Le Traon. 2023. Efficient muta- Sakkas, Piali Choudhury, Curtis von Veh, Madan-
tion testing via pre-trained language models. CoRR, lal Musuvathi, Jeevana Priya Inala, Chenglong Wang,
abs/2301.03543. and Jianfeng Gao. 2022. Interactive code genera-
tion via test-driven user-intent formalization. CoRR,
[203] Ahmed Khanfir, Anil Koyuncu, Mike Papadakis, abs/2208.05950.
Maxime Cordy, Tegawendé F. Bissyandé, Jacques
Klein, and Yves Le Traon. 2023. ibir: Bug-report- [212] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi
driven fault injection. ACM Trans. Softw. Eng. Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau
Methodol., 32(2):33:1–33:31. Yih, Daniel Fried, Sida I. Wang, and Tao Yu. 2023.
DS-1000: A natural and reliable benchmark for data
[204] Kisub Kim, Dongsun Kim, Tegawendé F. Bis- science code generation. In International Conference
syandé, Eunjong Choi, Li Li, Jacques Klein, and on Machine Learning, ICML 2023, 23-29 July 2023,
Yves Le Traon. 2018. Facoy: a code-to-code search Honolulu, Hawaii, USA, volume 202 of Proceedings
engine. In Proceedings of the 40th International of Machine Learning Research, pages 18319–18345.
Conference on Software Engineering, ICSE 2018, PMLR.
Gothenburg, Sweden, May 27 - June 03, 2018, pages
946–957. ACM. [213] Harsh Lal and Gaurav Pahwa. 2017. Code review
[205] Nikita Kitaev, Lukasz Kaiser, and Anselm Lev- analysis of software system using machine learning
skaya. 2020. Reformer: The efficient transformer. techniques. In 2017 11th International Conference
In 8th International Conference on Learning Repre- on Intelligent Systems and Control (ISCO), pages
sentations, ICLR 2020, Addis Ababa, Ethiopia, April 8–13.
26-30, 2020. OpenReview.net.
[214] Zhenzhong Lan, Mingda Chen, Sebastian Good-
[206] Denis Kocetkov, Raymond Li, Loubna Ben Al- man, Kevin Gimpel, Piyush Sharma, and Radu Sori-
lal, Jia Li, Chenghao Mou, Carlos Muñoz Ferran- cut. 2020. ALBERT: A lite BERT for self-supervised
dis, Yacine Jernite, Margaret Mitchell, Sean Hughes, learning of language representations. In 8th Inter-
Thomas Wolf, Dzmitry Bahdanau, Leandro von national Conference on Learning Representations,
Werra, and Harm de Vries. 2022. The stack: 3 ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
TB of permissively licensed source code. CoRR, 2020. OpenReview.net.
abs/2211.15533.
[215] Chris Lattner and Vikram S. Adve. 2004. LLVM:
[207] Sumith Kulal, Panupong Pasupat, Kartik Chan- A compilation framework for lifelong program anal-
dra, Mina Lee, Oded Padon, Alex Aiken, and Percy ysis & transformation. In 2nd IEEE / ACM Interna-
Liang. 2019. Spoc: Search-based pseudocode to tional Symposium on Code Generation and Optimiza-
code. In Advances in Neural Information Processing tion (CGO 2004), 20-24 March 2004, San Jose, CA,
Systems 32: Annual Conference on Neural Informa- USA, pages 75–88. IEEE Computer Society.
tion Processing Systems 2019, NeurIPS 2019, De-
cember 8-14, 2019, Vancouver, BC, Canada, pages [216] Hugo Laurençon, Lucile Saulnier, Thomas Wang,
11883–11894. Christopher Akiki, Albert Villanova del Moral,
Teven Le Scao, Leandro von Werra, Chenghao Mou,
[208] Ayush Kumar, Parth Nagarkar, Prabhav Nalhe, and Eduardo González Ponferrada, Huu Nguyen, Jörg
Sanjeev Vijayakumar. 2022. Deep learning driven Frohberg, Mario Sasko, Quentin Lhoest, Angelina
natural languages text to SQL query conversion: A McMillan-Major, Gérard Dupont, Stella Biderman,
survey. CoRR, abs/2208.04415. Anna Rogers, Loubna Ben Allal, Francesco De Toni,
Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor,
[209] Marie-Anne Lachaux, Baptiste Rozière, Marc Maraim Masoud, Pierre Colombo, Javier de la Rosa,
Szafraniec, and Guillaume Lample. 2021. DOBF: Paulo Villegas, Tristan Thrush, Shayne Longpre, Se-
A deobfuscation pre-training objective for program- bastian Nagel, Leon Weber, Manuel Muñoz, Jian
ming languages. In Advances in Neural Information Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Al-
Processing Systems 34: Annual Conference on Neu- mubarak, Minh Chien Vu, Itziar Gonzalez-Dios,
ral Information Processing Systems 2021, NeurIPS Aitor Soroa, Kyle Lo, Manan Dey, Pedro Ortiz
Suarez, Aaron Gokaslan, Shamik Bose, David Ife- [225] Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
oluwa Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas jan Ghazvininejad, Abdelrahman Mohamed, Omer
Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020.
Margaret Mitchell, Alexandra Sasha Luccioni, and BART: denoising sequence-to-sequence pre-training
Yacine Jernite. 2022. The bigscience ROOTS corpus: for natural language generation, translation, and com-
A 1.6tb composite multilingual dataset. In NeurIPS. prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
[217] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, ACL 2020, Online, July 5-10, 2020, pages 7871–7880.
Silvio Savarese, and Steven Chu-Hong Hoi. 2022. Association for Computational Linguistics.
Coderl: Mastering code generation through pre-
trained models and deep reinforcement learning. In [226] Aitor Lewkowycz, Anders Andreassen, David Do-
NeurIPS. han, Ethan Dyer, Henryk Michalewski, Vinay V. Ra-
masesh, Ambrose Slone, Cem Anil, Imanol Schlag,
[218] Van-Hoang Le and Hongyu Zhang. 2021. Log- Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur,
based anomaly detection without log parsing. In Guy Gur-Ari, and Vedant Misra. 2022. Solving quan-
36th IEEE/ACM International Conference on Auto- titative reasoning problems with language models. In
mated Software Engineering, ASE 2021, Melbourne, NeurIPS.
Australia, November 15-19, 2021, pages 492–504.
IEEE. [227] Fei Li and H. V. Jagadish. 2014. Constructing an
interactive natural language interface for relational
[219] Van-Hoang Le and Hongyu Zhang. 2023. An databases. Proc. VLDB Endow., 8(1):73–84.
evaluation of log parsing with chatgpt. CoRR,
abs/2306.01590. [228] Haochen Li, Chunyan Miao, Cyril Leung, Yanx-
ian Huang, Yuan Huang, Hongyu Zhang, and Yanlin
[220] Van-Hoang Le and Hongyu Zhang. 2023. Log Wang. 2022. Exploring representation-level augmen-
parsing with prompt-based few-shot learning. In 45th tation for code search. In Proceedings of the 2022
IEEE/ACM International Conference on Software Conference on Empirical Methods in Natural Lan-
Engineering, ICSE 2023, Melbourne, Australia, May guage Processing, EMNLP 2022, Abu Dhabi, United
14-20, 2023, pages 2438–2449. IEEE. Arab Emirates, December 7-11, 2022, pages 4924–
4936. Association for Computational Linguistics.
[221] Chia-Hsuan Lee, Oleksandr Polozov, and Matthew
Richardson. 2021. Kaggledbqa: Realistic evaluation [229] Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang,
of text-to-sql parsers. In Proceedings of the 59th An- Hai Zhao, Yeyun Gong, Nan Duan, and Timothy
nual Meeting of the Association for Computational Baldwin. 2023. CMMLU: measuring massive mul-
Linguistics and the 11th International Joint Confer- titask language understanding in chinese. CoRR,
ence on Natural Language Processing, ACL/IJCNLP abs/2306.09212.
2021, (Volume 1: Long Papers), Virtual Event, Au- [230] Heng-Yi Li, Shu-Ting Shi, Ferdian Thung, Xuan
gust 1-6, 2021, pages 2261–2273. Association for Huo, Bowen Xu, Ming Li, and David Lo. 2019.
Computational Linguistics. Deepreview: Automatic code review using deep
multi-instance learning. In Advances in Knowledge
[222] Caroline Lemieux, Jeevana Priya Inala, Shu-
Discovery and Data Mining - 23rd Pacific-Asia Con-
vendu K. Lahiri, and Siddhartha Sen. 2023. Co-
ference, PAKDD 2019, Macau, China, April 14-17,
damosa: Escaping coverage plateaus in test genera-
2019, Proceedings, Part II, volume 11440 of Lecture
tion with pre-trained large language models. In 45th
Notes in Computer Science, pages 318–330. Springer.
IEEE/ACM International Conference on Software En-
gineering, ICSE 2023, Melbourne, Australia, May [231] Hongyu Li, Seohyun Kim, and Satish Chandra.
14-20, 2023, pages 919–931. IEEE. 2019. Neural code search evaluation dataset. CoRR,
abs/1908.09804.
[223] Yichong Leng, Xu Tan, Tao Qin, Xiang-Yang Li,
and Tie-Yan Liu. 2019. Unsupervised pivot transla- [232] Jia Li, Chongyang Tao, Zhi Jin, Fang Liu, Jia Allen
tion for distant languages. In Proceedings of the 57th Li, and Ge Li. 2023. ZC3: zero-shot cross-language
Conference of the Association for Computational Lin- code clone detection. CoRR, abs/2308.13754.
guistics, ACL 2019, Florence, Italy, July 28- August
2, 2019, Volume 1: Long Papers, pages 175–183. [233] Jian Li, Yue Wang, Michael R. Lyu, and Irwin
Association for Computational Linguistics. King. 2018. Code completion with neural attention
and pointer networks. In Proceedings of the Twenty-
[224] Brian Lester, Rami Al-Rfou, and Noah Constant. Seventh International Joint Conference on Artificial
2021. The power of scale for parameter-efficient Intelligence, IJCAI 2018, July 13-19, 2018, Stock-
prompt tuning. In Proceedings of the 2021 Confer- holm, Sweden, pages 4159–4165. ijcai.org.
ence on Empirical Methods in Natural Language Pro-
cessing, EMNLP 2021, Virtual Event / Punta Cana, [234] Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian
Dominican Republic, 7-11 November, 2021, pages Luo, Zihan Hua, Geng Liang, and Chun Zuo. 2022.
3045–3059. Association for Computational Linguis- AUGER: automatically generating review comments
tics. with pre-training models. In Proceedings of the 30th
ACM Joint European Software Engineering Confer- Engineering, ICSE 2021, Madrid, Spain, 22-30 May
ence and Symposium on the Foundations of Software 2021, pages 574–586. IEEE.
Engineering, ESEC/FSE 2022, Singapore, Singapore,
November 14-18, 2022, pages 1009–1021. ACM. [241] Yi Li, Shaohua Wang, Tien N. Nguyen, and
Son Van Nguyen. 2019. Improving bug detection
[235] Raymond Li, Loubna Ben Allal, Yangtian Zi, via context-based code representation learning and
Niklas Muennighoff, Denis Kocetkov, Chenghao attention-based neural networks. Proc. ACM Pro-
Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny gram. Lang., 3(OOPSLA):162:1–162:30.
Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue
Zhuo, Thomas Wang, Olivier Dehaene, Mishig [242] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Al-
Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh lie Del Giorno, Suriya Gunasekar, and Yin Tat Lee.
Shliazhko, Nicolas Gontier, Nicholas Meade, Armel 2023. Textbooks are all you need II: phi-1.5 technical
Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, report. CoRR, abs/2309.05463.
Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov,
Zhiruo Wang, Rudra Murthy V, Jason Stillerman, [243] Yujia Li, David Choi, Junyoung Chung, Nate
Siva Sankalp Patel, Dmitry Abulkhanov, Marco Kushman, Julian Schrittwieser, Rémi Leblond, Tom
Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa- Eccles, James Keeling, Felix Gimeno, Agustin Dal
Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Lago, Thomas Hubert, Peter Choy, Cyprien de Mas-
Singh, Sasha Luccioni, Paulo Villegas, Maxim Ku- son d’Autume, Igor Babuschkin, Xinyun Chen, Po-
nakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Sen Huang, Johannes Welbl, Sven Gowal, Alexey
Nadav Timor, Jennifer Ding, Claire Schlesinger, Hai- Cherepanov, James Molloy, Daniel J. Mankowitz,
ley Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Esme Sutherland Robson, Pushmeet Kohli, Nando
Alex Gu, Jennifer Robinson, Carolyn Jane Ander- de Freitas, Koray Kavukcuoglu, and Oriol Vinyals.
son, Brendan Dolan-Gavitt, Danish Contractor, Siva 2022. Competition-level code generation with alpha-
Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jer- code. Science, 378(6624):1092–1097.
nite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas
Wolf, Arjun Guha, Leandro von Werra, and Harm [244] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou,
de Vries. 2023. Starcoder: may the source be with Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong.
you! CoRR, abs/2305.06161. 2018. Vuldeepecker: A deep learning-based system
for vulnerability detection. In 25th Annual Network
[236] Xiang Li, John Thickstun, Ishaan Gulrajani, and Distributed System Security Symposium, NDSS
Percy Liang, and Tatsunori B. Hashimoto. 2022. 2018, San Diego, California, USA, February 18-21,
Diffusion-lm improves controllable text generation. 2018. The Internet Society.
In NeurIPS.
[245] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan,
[237] Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Shailesh Jannu, Grant Jenks, Deep Majumder, Jared
Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel
Jiang, Weizhu Chen, and Nan Duan. 2022. Codere- Sundaresan. 2022. Automating code review activi-
triever: A large scale contrastive pre-training method ties by large-scale pre-training. In Proceedings of
for code search. In Proceedings of the 2022 Con- the 30th ACM Joint European Software Engineering
ference on Empirical Methods in Natural Language Conference and Symposium on the Foundations of
Processing, EMNLP 2022, Abu Dhabi, United Arab Software Engineering, ESEC/FSE 2022, Singapore,
Emirates, December 7-11, 2022, pages 2898–2910. Singapore, November 14-18, 2022, pages 1035–1047.
Association for Computational Linguistics. ACM.
[238] Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Ye- [246] Derrick Lin, James Koppel, Angela Chen, and
long Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, Armando Solar-Lezama. 2017. Quixbugs: a multi-
and Nan Duan. 2022. Soft-labeled contrastive pre- lingual program repair benchmark set based on the
training for function-level code representation. In quixey challenge. In Proceedings Companion of
Findings of the Association for Computational Lin- the 2017 ACM SIGPLAN International Conference
guistics: EMNLP 2022, Abu Dhabi, United Arab on Systems, Programming, Languages, and Applica-
Emirates, December 7-11, 2022, pages 118–129. As- tions: Software for Humanity, SPLASH 2017, Van-
sociation for Computational Linguistics. couver, BC, Canada, October 23 - 27, 2017, pages
55–56. ACM.
[239] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020.
Dlfix: context-based code transformation learning [247] Guanjun Lin, Wei Xiao, Jun Zhang, and Yang Xi-
for automated program repair. In ICSE ’20: 42nd ang. 2019. Deep learning-based vulnerable function
International Conference on Software Engineering, detection: A benchmark. In Information and Com-
Seoul, South Korea, 27 June - 19 July, 2020, pages munications Security - 21st International Conference,
602–614. ACM. ICICS 2019, Beijing, China, December 15-17, 2019,
Revised Selected Papers, volume 11999 of Lecture
[240] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2021. Notes in Computer Science, pages 219–232. Springer.
A context-based automated approach for method
name consistency checking and suggestion. In 43rd [248] Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong
IEEE/ACM International Conference on Software Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu
Chen. 2023. Text generation with diffusion language [258] Xiaoyu Liu, Jinu Jang, Neel Sundaresan, Mil-
models: A pre-training approach with continuous tiadis Allamanis, and Alexey Svyatkovskiy. 2022.
paragraph denoise. In International Conference on Adaptivepaste: Code adaptation through learn-
Machine Learning, ICML 2023, 23-29 July 2023, ing semantics-aware variable usage representations.
Honolulu, Hawaii, USA, volume 202 of Proceedings CoRR, abs/2205.11023.
of Machine Learning Research, pages 21051–21064.
PMLR. [259] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
[249] Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
Pan, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming 2019. Roberta: A robustly optimized BERT pre-
Wu, and Shouling Ji. 2021. Deep graph matching and training approach. CoRR, abs/1907.11692.
searching for semantic code retrieval. ACM Trans.
Knowl. Discov. Data, 15(5):88:1–88:21. [260] Yudong Liu, Xu Zhang, Shilin He, Hongyu Zhang,
Liqun Li, Yu Kang, Yong Xu, Minghua Ma, Qingwei
[250] Bingchang Liu, Chaoyu Chen, Cong Liao, Lin, Yingnong Dang, Saravan Rajmohan, and Dong-
Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, mei Zhang. 2022. Uniparser: A unified log parser
Dajun Chen, Min Shen, Hailian Zhou, Hang Yu, and for heterogeneous log data. In WWW ’22: The ACM
Jianguo Li. 2023. Mftcoder: Boosting code llms with Web Conference 2022, Virtual Event, Lyon, France,
multitask fine-tuning. CoRR, abs/2311.02303. April 25 - 29, 2022, pages 1893–1901. ACM.
[251] Fang Liu, Ge Li, Zhiyi Fu, Shuai Lu, Yiyang Hao, [261] Fan Long and Martin C. Rinard. 2016. Auto-
and Zhi Jin. 2022. Learning to recommend method matic patch generation by learning correct code. In
names with global context. In 44th IEEE/ACM 44th Proceedings of the 43rd Annual ACM SIGPLAN-
International Conference on Software Engineering, SIGACT Symposium on Principles of Programming
ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, Languages, POPL 2016, St. Petersburg, FL, USA,
pages 1294–1306. ACM. January 20 - 22, 2016, pages 298–312. ACM.
[252] Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. [262] Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun
Multi-task learning based pre-trained language model Zuo. 2023. Llama-reviewer: Advancing code re-
for code completion. In 35th IEEE/ACM Interna- view automation with large language models through
tional Conference on Automated Software Engineer- parameter-efficient fine-tuning (practical experience
ing, ASE 2020, Melbourne, Australia, September 21- report). CoRR, abs/2308.11148.
25, 2020, pages 473–485. IEEE.
[253] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, [263] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang,
Xiao Han, Wei Yang, and Deheng Ye. 2023. RLTF: Alexey Svyatkovskiy, Ambrosio Blanco, Colin B.
reinforcement learning from unit test feedback. Clement, Dawn Drain, Daxin Jiang, Duyu Tang,
CoRR, abs/2307.04349. Ge Li, Lidong Zhou, Linjun Shou, Long Zhou,
Michele Tufano, Ming Gong, Ming Zhou, Nan Duan,
[254] Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and
Jinyang Li, Aurojit Panda, and Lingming Zhang. Shujie Liu. 2021. Codexglue: A machine learning
2023. Nnsmith: Generating diverse and valid test benchmark dataset for code understanding and gen-
cases for deep learning compilers. In Proceedings eration. In Proceedings of the Neural Information
of the 28th ACM International Conference on Archi- Processing Systems Track on Datasets and Bench-
tectural Support for Programming Languages and marks 1, NeurIPS Datasets and Benchmarks 2021,
Operating Systems, Volume 2, ASPLOS 2023, Vancou- December 2021, virtual.
ver, BC, Canada, March 25-29, 2023, pages 530–543.
ACM. [264] Sifei Luan, Di Yang, Celeste Barnaby, Koushik
Sen, and Satish Chandra. 2019. Aroma: code rec-
[255] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, ommendation via structural code search. Proc. ACM
and Lingming Zhang. 2023. Is your code generated Program. Lang., 3(OOPSLA):152:1–152:28.
by chatgpt really correct? rigorous evaluation of
large language models for code generation. CoRR, [265] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun,
abs/2305.01210. Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma,
Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder:
[256] Shangqing Liu, Xiaofei Xie, Jing Kai Siow, Lei Empowering code large language models with evol-
Ma, Guozhu Meng, and Yang Liu. 2023. Graph- instruct. CoRR, abs/2306.08568.
searchnet: Enhancing gnns via capturing global de-
pendencies for semantic code search. IEEE Trans. [266] Thibaud Lutellier, Hung Viet Pham, Lawrence
Software Eng., 49(4):2839–2855. Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. Co-
conut: combining context-aware neural translation
[257] Tianyang Liu, Canwen Xu, and Julian J. McAuley. models using ensemble for program repair. In ISSTA
2023. Repobench: Benchmarking repository- ’20: 29th ACM SIGSOFT International Symposium
level code auto-completion systems. CoRR, on Software Testing and Analysis, Virtual Event, USA,
abs/2306.03091. July 18-22, 2020, pages 101–114. ACM.
[267] Parvez Mahbub, Naz Zarreen Oishie, and Swayam Singh, Xiangru Tang, Leandro von Werra,
S. M. Rafizul Haque. 2022. Authorship identifica- and Shayne Longpre. 2023. Octopack: Instruc-
tion of source code segments written by multiple tion tuning code large language models. CoRR,
authors using stacking ensemble method. CoRR, abs/2308.07124.
abs/2212.05610.
[277] Khurram Murad, Syed Noor-ul-Hassan Shirazi,
[268] Rabee Sohail Malik, Jibesh Patra, and Michael Yousaf Bin Zikria, and Nassar Ikram. 2010. Evading
Pradel. 2019. Nl2type: inferring javascript func- virus detection using code obfuscation. In Future
tion types from natural language information. In Generation Information Technology - Second Inter-
Proceedings of the 41st International Conference on national Conference, FGIT 2010, Jeju Island, Korea,
Software Engineering, ICSE 2019, Montreal, QC, December 13-15, 2010. Proceedings, volume 6485 of
Canada, May 25-31, 2019, pages 304–315. IEEE / Lecture Notes in Computer Science, pages 394–401.
ACM. Springer.
[269] Antonio Mastropaolo, Simone Scalabrino, Nathan [278] Kawser Wazed Nafi, Tonny Shekha Kar, Banani
Cooper, David Nader-Palacio, Denys Poshyvanyk, Roy, Chanchal K. Roy, and Kevin A. Schneider. 2019.
Rocco Oliveto, and Gabriele Bavota. 2021. Study- CLCDSA: cross language code clone detection us-
ing the usage of text-to-text transfer transformer to ing syntactical features and API documentation. In
support code-related tasks. In 43rd IEEE/ACM Inter- 34th IEEE/ACM International Conference on Auto-
national Conference on Software Engineering, ICSE mated Software Engineering, ASE 2019, San Diego,
2021, Madrid, Spain, 22-30 May 2021, pages 336– CA, USA, November 11-15, 2019, pages 1026–1037.
347. IEEE. IEEE.
[270] Phil McMinn. 2011. Search-based software test- [279] Tasuku Nakagawa, Yoshiki Higo, and Shinji
ing: Past, present and future. In Fourth IEEE Interna- Kusumoto. 2021. NIL: large-scale detection of large-
tional Conference on Software Testing, Verification variance clones. In ESEC/FSE ’21: 29th ACM Joint
and Validation, ICST 2012, Berlin, Germany, 21-25 European Software Engineering Conference and Sym-
March, 2011, Workshop Proceedings, pages 153–163. posium on the Foundations of Software Engineering,
IEEE Computer Society. Athens, Greece, August 23-28, 2021, pages 830–841.
ACM.
[271] Amir M. Mir, Evaldas Latoskinas, and Georgios
Gousios. 2021. Manytypes4py: A benchmark python [280] Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu
dataset for machine learning-based type inference. In Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and
18th IEEE/ACM International Conference on Mining Dragomir Radev. 2023. Enhancing few-shot text-to-
Software Repositories, MSR 2021, Madrid, Spain, sql capabilities of large language models: A study on
May 17-19, 2021, pages 585–589. IEEE. prompt design strategies. CoRR, abs/2305.12586.
[272] Amir M. Mir, Evaldas Latoskinas, Sebastian [281] Anh Tuan Nguyen, Tung Thanh Nguyen, and
Proksch, and Georgios Gousios. 2022. Type4py: Tien N. Nguyen. 2013. Lexical statistical machine
Practical deep similarity learning-based type infer- translation for language migration. In Joint Meeting
ence for python. In 44th IEEE/ACM 44th Interna- of the European Software Engineering Conference
tional Conference on Software Engineering, ICSE and the ACM SIGSOFT Symposium on the Founda-
2022, Pittsburgh, PA, USA, May 25-27, 2022, pages tions of Software Engineering, ESEC/FSE’13, Saint
2241–2252. ACM. Petersburg, Russian Federation, August 18-26, 2013,
pages 651–654. ACM.
[273] Facundo Molina, Marcelo d’Amorim, and
Nazareno Aguirre. 2022. Fuzzing class specifica- [282] Anh Tuan Nguyen, Tung Thanh Nguyen, and
tions. In 44th IEEE/ACM 44th International Con- Tien N. Nguyen. 2015. Divide-and-conquer ap-
ference on Software Engineering, ICSE 2022, Pitts- proach for multi-phase statistical migration for source
burgh, PA, USA, May 25-27, 2022, pages 1008–1020. code (T). In 30th IEEE/ACM International Confer-
ACM. ence on Automated Software Engineering, ASE 2015,
Lincoln, NE, USA, November 9-13, 2015, pages 585–
[274] Martin Monperrus. 2018. Automatic software 596. IEEE Computer Society.
repair: A bibliography. ACM Comput. Surv.,
51(1):17:1–17:24. [283] Son Nguyen, Hung Phan, Trinh Le, and Tien N.
Nguyen. 2020. Suggesting natural method names
[275] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi to check name consistencies. In ICSE ’20: 42nd
Jin. 2016. Convolutional neural networks over tree International Conference on Software Engineering,
structures for programming language processing. In Seoul, South Korea, 27 June - 19 July, 2020, pages
Proceedings of the Thirtieth AAAI Conference on Ar- 1372–1384. ACM.
tificial Intelligence, February 12-17, 2016, Phoenix,
Arizona, USA, pages 1287–1293. AAAI Press. [284] Trong Duc Nguyen, Anh Tuan Nguyen, and
Tien N. Nguyen. 2016. Mapping API elements for
[276] Niklas Muennighoff, Qian Liu, Armel Zebaze, code migration with vector representations. In Pro-
Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, ceedings of the 38th International Conference on
Software Engineering, ICSE 2016, Austin, TX, USA, artificial and real faults in mutation testing studies.
May 14-22, 2016 - Companion Volume, pages 756– CoRR, abs/2112.14508.
758. ACM.
[293] Milos Ojdanic, Ahmed Khanfir, Aayush Garg,
[285] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Renzo Degiovanni, Mike Papadakis, and Yves Le
Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Traon. 2023. On comparing mutation testing
Caiming Xiong. 2023. Codegen: An open large lan- tools through learning-based mutant selection. In
guage model for code with multi-turn program syn- IEEE/ACM International Conference on Automation
thesis. In The Eleventh International Conference of Software Test, AST 2023, Melbourne, Australia,
on Learning Representations, ICLR 2023, Kigali, May 15-16, 2023, pages 35–46. IEEE.
Rwanda, May 1-5, 2023. OpenReview.net.
[294] OpenAI. 2023. GPT-4 technical report. CoRR,
[286] Georgios Nikitopoulos, Konstantina Dritsa, Panos abs/2303.08774.
Louridas, and Dimitris Mitropoulos. 2021. Crossvul:
a cross-language vulnerability dataset with commit [295] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo
data. In ESEC/FSE ’21: 29th ACM Joint Euro- Almeida, Carroll L. Wainwright, Pamela Mishkin,
pean Software Engineering Conference and Sympo- Chong Zhang, Sandhini Agarwal, Katarina Slama,
sium on the Foundations of Software Engineering, Alex Ray, John Schulman, Jacob Hilton, Fraser Kel-
Athens, Greece, August 23-28, 2021, pages 1565– ton, Luke Miller, Maddie Simens, Amanda Askell,
1569. ACM. Peter Welinder, Paul F. Christiano, Jan Leike, and
Ryan Lowe. 2022. Training language models to fol-
[287] Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao low instructions with human feedback. In NeurIPS.
Chen, Jidong Ge, and Bin Luo. 2023. An empirical
comparison of pre-trained models of source code. In [296] Carlos Pacheco and Michael D. Ernst. 2007. Ran-
45th IEEE/ACM International Conference on Soft- doop: feedback-directed random testing for java. In
ware Engineering, ICSE 2023, Melbourne, Australia, Companion to the 22nd Annual ACM SIGPLAN Con-
May 14-20, 2023, pages 2136–2148. IEEE. ference on Object-Oriented Programming, Systems,
Languages, and Applications, OOPSLA 2007, Octo-
[288] Changan Niu, Chuanyi Li, Vincent Ng, Jidong ber 21-25, 2007, Montreal, Quebec, Canada, pages
Ge, Liguo Huang, and Bin Luo. 2022. Spt- 815–816. ACM.
code: Sequence-to-sequence pre-training for learning
[297] Rangeet Pan, Ali Reza Ibrahimzada, Rahul Kr-
source code representations. In 44th IEEE/ACM 44th
ishna, Divya Sankar, Lambert Pouguem Wassi,
International Conference on Software Engineering,
Michele Merler, Boris Sobolev, Raju Pavuluri,
ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022,
Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. Un-
pages 1–13. ACM.
derstanding the effectiveness of large language mod-
[289] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, els in code translation. CoRR, abs/2308.03109.
Hideaki Hata, Sakriani Sakti, Tomoki Toda, and [298] Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gor-
Satoshi Nakamura. 2015. Learning to generate don, and Charles Sutton. 2020. Opttyper: Probabilis-
pseudo-code from source code using statistical ma- tic type inference by optimising logical and natural
chine translation (T). In 30th IEEE/ACM Interna- constraints. CoRR, abs/2004.00348.
tional Conference on Automated Software Engineer-
ing, ASE 2015, Lincoln, NE, USA, November 9-13, [299] Annibale Panichella, Fitsum Meshesha Kifetew,
2015, pages 574–584. IEEE Computer Society. and Paolo Tonella. 2018. Automated test case gen-
eration as a many-objective optimisation problem
[290] Augustus Odena, Catherine Olsson, David G. An- with dynamic selection of the targets. IEEE Trans.
dersen, and Ian J. Goodfellow. 2019. Tensorfuzz: Software Eng., 44(2):122–158.
Debugging neural networks with coverage-guided
fuzzing. In Proceedings of the 36th International [300] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue
Conference on Machine Learning, ICML 2019, 9-15 Jia, Yves Le Traon, and Mark Harman. 2019. Chap-
June 2019, Long Beach, California, USA, volume 97 ter six - mutation testing advances: An analysis and
of Proceedings of Machine Learning Research, pages survey. Adv. Comput., 112:275–378.
4901–4911. PMLR.
[301] Kishore Papineni, Salim Roukos, Todd Ward, and
[291] Wonseok Oh and Hakjoo Oh. 2022. Pyter: ef- Wei-Jing Zhu. 2002. Bleu: a method for automatic
fective program repair for python type errors. In evaluation of machine translation. In Proceedings of
Proceedings of the 30th ACM Joint European Soft- the 40th Annual Meeting of the Association for Com-
ware Engineering Conference and Symposium on putational Linguistics, July 6-12, 2002, Philadelphia,
the Foundations of Software Engineering, ESEC/FSE PA, USA, pages 311–318. ACL.
2022, Singapore, Singapore, November 14-18, 2022,
pages 922–934. ACM. [302] Md. Rizwan Parvez, Saikat Chakraborty,
Baishakhi Ray, and Kai-Wei Chang. 2018. Building
[292] Milos Ojdanic, Aayush Garg, Ahmed Khanfir, language models for text with named entities. In
Renzo Degiovanni, Mike Papadakis, and Yves Le Proceedings of the 56th Annual Meeting of the As-
Traon. 2021. Syntactic vs. semantic similarity of sociation for Computational Linguistics, ACL 2018,
Melbourne, Australia, July 15-20, 2018, Volume 16th International Conference on Mining Software
1: Long Papers, pages 2373–2383. Association for Repositories, MSR 2019, 26-27 May 2019, Montreal,
Computational Linguistics. Canada, pages 383–387. IEEE / ACM.
[303] Adam Paszke, Sam Gross, Francisco Massa, Adam [312] Michael Pradel, Georgios Gousios, Jason Liu, and
Lerer, James Bradbury, Gregory Chanan, Trevor Satish Chandra. 2020. Typewriter: neural type pre-
Killeen, Zeming Lin, Natalia Gimelshein, Luca diction with search-based validation. In ESEC/FSE
Antiga, Alban Desmaison, Andreas Köpf, Edward Z. ’20: 28th ACM Joint European Software Engineering
Yang, Zachary DeVito, Martin Raison, Alykhan Te- Conference and Symposium on the Foundations of
jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Software Engineering, Virtual Event, USA, November
Junjie Bai, and Soumith Chintala. 2019. Pytorch: An 8-13, 2020, pages 209–220. ACM.
imperative style, high-performance deep learning li-
brary. In Advances in Neural Information Processing [313] Michael Pradel, Parker Schuh, and Koushik Sen.
Systems 32: Annual Conference on Neural Informa- 2015. Typedevil: Dynamic type inconsistency analy-
tion Processing Systems 2019, NeurIPS 2019, De- sis for javascript. In 37th IEEE/ACM International
cember 8-14, 2019, Vancouver, BC, Canada, pages Conference on Software Engineering, ICSE 2015,
8024–8035. Florence, Italy, May 16-24, 2015, Volume 1, pages
314–324. IEEE Computer Society.
[304] Rishov Paul, Md. Mohib Hossain, Masum Hasan,
and Anindya Iqbal. 2023. Automated program repair
[314] Michael Pradel and Koushik Sen. 2018. Deepbugs:
based on code review: How do pre-trained trans-
a learning approach to name-based bug detection.
former models perform? CoRR, abs/2304.07840.
Proc. ACM Program. Lang., 2(OOPSLA):147:1–
[305] Hammond Pearce, Baleegh Ahmad, Benjamin 147:25.
Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022.
Asleep at the keyboard? assessing the security of [315] Ofir Press, Noah A. Smith, and Mike Lewis. 2022.
github copilot’s code contributions. In 43rd IEEE Train short, test long: Attention with linear biases
Symposium on Security and Privacy, SP 2022, San enables input length extrapolation. In The Tenth In-
Francisco, CA, USA, May 22-26, 2022, pages 754– ternational Conference on Learning Representations,
768. IEEE. ICLR 2022, Virtual Event, April 25-29, 2022. Open-
Review.net.
[306] Guilherme Penedo, Quentin Malartic, Daniel
Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, [316] Ruchir Puri, David S. Kung, Geert Janssen, Wei
Hamza Alobeidli, Baptiste Pannier, Ebtesam Al- Zhang, Giacomo Domeniconi, Vladimir Zolotov, Ju-
mazrouei, and Julien Launay. 2023. The refined- lian Dolby, Jie Chen, Mihir R. Choudhury, Lindsey
web dataset for falcon LLM: outperforming curated Decker, Veronika Thost, Luca Buratti, Saurabh Pujar,
corpora with web data, and web data only. CoRR, Shyam Ramji, Ulrich Finkler, Susan Malaika, and
abs/2306.01116. Frederick Reiss. 2021. Codenet: A large-scale AI for
code dataset for learning a diversity of coding tasks.
[307] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and In Proceedings of the Neural Information Process-
Enrico Shippole. 2023. Yarn: Efficient context win- ing Systems Track on Datasets and Benchmarks 1,
dow extension of large language models. CoRR, NeurIPS Datasets and Benchmarks 2021, December
abs/2309.00071. 2021, virtual.
[308] Yun Peng, Cuiyun Gao, Zongjie Li, Bowei Gao, [317] Chen Qian, Xin Cong, Cheng Yang, Weize Chen,
David Lo, Qirun Zhang, and Michael R. Lyu. 2022. Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong
Static inference meets deep learning: A hybrid type Sun. 2023. Communicative agents for software de-
inference approach for python. In 44th IEEE/ACM velopment. CoRR, abs/2307.07924.
44th International Conference on Software Engineer-
ing, ICSE 2022, Pittsburgh, PA, USA, May 25-27, [318] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li,
2022, pages 2019–2030. ACM. Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng
[309] Yun Peng, Shuzheng Gao, Cuiyun Gao, Yintong Kong, and Yiran Zhong. 2022. cosformer: Rethink-
Huo, and Michael R. Lyu. 2023. Domain knowledge ing softmax in attention. In The Tenth International
matters: Improving prompts with fix templates for Conference on Learning Representations, ICLR 2022,
repairing python type errors. CoRR, abs/2306.01394. Virtual Event, April 25-29, 2022. OpenReview.net.
[310] Yun Peng, Chaozheng Wang, Wenxuan Wang, [319] Alec Radford, Karthik Narasimhan, Tim Salimans,
Cuiyun Gao, and Michael R. Lyu. 2023. Generative and Ilya Sutskever. 2018. Improving language under-
type inference for python. CoRR, abs/2307.09163. standing by generative pre-training.
[311] Serena Elisa Ponta, Henrik Plate, Antonino Sa- [320] Alec Radford, Jeffrey Wu, Rewon Child, David
betta, Michele Bezzi, and Cédric Dangremont. 2019. Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Lan-
A manually-curated dataset of fixes to vulnerabili- guage models are unsupervised multitask learners.
ties of open-source software. In Proceedings of the OpenAI blog, 1(8):9.
[321] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, SIGPLAN-SIGACT Symposium on Principles of Pro-
Katie Millican, Jordan Hoffmann, H. Francis Song, gramming Languages, POPL 2015, Mumbai, India,
John Aslanides, Sarah Henderson, Roman Ring, January 15-17, 2015, pages 111–124. ACM.
Susannah Young, Eliza Rutherford, Tom Henni-
gan, Jacob Menick, Albin Cassirer, Richard Pow- [327] Veselin Raychev, Martin T. Vechev, and Eran Ya-
ell, George van den Driessche, Lisa Anne Hendricks, hav. 2014. Code completion with statistical language
Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Jo- models. In ACM SIGPLAN Conference on Program-
hannes Welbl, Sumanth Dathathri, Saffron Huang, ming Language Design and Implementation, PLDI
Jonathan Uesato, John Mellor, Irina Higgins, Antonia ’14, Edinburgh, United Kingdom - June 09 - 11, 2014,
Creswell, Nat McAleese, Amy Wu, Erich Elsen, Sid- pages 419–428. ACM.
dhant M. Jayakumar, Elena Buchatskaya, David Bud-
den, Esme Sutherland, Karen Simonyan, Michela Pa- [328] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shu-
ganini, Laurent Sifre, Lena Martens, Xiang Lorraine jie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou,
Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Ambrosio Blanco, and Shuai Ma. 2020. Codebleu:
Gribovskaya, Domenic Donato, Angeliki Lazaridou, a method for automatic evaluation of code synthesis.
Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsim- CoRR, abs/2009.10297.
poukelli, Nikolai Grigorev, Doug Fritz, Thibault Sot-
tiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, [329] Niklas Risse and Marcel Böhme. 2023. Limits of
Daniel Toyama, Cyprien de Masson d’Autume, Yujia machine learning for automatic vulnerability detec-
Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, tion. CoRR, abs/2306.17193.
Aidan Clark, Diego de Las Casas, Aurelia Guy,
[330] Subhro Roy, Sam Thomson, Tongfei Chen,
Chris Jones, James Bradbury, Matthew J. Johnson,
Richard Shin, Adam Pauls, Jason Eisner, and Ben-
Blake A. Hechtman, Laura Weidinger, Iason Gabriel,
jamin Van Durme. 2022. Benchclamp: A benchmark
William S. Isaac, Edward Lockhart, Simon Osindero,
for evaluating language models on semantic parsing.
Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem
CoRR, abs/2206.10668.
Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hass-
abis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. [331] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle,
Scaling language models: Methods, analysis & in- Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi
sights from training gopher. CoRR, abs/2112.11446. Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom
[322] Colin Raffel, Noam Shazeer, Adam Roberts, Kozhevnikov, Ivan Evtimov, Joanna Bitton, Man-
Katherine Lee, Sharan Narang, Michael Matena, ish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori,
Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Explor- Wenhan Xiong, Alexandre Défossez, Jade Copet,
ing the limits of transfer learning with a unified text- Faisal Azhar, Hugo Touvron, Louis Martin, Nico-
to-text transformer. J. Mach. Learn. Res., 21:140:1– las Usunier, Thomas Scialom, and Gabriel Synnaeve.
140:67. 2023. Code llama: Open foundation models for code.
CoRR, abs/2308.12950.
[323] Baishakhi Ray, Vincent J. Hellendoorn, Saheel
Godhane, Zhaopeng Tu, Alberto Bacchelli, and [332] Baptiste Rozière, Marie-Anne Lachaux, Lowik
Premkumar T. Devanbu. 2016. On the "naturalness" Chanussot, and Guillaume Lample. 2020. Unsuper-
of buggy code. In Proceedings of the 38th Interna- vised translation of programming languages. In Ad-
tional Conference on Software Engineering, ICSE vances in Neural Information Processing Systems 33:
2016, Austin, TX, USA, May 14-22, 2016, pages 428– Annual Conference on Neural Information Process-
439. ACM. ing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual.
[324] Veselin Raychev, Pavol Bielik, and Martin T.
Vechev. 2016. Probabilistic model for code with [333] Baptiste Rozière, Jie Zhang, François Charton,
decision trees. In Proceedings of the 2016 ACM SIG- Mark Harman, Gabriel Synnaeve, and Guillaume
PLAN International Conference on Object-Oriented Lample. 2022. Leveraging automated unit tests for
Programming, Systems, Languages, and Applications, unsupervised code translation. In The Tenth Inter-
OOPSLA 2016, part of SPLASH 2016, Amsterdam, national Conference on Learning Representations,
The Netherlands, October 30 - November 4, 2016, ICLR 2022, Virtual Event, April 25-29, 2022. Open-
pages 731–747. ACM. Review.net.
[325] Veselin Raychev, Pavol Bielik, Martin T. Vechev, [334] Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei
and Andreas Krause. 2016. Learning programs from Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu
noisy data. In Proceedings of the 43rd Annual ACM Mao, Xingyu Zeng, and Rui Zhao. 2023. TPTU: task
SIGPLAN-SIGACT Symposium on Principles of Pro- planning and tool usage of large language model-
gramming Languages, POPL 2016, St. Petersburg, based AI agents. CoRR, abs/2308.03427.
FL, USA, January 20 - 22, 2016, pages 761–774.
ACM. [335] Ohad Rubin and Jonathan Berant. 2021. Smbop:
Semi-autoregressive bottom-up semantic parsing. In
[326] Veselin Raychev, Martin T. Vechev, and Andreas Proceedings of the 2021 Conference of the North
Krause. 2015. Predicting program properties from American Chapter of the Association for Computa-
"big code". In Proceedings of the 42nd Annual ACM tional Linguistics: Human Language Technologies,
NAACL-HLT 2021, Online, June 6-11, 2021, pages Muennighoff, Albert Villanova del Moral, Olatunji
311–324. Association for Computational Linguistics. Ruwase, Rachel Bawden, Stas Bekman, Angelina
McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile
[336] Rebecca L. Russell, Louis Y. Kim, Lei H. Hamil- Saulnier, Samson Tan, Pedro Ortiz Suarez, Vic-
ton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, tor Sanh, Hugo Laurençon, Yacine Jernite, Julien
Paul M. Ellingwood, and Marc W. McConley. 2018. Launay, Margaret Mitchell, Colin Raffel, Aaron
Automated vulnerability detection in source code Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri
using deep representation learning. In 17th IEEE Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg
International Conference on Machine Learning and Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue,
Applications, ICMLA 2018, Orlando, FL, USA, De- Christopher Klamm, Colin Leong, Daniel van Strien,
cember 17-20, 2018, pages 757–762. IEEE. David Ifeoluwa Adelani, and et al. 2022. BLOOM:
A 176b-parameter open-access multilingual language
[337] Caitlin Sadowski, Jeffrey van Gogh, Ciera Jas- model. CoRR, abs/2211.05100.
pan, Emma Söderberg, and Collin Winter. 2015. Tri-
corder: Building a program analysis ecosystem. In [343] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and
37th IEEE/ACM International Conference on Soft- Frank Tip. 2023. Adaptive test generation using a
ware Engineering, ICSE 2015, Florence, Italy, May large language model. CoRR, abs/2302.06527.
16-24, 2015, Volume 1, pages 598–608. IEEE Com-
puter Society. [344] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì,
Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer,
[338] Vaibhav Saini, Farima Farmahinifarahani, Yadong Nicola Cancedda, and Thomas Scialom. 2023. Tool-
Lu, Pierre Baldi, and Cristina V. Lopes. 2018. Oreo: former: Language models can teach themselves to
detection of clones in the twilight zone. In Pro- use tools. CoRR, abs/2302.04761.
ceedings of the 2018 ACM Joint Meeting on Euro-
pean Software Engineering Conference and Sympo- [345] Torsten Scholak, Nathan Schucher, and Dzmitry
sium on the Foundations of Software Engineering, Bahdanau. 2021. PICARD: parsing incrementally for
ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, constrained auto-regressive decoding from language
USA, November 04-09, 2018, pages 354–365. ACM. models. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing,
[339] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, EMNLP 2021, Virtual Event / Punta Cana, Domini-
Chanchal K. Roy, and Cristina V. Lopes. 2016. can Republic, 7-11 November, 2021, pages 9895–
Sourcerercc: scaling code clone detection to big-code. 9901. Association for Computational Linguistics.
In Proceedings of the 38th International Conference
on Software Engineering, ICSE 2016, Austin, TX, [346] David Schuler and Andreas Zeller. 2009.
USA, May 14-22, 2016, pages 1157–1168. ACM. Javalanche: efficient mutation testing for java. In
Proceedings of the 7th joint meeting of the European
[340] Pasquale Salza, Christoph Schwizer, Jian Gu, and Software Engineering Conference and the ACM
Harald C. Gall. 2023. On the effectiveness of transfer SIGSOFT International Symposium on Foundations
learning for code search. IEEE Trans. Software Eng., of Software Engineering, 2009, Amsterdam, The
49(4):1804–1822. Netherlands, August 24-28, 2009, pages 297–298.
ACM.
[341] Victor Sanh, Albert Webson, Colin Raffel,
Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, [347] John Schulman, Filip Wolski, Prafulla Dhariwal,
Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Alec Radford, and Oleg Klimov. 2017. Proximal pol-
Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, icy optimization algorithms. CoRR, abs/1707.06347.
Shanya Sharma Sharma, Eliza Szczechla, Taewoon
Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti [348] Marija Selakovic, Michael Pradel, Rezwana
Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Karim, and Frank Tip. 2018. Test generation for
Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, higher-order functions in dynamic languages. Proc.
Harshit Pandey, Rachel Bawden, Thomas Wang, Tr- ACM Program. Lang., 2(OOPSLA):161:1–161:27.
ishala Neeraj, Jos Rozen, Abheesht Sharma, An-
drea Santilli, Thibault Févry, Jason Alan Fries, Ryan [349] Rico Sennrich, Barry Haddow, and Alexandra
Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Birch. 2016. Improving neural machine translation
Thomas Wolf, and Alexander M. Rush. 2022. Multi- models with monolingual data. In Proceedings of the
task prompted training enables zero-shot task gener- 54th Annual Meeting of the Association for Compu-
alization. In The Tenth International Conference on tational Linguistics, ACL 2016, August 7-12, 2016,
Learning Representations, ICLR 2022, Virtual Event, Berlin, Germany, Volume 1: Long Papers. The Asso-
April 25-29, 2022. OpenReview.net. ciation for Computer Linguistics.
[342] Teven Le Scao, Angela Fan, Christopher Akiki, [350] Sina Shamshiri. 2015. Automated unit test gen-
Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman eration for evolving software. In Proceedings of
Castagné, Alexandra Sasha Luccioni, François Yvon, the 2015 10th Joint Meeting on Foundations of Soft-
Matthias Gallé, Jonathan Tow, Alexander M. Rush, ware Engineering, ESEC/FSE 2015, Bergamo, Italy,
Stella Biderman, Albert Webson, Pawan Sasanka Am- August 30 - September 4, 2015, pages 1038–1041.
manamanchi, Thomas Wang, Benoît Sagot, Niklas ACM.
[351] Peter Shaw, Ming-Wei Chang, Panupong Pasu- test case recommendation. In IEEE/ACM Interna-
pat, and Kristina Toutanova. 2021. Compositional tional Conference on Automation of Software Test,
generalization and natural language variation: Can a AST@ICSE 2022, Pittsburgh, PA, USA, May 21-22,
semantic parsing approach handle both? In Proceed- 2022, pages 65–76. ACM/IEEE.
ings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International [360] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni,
Joint Conference on Natural Language Processing, and Chandan K. Reddy. 2023. Execution-based code
ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual generation using deep reinforcement learning. Trans-
Event, August 1-6, 2021, pages 922–938. Association actions on Machine Learning Research.
for Computational Linguistics.
[361] Disha Shrivastava, Denis Kocetkov, Harm
[352] Noam Shazeer. 2019. Fast transformer decod- de Vries, Dzmitry Bahdanau, and Torsten Scholak.
ing: One write-head is all you need. CoRR, 2023. Repofusion: Training code models to under-
abs/1911.02150. stand your repository. CoRR, abs/2306.10998.
[353] Dongdong She, Rahul Krishna, Lu Yan, Suman [362] Disha Shrivastava, Hugo Larochelle, and Daniel
Jana, and Baishakhi Ray. 2020. Mtfuzz: fuzzing with Tarlow. 2023. Repository-level prompt generation
a multi-task neural network. In ESEC/FSE ’20: 28th for large language models of code. In International
ACM Joint European Software Engineering Confer- Conference on Machine Learning, ICML 2023, 23-29
ence and Symposium on the Foundations of Software July 2023, Honolulu, Hawaii, USA, volume 202 of
Engineering, Virtual Event, USA, November 8-13, Proceedings of Machine Learning Research, pages
2020, pages 737–749. ACM. 31693–31715. PMLR.
[354] Dongdong She, Kexin Pei, Dave Epstein, Jun- [363] Chang Shu, Yusen Zhang, Xiangyu Dong, Peng
feng Yang, Baishakhi Ray, and Suman Jana. 2019. Shi, Tao Yu, and Rui Zhang. 2021. Logic-consistency
NEUZZ: efficient fuzzing with neural program text generation from semantic parses. In Findings
smoothing. In 2019 IEEE Symposium on Security of the Association for Computational Linguistics:
and Privacy, SP 2019, San Francisco, CA, USA, May ACL/IJCNLP 2021, Online Event, August 1-6, 2021,
19-23, 2019, pages 803–817. IEEE. volume ACL/IJCNLP 2021 of Findings of ACL,
pages 4414–4426. Association for Computational
[355] Xinyu She, Yue Liu, Yanjie Zhao, Yiling He, Linguistics.
Li Li, Chakkrit Tantithamthavorn, Zhan Qin, and
Haoyu Wang. 2023. Pitfalls in language models for [364] Mukul Singh, José Cambronero, Sumit Gulwani,
code intelligence: A taxonomy and survey. CoRR, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023.
abs/2310.17903. Codefusion: A pre-trained diffusion model for code
generation.
[356] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang
Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, [365] Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen
Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxi- Chen, and Yang Liu. 2020. CORE: automating re-
ang Wang. 2023. Pangu-coder2: Boosting large lan- view recommendation for code changes. In 27th
guage models for code with ranking feedback. CoRR, IEEE International Conference on Software Analysis,
abs/2307.14936. Evolution and Reengineering, SANER 2020, London,
ON, Canada, February 18-21, 2020, pages 284–295.
[357] Shu-Ting Shi, Ming Li, David Lo, Ferdian Thung, IEEE.
and Xuan Huo. 2019. Automatic code review by
learning the revision of source code. In The Thirty- [366] Saleh Soltan, Shankar Ananthakrishnan, Jack
Third AAAI Conference on Artificial Intelligence, FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan,
AAAI 2019, The Thirty-First Innovative Applications Charith Peris, Stephen Rawls, Andy Rosenbaum,
of Artificial Intelligence Conference, IAAI 2019, The Anna Rumshisky, Chandana Satya Prakash, Mukund
Ninth AAAI Symposium on Educational Advances in Sridhar, Fabian Triefenbach, Apurv Verma, Gökhan
Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, Tür, and Prem Natarajan. 2022. Alexatm 20b: Few-
USA, January 27 - February 1, 2019, pages 4910– shot learning using a large-scale multilingual seq2seq
4917. AAAI Press. model. CoRR, abs/2208.01448.
[358] Tianze Shi, Chen Zhao, Jordan L. Boyd-Graber, [367] Aarohi Srivastava, Abhinav Rastogi, Abhishek
Hal Daumé III, and Lillian Lee. 2020. On the poten- Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam
tial of lexico-logical alignments for semantic parsing Fisch, Adam R. Brown, Adam Santoro, Aditya
to SQL queries. In Findings of the Association for Gupta, Adrià Garriga-Alonso, Agnieszka Kluska,
Computational Linguistics: EMNLP 2020, Online Aitor Lewkowycz, Akshat Agarwal, Alethea Power,
Event, 16-20 November 2020, volume EMNLP 2020 Alex Ray, Alex Warstadt, Alexander W. Kocurek,
of Findings of ACL, pages 1849–1864. Association Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par-
for Computational Linguistics. rish, Allen Nie, Aman Hussain, Amanda Askell,
Amanda Dsouza, Ambrose Slone, Ameet Rahane,
[359] Samiha Shimmi and Mona Rahimi. 2022. Lever- Anantharaman S. Iyer, Anders Johan Andreassen, An-
aging code-test co-evolution patterns for automated drea Madotto, Andrea Santilli, Andreas Stuhlmüller,
Andrew M. Dai, Andrew La, Andrew Lampinen, Donell, Kyle Richardson, Laria Reynolds, Leo Gao,
Andy Zou, Angela Jiang, Angelica Chen, Anh Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-
Vuong, Animesh Gupta, Anna Gottardi, Antonio Ochando, Louis-Philippe Morency, Luca Moschella,
Norelli, Anu Venkatesh, Arash Gholamidavoodi, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng
Arfa Tabassum, Arul Menezes, Arun Kirubara- He, Luis Oliveros-Colón, Luke Metz, Lütfi Kerem
jan, Asher Mullokandov, Ashish Sabharwal, Austin Senel, Maarten Bosma, Maarten Sap, Maartje Ter
Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas
B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Mazeika, Marco Baturan, Marco Marelli, Marco
Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Maru, Maria Jose Ramirez-Quintana, Marie Tolkiehn,
Hedayatnia, Behnam Neyshabur, Benjamin Inden, Mario Giulianelli, Martha Lewis, Martin Potthast,
Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Matthew L Leavitt, Matthias Hagen, Mátyás Schu-
Howald, Bryan Orinion, Cameron Diao, Cameron bert, Medina Orduna Baitemirova, Melody Arnaud,
Dour, Catherine Stinson, Cedrick Argueta, Cesar Melvin McElrath, Michael Andrew Yee, Michael Co-
Ferri, Chandan Singh, Charles Rathkopf, Chenlin hen, Michael Gu, Michael Ivanitskiy, Michael Star-
Meng, Chitta Baral, Chiyu Wu, Chris Callison- ritt, Michael Strube, Michał Sw˛edrowski, Michele
Burch, Christopher Waites, Christian Voigt, Christo- Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike
pher D Manning, Christopher Potts, Cindy Ramirez, Cain, Mimee Xu, Mirac Suzgun, Mitch Walker,
Clara E. Rivera, Clemencia Siro, Colin Raffel, Court- Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor
ney Ashcraft, Cristina Garbacea, Damien Sileo, Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun
Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Peng, Nathan Andrew Chi, Nayeon Lee, Neta Gur-
Roth, C. Daniel Freeman, Daniel Khashabi, Daniel Ari Krakover, Nicholas Cameron, Nicholas Roberts,
Levy, Daniel Moseguí González, Danielle Perszyk, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas
Danny Hernandez, Danqi Chen, Daphne Ippolito, Deckers, Niklas Muennighoff, Nitish Shirish Keskar,
Dar Gilboa, David Dohan, David Drakard, David Ju- Niveditha S. Iyer, Noah Constant, Noah Fiedel,
rgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Nuan Wen, Oliver Zhang, Omar Agha, Omar El-
Denis Kleyko, Deniz Yuret, Derek Chen, Derek baghdadi, Omer Levy, Owain Evans, Pablo Anto-
Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, nio Moreno Casares, Parth Doshi, Pascale Fung,
Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi,
Dylan Schrader, Ekaterina Shutova, Ekin Dogus Peiyuan Liao, Percy Liang, Peter W Chang, Pe-
Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth ter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr
Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti
Rodolà, Emma Lam, Eric Chu, Eric Tang, Erkut Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Ra-
Erdem, Ernie Chang, Ethan A Chi, Ethan Dyer, bin Banjade, Rachel Etta Rudolph, Raefer Gabriel,
Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Rahel Habacker, Ramon Risco, Raphaël Millière,
Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku
Fernando Martínez-Plumed, Francesca Happé, Fran- Arakawa, Robbe Raymaekers, Robert Frank, Ro-
cois Chollet, Frieda Rong, Gaurav Mishra, Genta In- han Sikand, Roman Novak, Roman Sitelew, Ro-
dra Winata, Gerard de Melo, Germán Kruszewski, nan Le Bras, Rosanne Liu, Rowan Jacobs, Rui
Giambattista Parascandolo, Giorgio Mariani, Glo- Zhang, Russ Salakhutdinov, Ryan Andrew Chi,
ria Xinyue Wang, Gonzalo Jaimovitch-Lopez, Gregor Seungjae Ryan Lee, Ryan Stovall, Ryan Teehan,
Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Rylan Yang, Sahib Singh, Saif M. Mohammad,
Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam
Hayden Bogar, Henry Francis Anthony Shevlin, Hin- Wiseman, Samuel Gruetter, Samuel R. Bowman,
rich Schuetze, Hiromu Yakura, Hongming Zhang, Samuel Stern Schoenholz, Sanghyun Han, Sanjeev
Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan
Jack Geissinger, Jackson Kernion, Jacob Hilton, Jae- Ghosh, Sean Casey, Sebastian Bischoff, Sebastian
hoon Lee, Jaime Fernández Fisac, James B Simon, Gehrmann, Sebastian Schuster, Sepideh Sadeghi,
James Koppel, James Zheng, James Zou, Jan Kocon, Shadi Hamdan, Sharon Zhou, Shashank Srivastava,
Jana Thompson, Janelle Wingfield, Jared Kaplan, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixi-
Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, ang Shane Gu, Shubh Pachchigar, Shubham Tosh-
Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle niwal, Shyam Upadhyay, Shyamolima Shammie
Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Debnath, Siamak Shakeri, Simon Thormeyer, Si-
Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming mone Melzi, Siva Reddy, Sneha Priscilla Makini,
Song, Jillian Tang, Joan Waweru, John Burden, John Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar,
Miller, John U. Balis, Jonathan Batchelder, Jonathan Stanislas Dehaene, Stefan Divic, Stefano Ermon,
Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez- Stella Biderman, Stephanie Lin, Stephen Prasad,
Orallo, Joseph Boudeman, Joseph Guerr, Joseph Steven Piantadosi, Stuart Shieber, Summer Mish-
Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce erghi, Svetlana Kiritchenko, Swaroop Mishra, Tal
Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali,
Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Tatsunori Hashimoto, Te-Lin Wu, Théo Desbor-
Markert, Kaustubh Dhole, Kevin Gimpel, Kevin des, Theodore Rothschild, Thomas Phan, Tianle
Omondi, Kory Wallace Mathewson, Kristen Chia- Wang, Tiberius Nkinyili, Timo Schick, Timofei Ko-
fullo, Ksenia Shkaruta, Kumar Shridhar, Kyle Mc- rnev, Titus Tunduny, Tobias Gerstenberg, Trenton
Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Conference on Software Maintenance and Evolution,
Uri Shaham, Vedant Misra, Vera Demberg, Victo- Victoria, BC, Canada, September 29 - October 3,
ria Nyamai, Vikas Raunak, Vinay Venkatesh Ra- 2014, pages 476–480. IEEE Computer Society.
masesh, vinay uday prabhu, Vishakh Padmakumar,
Vivek Srikumar, William Fedus, William Saunders, [376] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu
William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Fu, and Neel Sundaresan. 2020. Intellicode compose:
Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadol- code generation using transformer. In ESEC/FSE
lah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, ’20: 28th ACM Joint European Software Engineering
Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Conference and Symposium on the Foundations of
Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yu- Software Engineering, Virtual Event, USA, November
fang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, 8-13, 2020, pages 1433–1443. ACM.
Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi
[377] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu,
Wu. 2023. Beyond the imitation game: Quantifying
and Neel Sundaresan. 2019. Pythia: Ai-assisted code
and extrapolating the capabilities of language models.
completion system. In Proceedings of the 25th ACM
Transactions on Machine Learning Research.
SIGKDD International Conference on Knowledge
[368] Chia-Yi Su and Collin McMillan. 2023. Dis- Discovery & Data Mining, KDD 2019, Anchorage,
tilled GPT for source code summarization. CoRR, AK, USA, August 4-8, 2019, pages 2727–2735. ACM.
abs/2308.14731. [378] Marc Szafraniec, Baptiste Rozière, Hugh Leather,
Patrick Labatut, François Charton, and Gabriel Syn-
[369] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and
naeve. 2023. Code translation with compiler repre-
Yunfeng Liu. 2021. Roformer: Enhanced trans-
sentations. In The Eleventh International Confer-
former with rotary position embedding. CoRR,
ence on Learning Representations, ICLR 2023, Ki-
abs/2104.09864.
gali, Rwanda, May 1-5, 2023. OpenReview.net.
[370] Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2238–2249. Association for Computational Linguistics.
[371] Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, and Quanjun Zhang. 2022. Code search based on context-aware code translation. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 388–400. ACM.
[372] Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen Chen, Quanjun Zhang, Hanwei Qian, Yang Liu, and Zhenyu Chen. 2023. Automatic code summarization via chatgpt: How far are we? CoRR, abs/2305.12865.
[373] Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning. CoRR, abs/2303.08128.
[374] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
[375] Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a big data curated benchmark of inter-project code clones. In 30th IEEE International Conference on Software Maintenance and Evolution, ICSME 2014, Victoria, BC, Canada, September 29 - October 3, 2014, pages 476–480. IEEE Computer Society.
[379] Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 180–182. IEEE Computer Society.
[380] Lappoon R. Tang and Raymond J. Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP 2000, Hong Kong, October 7-8, 2000, pages 133–141. Association for Computational Linguistics.
[381] Shimin Tao, Weibin Meng, Yimeng Chen, Yichen Zhu, Ying Liu, Chunning Du, Tao Han, Yongpeng Zhao, Xiangguang Wang, and Hao Yang. 2022. Logstamp: Automatic online log parsing based on sequence labelling. SIGMETRICS Perform. Evaluation Rev., 49(4):93–98.
[382] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
[383] Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. 2023. UL2: unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
[384] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. 2022. Lamda: Language models for dialog applications. CoRR, abs/2201.08239.
[385] Sindhu Tipirneni, Ming Zhu, and Chandan K. Reddy. 2022. Structcoder: Structure-aware transformer for code generation. CoRR, abs/2206.05239.
[386] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
[387] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
[388] Hieu Tran, Ngoc M. Tran, Son Nguyen, Hoan Nguyen, and Tien N. Nguyen. 2019. Recovering variable names for minified code with usage contexts. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 1165–1175. IEEE / ACM.
[389] Immanuel Trummer. 2022. Codexdb: Synthesizing code for query processing from natural language instructions using GPT-3 codex. Proc. VLDB Endow., 15(11):2921–2928.
[390] Zhaopeng Tu, Zhendong Su, and Premkumar T. Devanbu. 2014. On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014, pages 269–280. ACM.
[391] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit test case generation with transformers. CoRR, abs/2009.05617.
[392] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, and Neel Sundaresan. 2022. Generating accurate assert statements for unit test cases using pretrained transformers. In IEEE/ACM International Conference on Automation of Software Test, AST@ICSE 2022, Pittsburgh, PA, USA, May 21-22, 2022, pages 54–64. ACM/IEEE.
[393] Michele Tufano, Jason Kimko, Shiya Wang, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, and Denys Poshyvanyk. 2020. Deepmutation: a neural mutation tool. In ICSE '20: 42nd International Conference on Software Engineering, Companion Volume, Seoul, South Korea, 27 June - 19 July, 2020, pages 29–32. ACM.
[394] Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys Poshyvanyk. 2019. On learning meaningful code changes via neural machine translation. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 25–36. IEEE / ACM.
[395] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.
[396] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning how to mutate source code from bug-fixes. In 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, Cleveland, OH, USA, September 29 - October 4, 2019, pages 301–312. IEEE.
[397] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 2291–2302. ACM.
[398] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards automating code review activities. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pages 163–174. IEEE.
[399] Lewis Tunstall, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers: Building Language Applications with Hugging Face. O'Reilly Media, Incorporated.
[400] Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. 2019. Neural program repair by jointly learning to localize and repair. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
[401] Bogdan Vasilescu, Casey Casalnuovo, and Premkumar T. Devanbu. 2017. Recovering clear, natural identifiers from obfuscated JS names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, pages 683–693. ACM.
[402] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
[403] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, pages 397–407. ACM.
[404] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3261–3275.
[405] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 353–355. Association for Computational Linguistics.
[406] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: relation-aware schema encoding and linking for text-to-sql parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7567–7578. Association for Computational Linguistics.
[407] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
[408] Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, and Xiangke Liao. 2022. Bridging pre-trained models and downstream tasks for source code understanding. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 287–298. ACM.
[409] Ke Wang and Zhendong Su. 2020. Blended, precise semantic program embeddings. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, pages 121–134, New York, NY, USA. Association for Computing Machinery.
[410] Pengcheng Wang, Jeffrey Svajlenko, Yanzhao Wu, Yun Xu, and Chanchal K. Roy. 2018. Ccaligner: a token based large-gap clone detector. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 1066–1077. ACM.
[411] Ping Wang, Tian Shi, and Chandan K. Reddy. 2020. Text-to-sql generation for question answering on electronic medical records. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pages 350–361. ACM / IW3C2.
[412] Shangwen Wang, Ming Wen, Bo Lin, Yepang Liu, Tegawendé F. Bissyandé, and Xiaoguang Mao. 2023. Pre-implementation method name prediction for object-oriented programming. ACM Trans. Softw. Eng. Methodol., 32(6).
[413] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768.
[414] Song Wang, Devin Chollak, Dana Movshovitz-Attias, and Lin Tan. 2016. Bugram: bug detection with n-gram language models. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016, pages 708–719. ACM.
[415] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 297–308. ACM.
[416] Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. 2022. What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 22964–22984. PMLR.
[417] Weidong Wang, Yujian Kang, and Dian Li. 2022. An empirical study on java method name suggestion: are we there yet? CoRR, abs/2201.08570.
[418] Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 261–271. IEEE.
[419] Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. Trans^3: A transformer-based framework for unifying code summarization and code search. CoRR, abs/2003.03238.
[420] Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang. 2021. Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation.
[421] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu. 2022. Compilable neural code generation with compiler feedback. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 9–19. Association for Computational Linguistics.
[422] Xin Wang, Yasheng Wang, Yao Wan, Jiawei Wang, Pingyi Zhou, Li Li, Hao Wu, and Jin Liu. 2022. CODE-MVP: learning to represent source code from multiple views with contrastive pre-training. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 1066–1077. Association for Computational Linguistics.
[423] Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Lingling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, Saravanakumar Rajmohan, and Dongmei Zhang. 2022. SPINE: a scalable log parser with feedback guidance. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14-18, 2022, pages 1198–1208. ACM.
[424] Yanlin Wang, Ensheng Shi, Lun Du, Xiaodi Yang, Yuxuan Hu, Shi Han, Hongyu Zhang, and Dongmei Zhang. 2021. Cocosum: Contextual code summarization with multi-relational graph neural network. CoRR, abs/2107.01933.
[425] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13484–13508. Association for Computational Linguistics.
[426] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-naturalinstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5085–5109. Association for Computational Linguistics.
[427] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. CoRR, abs/2305.07922.
[428] Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics.
[429] Zan Wang, Ming Yan, Junjie Chen, Shuang Liu, and Dongdi Zhang. 2020. Deep learning library testing via effective model generation. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 788–799. ACM.
[430] Cody Watson, Michele Tufano, Kevin Moran, Gabriele Bavota, and Denys Poshyvanyk. 2020. On learning meaningful assert statements for unit test cases. In ICSE '20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020, pages 1398–1409. ACM.
[431] Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free lunch for testing: Fuzzing deep-learning libraries from open source. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 995–1007. ACM.
[432] Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code generation as a dual task of code summarization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 6559–6569.
[433] Huihui Wei and Ming Li. 2017. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 3034–3040. ijcai.org.
[434] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
[435] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.
[436] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
[437] Jiayi Wei, Greg Durrett, and Isil Dillig. 2023. Typet5: Seq2seq type inference using static analysis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
[438] Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. 2020. Lambdanet: Probabilistic type inference using graph neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
[439] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016, pages 87–98. ACM.
[440] Martin White, Christopher Vendome, Mario Linares Vásquez, and Denys Poshyvanyk. 2015. Toward deep learning software repositories. In 12th IEEE/ACM Working Conference on Mining Software Repositories, MSR 2015, Florence, Italy, May 16-17, 2015, pages 334–345. IEEE Computer Society.
[441] Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, and Eng Lieh Ouh. 2020. Bugsinpy: a database of existing bugs in python programs to enable controlled testing and debugging studies. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 1556–1560. ACM.
[442] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8:229–256.
[443] Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. On the unusual effectiveness of type-aware operator mutations for testing SMT solvers. Proc. ACM Program. Lang., 4(OOPSLA):193:1–193:25.
[444] Edmund Wong, Taiyue Liu, and Lin Tan. 2015. Clocom: Mining existing source code for automatic comment generation. In 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015, Montreal, QC, Canada, March 2-6, 2015, pages 380–389. IEEE Computer Society.
[445] Hongqiu Wu, Hai Zhao, and Min Zhang. 2021. Code summarization with structure-induced transformer. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 1078–1090. Association for Computational Linguistics.
[446] Ming Wu, Pengcheng Wang, Kangqi Yin, Haoyu Cheng, Yun Xu, and Chanchal K. Roy. 2020. Lvmapper: A large-variance clone detector using sequencing alignment approach. IEEE Access, 8:27986–27997.
[447] Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yuqun Zhang, Guowei Yang, Huixin Ma, Sen Nie, Shi Wu, Heming Cui, and Lingming Zhang. 2022. Evaluating and improving neural program-smoothing-based fuzzing. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 847–858. ACM.
[448] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1482–1494. IEEE.
[449] Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14-18, 2022, pages 959–971. ACM.
[450] Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, and Michael W. Godfrey. 2022. Docter: documentation-guided fuzzing for testing deep learning API functions. In ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18 - 22, 2022, pages 176–188. ACM.
[451] Rui Xie, Wei Ye, Jinan Sun, and Shikun Zhang. 2021. Exploiting method names to improve code summarization: A deliberation multi-task learning approach. In 29th IEEE/ACM International Conference on Program Comprehension, ICPC 2021, Madrid, Spain, May 20-21, 2021, pages 138–148. IEEE.
[452] Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 602–631. Association for Computational Linguistics.
[453] Yutao Xie, Jiayi Lin, Hande Dong, Lei Zhang, and Zhonghai Wu. 2023. A survey of deep code search. CoRR, abs/2305.05959.
[454] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. CoRR, abs/2304.12244.
[455] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In MAPS@PLDI 2022: 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022, pages 1–10. ACM.
[456] Ling Xu, Huanhuan Yang, Chao Liu, Jianhang Shuai, Meng Yan, Yan Lei, and Zhou Xu. 2021. Two-stage attention-based model for code search with textual and structural features. In 28th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2021, Honolulu, HI, USA, March 9-12, 2021, pages 342–353. IEEE.
[457] Yichen Xu and Yanqiao Zhu. 2022. A survey on pretrained language models for neural code intelligence. CoRR, abs/2212.10079.
[458] Zhaogui Xu, Xiangyu Zhang, Lin Chen, Kexin Pei, and Baowen Xu. 2016. Python probabilistic type inference with natural language support. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, pages 607–618. ACM.
[459] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. Sqlizer: query synthesis from natural language. Proc. ACM Program. Lang., 1(OOPSLA):63:1–63:26.
[460] Mohammad A. Yahya and Dae-Kyoo Kim. 2022. Cross-language source code clone detection using deep learning with infercode. CoRR, abs/2205.04913.
[461] Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang. 2020. Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 344–354. IEEE.
[462] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023. Baichuan 2: Open large-scale language models. CoRR, abs/2309.10305.
[463] Chenyuan Yang, Yinlin Deng, Jiayi Yao, Yuxing Tu, Hanchi Li, and Lingming Zhang. 2023. Fuzzing automatic differentiation in deep-learning libraries. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1174–1186. IEEE.
[464] Wei Yang, Peng Xu, and Yanshuai Cao. 2021. Hierarchical neural data synthesis for semantic parsing. CoRR, abs/2112.02212.
[465] Yanping Yang, Ling Xu, Meng Yan, Zhou Xu, and Zhongyang Deng. 2022. A naming pattern based approach for method name recommendation. In IEEE 33rd International Symposium on Software Reliability Engineering, ISSRE 2022, Charlotte, NC, USA, October 31 - Nov. 3, 2022, pages 344–354. IEEE.
[466] Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, and Huan Sun. 2018. Staqc: A systematically mined question-code dataset from stack overflow. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 1693–1703. ACM.
[467] Michihiro Yasunaga and Percy Liang. 2020. Graph-based, self-supervised program repair from diagnostic feedback. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 10799–10808. PMLR.
[468] Michihiro Yasunaga and Percy Liang. 2021. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11941–11952. PMLR.
[469] Ming-Ho Yee and Arjun Guha. 2023. Do machine learning models produce typescript types that type check? In 37th European Conference on Object-Oriented Programming, ECOOP 2023, July 17-21, 2023, Seattle, Washington, United States, volume 263 of LIPIcs, pages 37:1–37:28. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.
[470] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 476–486. ACM.
[471] Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 440–450. Association for Computational Linguistics.
[472] Ying Yin, Yuhai Zhao, Yiming Sun, and Chen Chen. 2023. Automatic code review by learning the structure information of code graph. Sensors, 23(5):2551.
[473] Hao Yu, Wing Lam, Long Chen, Ge Li, Tao Xie, and Qianxiang Wang. 2019. Neural detection of semantic code clones via tree-based convolution. In Proceedings of the 27th International Conference on Program Comprehension, ICPC 2019, Montreal, QC, Canada, May 25-31, 2019, pages 70–80. IEEE / ACM.
[474] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Tao Xie, and Qianxiang Wang. 2023. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. CoRR, abs/2302.00288.
[475] Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir R. Radev. 2018. Typesql: Knowledge-based type-aware neural text-to-sql generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 588–594. Association for Computational Linguistics.
[476] Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir R. Radev. 2018. Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1653–1663. Association for Computational Linguistics.
[477] Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander R. Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter S. Lasecki, and Dragomir R. Radev. 2019. Cosql: A conversational text-to-sql challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1962–1979. Association for Computational Linguistics.
[478] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3911–3921. Association for Computational Linguistics.
[479] Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, and Dragomir R. Radev. 2019. Sparc: Cross-domain semantic parsing in context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4511–4523. Association for Computational Linguistics.
[480] Zhiqiang Yuan, Junwei Liu, Qiancheng Zi, Mingwei Liu, Xin Peng, and Yiling Lou. 2023. Evaluating instruction-tuned large language models on code comprehension and generation. CoRR, abs/2308.01240.
[481] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: continual pre-training on sketches for library-oriented code generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 2369–2375. ijcai.org.
[482] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2023. Large language models meet nl2code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 7443–7464. Association for Computational Linguistics.
[483] John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 2, pages 1050–1055. AAAI Press / The MIT Press.
[484] Chunyan Zhang, Junchao Wang, Qinglei Zhou, Ting Xu, Ke Tang, Hairen Gui, and Fudong Liu. 2022. A survey of automatic source code summarization. Symmetry, 14(3):471.
[485] Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. CoRR, abs/2303.12570.
[486] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 783–794. IEEE / ACM.
[487] Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A survey of learning-based automated program repair. CoRR, abs/2301.03270.
[488] Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir R. Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5337–5348. Association for Computational Linguistics.
[489] Tianzhu Zhang, Han Qiu, Gabriele Castellano, Myriana Rifai, Chung Shue Chen, and Fabio Pianese. 2023. System log parsing: A survey. IEEE Trans. Knowl. Data Eng., 35(8):8596–8614.
[490] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. CoRR, abs/2303.17568.
[491] Yunhui Zheng, Saurabh Pujar, Burn L. Lewis, Luca Buratti, Edward A. Epstein, Bo Yang, Jim Laredo, Alessandro Morari, and Zhong Su. 2021. D2A: A dataset built for ai-based vulnerability detection methods using differential analysis. In 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2021, Madrid, Spain, May 25-28, 2021, pages 111–120. IEEE.
[492] Zibin Zheng, Kaiwen Ning, Jiachi Chen, Yanlin Wang, Wenqing Chen, Lianghong Guo, and Weicheng Wang. 2023. Towards an understanding of large language models in software engineering tasks. CoRR, abs/2308.11396.
[493] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.
[494] Wenkang Zhong, Chuanyi Li, Jidong Ge, and Bin Luo. 2022. Neural program repair: Systems, challenges and solutions. In Internetware 2022: 13th Asia-Pacific Symposium on Internetware, Hohhot, China, June 11 - 12, 2022, pages 96–106. ACM.
[495] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2023. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. CoRR, abs/2308.07921.
[496] Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207.
[497] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R. Lyu. 2019. Tools and benchmarks for automated log parsing. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2019, Montreal, QC, Canada, May 25-31, 2019, pages 121–130. IEEE / ACM.
[498] Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K. Reddy. 2022. Xlcost: A benchmark dataset for cross-lingual code intelligence. CoRR, abs/2206.08474.
[499] Ming Zhu, Karthik Suresh, and Chandan K. Reddy. 2022. Multilingual code snippets training for program translation. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 11783–11790. AAAI Press.
[500] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, pages 341–353. ACM.
[501] Deqing Zou, Sujuan Wang, Shouhuai Xu, Zhen Li, and Hai Jin. 2021. µvuldeepecker: A deep learning-based system for multiclass vulnerability detection. IEEE Trans. Dependable Secur. Comput., 18(5):2224–2236.

A Benchmarks for Downstream Tasks

Tables 5, 6, 7, and 8 list benchmark datasets for code downstream tasks.
Task | Date | Benchmark | Source | Size | Language
Text-to-SQL | 1990 | ATIS | Hemphill et al. (1990); Dahl et al. (1994) | 11508 |
Text-to-SQL | 1996 | GeoQuery | Zelle and Mooney (1996) | 877 |
Text-to-SQL | 2000 | Restaurants | Tang and Mooney (2000) | 378 |
Text-to-SQL | 2014-09 | MAS | Li and Jagadish (2014) | 196 |
Text-to-SQL | 2017-02 | Yelp | Yaghmazadeh et al. (2017) | 128 |
Text-to-SQL | 2017-02 | IMDb | Yaghmazadeh et al. (2017) | 131 |
Text-to-SQL | 2017-04 | Scholar | Iyer et al. (2017) | 816 |
Text-to-SQL | 2017-08 | WikiSQL | Zhong et al. (2017) | 80654 |
Text-to-SQL | 2018-06 | Advising | Finegan-Dollak et al. (2018) | 4570 |
Text-to-SQL | 2018-09 | Spider | Yu et al. (2018) | 10181 |
Text-to-SQL | 2019-06 | SParC | Yu et al. (2019) | 12726 |
Text-to-SQL | 2019-07 | MIMICSQL | Wang et al. (2020) | 10000 |
Text-to-SQL | 2019-09 | CoSQL | Yu et al. (2019) | 15598 |
Text-to-SQL | 2020-10 | Squall | Shi et al. (2020) | 11276 |
Text-to-SQL | 2021-06 | SEDE | Hazoom et al. (2021) | 12023 |
Text-to-SQL | 2021-06 | KaggleDBQA | Lee et al. (2021) | 400 |
Program Synthesis | 2018-08 | CONCODE | Iyer et al. (2018) | 104K | Java
Program Synthesis | 2021-05 | APPS | Hendrycks et al. (2021) | 10000 | Python
Program Synthesis | 2021-07 | HumanEval | Chen et al. (2021) | 164 | Python
Program Synthesis | 2021-08 | MBPP | Austin et al. (2021) | 974 | Python
Program Synthesis | 2021-08 | MathQA-Python | Austin et al. (2021) | 23914 | Python
Program Synthesis | 2022-06 | AixBench | Hao et al. (2022) | 336 | Java
Program Synthesis | 2022-11 | DS-1000 | Lai et al. (2023) | 1000 | Python
Program Synthesis | 2023-02 | CoderEval | Yu et al. (2023) | 460 | Python, Java
Program Synthesis | 2023-03 | HumanEval-X | Zheng et al. (2023) | 820 | Python, C++, Java, JS, Go
Program Synthesis | 2023-09 | CodeApex | Fu et al. (2023) | 476 | C++
Code Translation | 2020-06 | GeeksforGeeks | Rozière et al. (2020) | 1.4K | C++, Java, Python
Code Translation | 2021-02 | CodeTrans | Lu et al. (2021) | 11.8K | Java, C#
Code Translation | 2021-08 | Avatar | Ahmad et al. (2023) | 9515 | Java, Python
Code Translation | 2022-06 | CoST | Zhu et al. (2022) | 132K∗ | C++, Java, Python, C#, JS, PHP, C
Code Translation | 2022-06 | XLCoST | Zhu et al. (2022) | 567K∗ | C++, Java, Python, C#, JS, PHP, C
Code Translation | 2023-03 | HumanEval-X | Zheng et al. (2023) | 1640∗ | Python, C++, Java, JS, Go
Code Translation | 2023-08 | G-TransEval | Jiao et al. (2023) | 4000∗ | C++, Java, C#, JS, Python

Table 5: Benchmarks for text-to-SQL generation, program synthesis, and code translation. JS is short for JavaScript. ∗ These are pairwise sample counts. For example, HumanEval-X includes 164 programs, each implemented in 5 languages, totaling 164 × (5 × 4 / 2) = 1640 translation pairs.
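The pairwise counts marked with ∗ follow directly from the number of programs and languages reported for each benchmark. As a minimal sketch of the arithmetic in the footnote above (the helper function is ours, for illustration only, and is not part of any benchmark's tooling):

# Minimal sketch: reproduce the pairwise translation-pair counts marked with * in Table 5.
# Assumes unordered language pairs, as in the caption's HumanEval-X example.
def pairwise_translation_count(num_programs: int, num_languages: int) -> int:
    # Each program contributes one sample per unordered pair of languages,
    # i.e. num_languages * (num_languages - 1) / 2 pairs.
    return num_programs * num_languages * (num_languages - 1) // 2

print(pairwise_translation_count(164, 5))  # 1640, matching the HumanEval-X row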
Task | Date | Benchmark | Source | Size | Language
Program Repair | 2014-07 | Defects4J | Just et al. (2014) | 357 | Java
Program Repair | 2015-12 | ManyBugs | Goues et al. (2015) | 185 | C
Program Repair | 2015-12 | IntroClass | Goues et al. (2015) | 998 | C
Program Repair | 2016-11 | BugAID | Hanam et al. (2016) | 105K | JS
Program Repair | 2017-02 | DeepFix | Gupta et al. (2017) | 6971 | C
Program Repair | 2017-05 | Codeflaws | Tan et al. (2017) | 3902 | C
Program Repair | 2017-10 | QuixBugs | Lin et al. (2017) | 80 | Java, Python
Program Repair | 2018-12 | BFP | Tufano et al. (2019) | 124K | Java
Program Repair | 2019-01 | unnamed | Tufano et al. (2019) | 21.8K | Java
Program Repair | 2019-05 | ManySStuBs4J | Karampatsis and Sutton (2020) | 154K | Java
Program Repair | 2019-11 | Refactory | Hu et al. (2019) | 1783 | Python
Program Repair | 2020-07 | CoCoNut | Lutellier et al. (2020) | 24M | Java, Python, C, JS
Program Repair | 2020-11 | BugsInPy | Widyasari et al. (2020) | 493 | Python
Program Repair | 2021-07 | TFix | Berabi et al. (2021) | 105K | JS
Program Repair | 2022-11 | TypeBugs | Oh and Oh (2022) | 93 | Python
Program Repair | 2023-08 | HumanEvalPack | Muennighoff et al. (2023) | 984 | Python, JS, Go, Java, C++, Rust
Code Summarization | 2016-08 | CODE-NN | Iyer et al. (2016) | 66K/32K | C#/SQL
Code Summarization | 2017-07 | unnamed | Barone and Sennrich (2017) | 150K | Python
Code Summarization | 2018-05 | DeepCom | Hu et al. (2018) | 588K | Java
Code Summarization | 2018-07 | TL-CodeSum | Hu et al. (2018) | 411K | Java
Code Summarization | 2019-09 | CodeSearchNet | Husain et al. (2019) | 2.3M | Go, JS, Python, PHP, Java, Ruby
Code Summarization | 2023-08 | HumanEvalPack | Muennighoff et al. (2023) | 984 | Python, JS, Go, Java, C++, Rust
Code Completion∗ | 2013-05 | GitHub Java Corpus | Allamanis and Sutton (2013) | 2.1M | Java
Code Completion∗ | 2016-10 | Py150 | Raychev et al. (2016) | 150K | Python
Code Completion∗ | 2016-10 | JS150 | Raychev et al. (2016) | 150K | JS
Code Completion∗ | 2023-06 | LCC | Guo et al. (2023) | 360K | Python, Java, C#

Table 6: Benchmarks for program repair, code summarization, and code completion. JS is short for JavaScript. ∗ The task of code completion can be evaluated on any source code corpus, so we only list a few widely used benchmarks here. For cross-file code completion please refer to Table 8.
Task | Date | Benchmark | Source | Size | Language
Code Retrieval | 2018-03 | StaQC | Yao et al. (2018) | 268K | Python, SQL
Code Retrieval | 2018-05 | DeepCS | Gu et al. (2018) | 16M | Java
Code Retrieval | 2018-05 | CoNaLa | Yin et al. (2018) | 600K/2.9K∗ | Python
Code Retrieval | 2019-08 | unnamed | Li et al. (2019) | 287 | Java
Code Retrieval | 2019-09 | CodeSearchNet | Husain et al. (2019) | 2.3M/99∗ | Go, JS, Python, PHP, Java, Ruby
Code Retrieval | 2020-02 | CosBench | Yan et al. (2020) | 52 | Java
Code Retrieval | 2020-08 | SO-DS | Heyman and Cutsem (2020) | 2.2K | Python
Code Retrieval | 2020-10 | FB-Java | Ling et al. (2021) | 249K | Java
Code Retrieval | 2021-02 | AdvTest | Lu et al. (2021) | 280K | Python
Code Retrieval | 2021-02 | WebQueryTest | Lu et al. (2021) | 1K | Python
Code Retrieval | 2021-05 | CoSQA | Huang et al. (2021) | 21K | Python
Code Reasoning | 2020-09 | MMLU | Hendrycks et al. (2021)† | 15908 |

Table 7: Benchmarks for code retrieval, code reasoning, type inference, clone detection/code search, and defect/vulnerability detection. JS is short for JavaScript. ∗ These benchmarks include a large number of automatically constructed samples, and a small set of human-annotated samples. † These are general-domain reasoning benchmarks, and only a subset therein concern programming, algorithms, and other topics related to computer science. ‡ These are project counts (or, in the case of Cassano et al. (2023), file counts). Yee and Guha (2023) propose to measure project-level type check rate instead of type prediction accuracy for TypeScript.
Task | Date | Benchmark | Source | Size | Language
Log Parsing | 2018-11 | LogHub (2018) | Zhu et al. (2019); He et al. (2020) | 379M |
Log Parsing | 2023-08 | LogHub (2023)∗ | Jiang et al. (2023) | 50.4M |

Table 8: Benchmarks for log parsing and repository-level coding. ∗ LogHub (2023) is an annotated subset of LogHub (2018). † Line Completion/API Invocation Completion/Function Completion. ‡ Retrieval/Completion/Pipeline. ⋄ Migration/Temporal Edit.