(2023) A Survey On Language Models For Code
Adapted LM: Codex (Chen et al., 2021), PaLM Coder (Chowdhery et al., 2022), Minerva (Lewkowycz et al., 2022), PaLM 2* (Anil et al., 2023), Code LLaMA (Rozière et al., 2023)
Instruction Finetuning: WizardCoder (Luo et al., 2023), PanGu-Coder2 (Shen et al., 2023), OctoCoder (Muennighoff et al., 2023), MFTCoder (Liu et al., 2023)
Reinforcement Learning: CompCoder (Wang et al., 2022), CodeRL (Le et al., 2022), PPOCoder (Shojaee et al., 2023), RLTF (Liu et al., 2023)
ing, including instruction tuning (Honovich et al., 2023; Xu et al., 2023; Luo et al., 2023), infilling objectives (Tay et al., 2023; Li et al., 2023; Rozière et al., 2023), reconsideration of scaling laws (Hoffmann et al., 2022; Gunasekar et al., 2023; Li et al., 2023), architectural improvements (Shazeer, 2019; Su et al., 2021; Dao et al., 2022), and autonomous agents (Qian et al., 2023; Hong et al., 2023), while in return SE requirements are providing real-world testbeds for these technologies and driving the development of LLMs forward into production. We believe a systematic review of these advancements would benefit both communities.

The rest of this work is organized following the taxonomy presented in Figure 1. In §2 we first provide the preliminaries of language modeling and Transformer models, and then in §3 we contextualize the evaluation of language models for code, highlighting the historical transition from various code understanding tasks to more practical text-to-code generation tasks. In §4 we discuss the plethora of LLMs that have demonstrated coding ability, and then in §5 we review the specialized and often smaller models by their architecture, with special attention to the recent application of infilling objectives, instruction tuning, reinforcement learning, and engineering improvements. Then, in §6, we discuss unique features of code that are not available to natural languages but have been utilized to aid code processing. In §7, we review the most recent integration between LLMs and software development, before finally concluding this work in §8 and highlighting the current challenges in code processing.

2 Background

In this section, we briefly review the preliminaries of Transformer-based language modeling, including common objectives for unidirectional and bidirectional models, as well as some popular models and designs in NLP.

2.1 Causal Language Modeling

Unidirectional language models (also known as causal language models²) factor the probability of a sentence into the product of each token's conditional probability with the chain rule. A piece of input text $x = [x_1, x_2, \cdots, x_n]$ consisting of $n$ tokens is modeled as

$$P(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{1:i-1}), \qquad (1)$$

where $x_{1:i-1}$ is a shorthand for the tokens before $x_i$ in the input, and $\theta$ denotes the parameters of the model. With Transformer decoders such as GPT (Radford et al., 2018; Radford et al., 2019; Brown et al., 2020) and LLaMA (Touvron et al., 2023; Touvron et al., 2023), the conditional probability in (1) is modeled by adding an attention mask to the attention matrix of each Transformer block, ensuring that $x_i$ can only attend to previous tokens. During training, the cross entropy loss on all tokens in the input is calculated in parallel, while at inference time each new token is generated autoregressively. For further details about the Transformer architecture we refer to Vaswani et al. (2017).

² The training objective of such language models is Causal Language Modeling (CLM), also referred to as Next Token Prediction.

2.2 Masked Language Modeling

Unlike causal language models, bidirectional language models are trained to acquire a better contextual representation of text rather than generating text autoregressively. In the vanilla Transformer, the encoder part is allowed to attend to a token's left as well as right context for this purpose. BERT (Devlin et al., 2019) goes one step further and trains only a Transformer encoder. A set $M$ of randomly chosen tokens in the input is replaced by a special token [MASK] to obtain a noisy input $\hat{x}$, for example [[CLS], $x_1$, [MASK], $x_3$, [MASK], $x_5$, [EOS]]³, and the model is trained to recover the original tokens by maximizing

$$\prod_{m \in M} p_\theta(m \mid \hat{x}). \qquad (2)$$

While this objective requires the model to have a deep understanding of the input text to reconstruct it, it suffers from low training efficiency, since only a small set of tokens (usually 15%) are masked (and thus "trained on"). To address this issue, Clark et al. (2020) proposed ELECTRA, which is instead trained to discriminate whether each token in the input has been replaced by a BERT-like model.

³ Both [CLS] and [EOS] are artificial tokens added to the input text. [CLS] is added at the beginning and its representation is used for sentence classification, while [EOS] indicates end of sentence. The original BERT also uses another special token [SEP], which is no longer in common use, and we refer to Devlin et al. (2019) for details.

2.3 Denoising Objectives

GPT-style causal LMs and BERT-style bidirectional LMs each have their own strengths and weaknesses. While GPT can be used for autoregressive generation, it lacks a bidirectional representation of input text, and is thus unsuitable for sequence-to-sequence (seq2seq) generation tasks such as translation and summarization. BERT, on the other hand, can produce bidirectional representations, but is pretrained only for mask filling, not generation.

The vanilla Transformer encoder-decoder architecture combines the respective advantages of GPT and BERT. T5 (Raffel et al., 2020) is such a model pretrained with span corruption, which can be regarded as a variant of MLM. During pretraining, spans of text in the input are replaced with sentinel tokens, which play the same role as [MASK] in BERT. The noisy input is first processed by the encoder with bidirectional attention, and the masked spans are then generated autoregressively by the decoder. Formally, if $k$ spans are sampled for corruption in input $x$, the noisy input $\hat{x}$ is then constructed by replacing each span with a special token <extra_id_i>, for $i = 1, 2, \cdots, k$, and the target $y$ is constructed by concatenating all spans prepended with corresponding sentinels: [<extra_id_1>, span$_1$, $\cdots$, <extra_id_k>, span$_k$]. The model is then trained with a standard seq2seq objective, by maximizing

$$p_\theta(y \mid \hat{x}) = \prod_{i=1}^{|y|} p_\theta(y_i \mid \hat{x}, y_{1:i-1}). \qquad (3)$$

Lester et al. (2021) show that models pretrained with such objectives can be adapted for autoregressive language modeling with extra pretraining using the prefix language modeling objective, i.e. splitting the text into two parts, processing the first part with the encoder and generating the second part with the decoder.

Tay et al. (2023) argue that span corruption is also closely related to CLM, since one can mask out the whole input text as a single span and train the decoder to generate it autoregressively. Inspired by this relation, they propose UL2, which is the combination of many span corruption objectives that differ in corruption rate and span length. Applying it to both encoder-decoder models and decoder-only models, they find that encoder-decoder models perform better under the same computation budget constraint. Other studies have also found
that such encoder-decoder models generally perform better than causal decoder-only models (Wang et al., 2022; Soltan et al., 2022).

2.4 Auxiliary Objectives

Language modeling objectives, such as the previously discussed CLM and MLM, mainly train the model to capture token-level information and are ineffective at modeling document structures. Thus, auxiliary objectives are often added to help the models learn such global information. BERT is pretrained with next sentence prediction (NSP) along with MLM, which is formulated as a binary classification task to predict whether two segments in the input are neighboring in the original corpus. Lan et al. (2020) propose a more challenging sentence-order prediction (SOP) task, where the negative samples are constructed by swapping the order of two neighboring sentences instead of sampling a random sentence from other documents.

Relatedly, Raffel et al. (2020) mix supervised downstream samples such as GLUE (Wang et al., 2018) into T5's pretraining dataset to conduct multi-task pretraining. However, it is worth noting that since they unify all tasks into a text-to-text format, the training objective is the same for their self-supervised pretraining and supervised downstream tasks.

2.5 Implementation Design

While most research on pretraining language models has focused on designing training objectives, the low-level implementation of the Transformer architecture itself has also been continuously improved over the years in pursuit of stability, performance, and efficiency.

The original Transformer block proposed by Vaswani et al. (2017) is formulated as

$$h = \mathrm{LN}(\mathrm{Attention}(x) + x), \qquad (4)$$
$$y = \mathrm{LN}(\mathrm{FFN}(h) + h), \qquad (5)$$

where $x$ is the layer's input, $y$ is the layer's output, "Attention" is the self-attention sublayer, "FFN" is the feed-forward sublayer, and "LN" is layer normalization (Ba et al., 2016).

GPT-2 (Radford et al., 2019) moves layer normalization to the input of each Transformer sub-block to stabilize training:

$$h = \mathrm{Attention}(\mathrm{LN}(x)) + x, \qquad (6)$$
$$y = \mathrm{FFN}(\mathrm{LN}(h)) + h, \qquad (7)$$

and such pre-norm has since become a standard practice in Transformer decoders.

GPT-J (Wang and Komatsuzaki, 2021) modifies the Transformer block to compute the FFN sub-layer and the self-attention sub-layer in parallel to increase computation throughput:

$$y = x + \mathrm{FFN}(\mathrm{LN}(x)) + \mathrm{Attention}(\mathrm{LN}(x)), \qquad (8)$$

and Chowdhery et al. (2022) observe limited performance degradation when applying this design to larger models.

PaLM (Chowdhery et al., 2022) introduces Rotary Position Embedding (RoPE) and Multi-Query Attention (MQA) into LLMs. RoPE (Su et al., 2021) multiplies the keys and queries of each self-attention layer by a position-dependent rotation matrix to inject position information, and was later shown to enable position interpolation for processing of longer sequences (Chen et al., 2023; Rozière et al., 2023). As an alternative to RoPE, Press et al. (2022) propose ALiBi, which directly attenuates the attention scores according to the relative position between key and query. This position embedding scheme was later adopted by BLOOM (Scao et al., 2022).

Apart from position embeddings, another issue in Transformer that has long troubled researchers is the fact that the complexity of self-attention scales quadratically with the input sequence length. Some works such as Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021) and cosFormer (Qin et al., 2022) use approximate attention to reduce this complexity, but they mostly come at the cost of degraded performance. Other works tackle this issue from an engineering point of view. MQA (Shazeer, 2019) shares the same set of keys and values across all attention heads to optimize the memory-to-arithmetic ratio, and significantly improves inference speed at a small cost in model performance. Its variant Grouped-Query Attention (GQA, Ainslie et al., 2023) takes a middle-ground approach by dividing attention heads into groups and sharing the same set of keys/values within each group. Orthogonally, Dao et al. (2022) introduce FlashAttention, which is an exact but improved implementation of self-attention that optimizes IO operations on the accelerating device via tiling to improve memory efficiency.
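The key/value sharing behind MQA and GQA can be illustrated with a minimal pure-Python sketch. It is only an illustration under simplifying assumptions: a single query position per head, no learned projections, no batching, and no causal mask. The names `attend` and `grouped_query_attention` are hypothetical and are not taken from Shazeer (2019) or Ainslie et al. (2023).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def grouped_query_attention(queries, kv_groups):
    """queries: one query vector per attention head.
    kv_groups: one (keys, values) pair per key/value group.

    With a single group this reduces to MQA (all heads share keys/values);
    with as many groups as heads it is ordinary multi-head attention.
    """
    n_heads, n_groups = len(queries), len(kv_groups)
    assert n_heads % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = n_heads // n_groups
    outputs = []
    for h, q in enumerate(queries):
        # Heads in the same group read the same cached keys and values.
        keys, values = kv_groups[h // heads_per_group]
        outputs.append(attend(q, keys, values))
    return outputs


# Four query heads sharing two key/value groups (illustrative numbers).
q = [1.0, 0.0]
kv_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
kv_b = ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
gqa_out = grouped_query_attention([q, q, q, q], [kv_a, kv_b])
mqa_out = grouped_query_attention([q, q, q, q], [kv_a])
```

In this sketch only the number of stored key/value sets changes between the variants, which is why the memory needed for cached keys and values at inference time shrinks with the number of groups.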
[Figure: HumanEval pass@1 of code language models; legible labels include WizardCoder 34B, PanGu-Coder 2 16B, and WizardCoder 16B.]
Figure 3: Evaluation tasks for code processing, to be continued in Figure 4. Black: non-neural methods. Red:
non-Transformer neural methods (such as LSTM). Orange: Transformer encoder based methods (such as BERT).
Violet: Transformer based seq2seq methods (such as T5). Blue: Transformer decoder based methods (such as GPT).
Gray: Other Transformer-based methods (such as UniLM). For code synthesis we only list several representative
benchmarks here, and refer to §4 and §5 for more details. We note that here “method” differs from “target”. For
example, Pearce et al. (2022) examine the code generated by GitHub Copilot for vulnerabilities, but the method
they use is non-neural.
by directly generating the source code in the autoregressive language modeling style, even without task-specific finetuning (Chen et al., 2021). We discuss this task in more detail in §3.3.

- Text-to-SQL is a special (and arguably easier) case of code synthesis, where the model is tasked to generate SQL commands from natural language queries. It has been a topic of special interest due to SQL's structured nature (when compared with general-purpose languages such as Python and C) and wide application in data management. We refer to Kumar et al. (2022) and Deng et al. (2022) for surveys on this topic.

- Math programming is also a special case of code synthesis, where a language model is required to solve mathematical reasoning problems via generating code that will be executed by external interpreters. This task abstracts the reasoning process from numerical calculations, and is thus of special interest in evaluating LLMs.
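To make the math programming setup concrete, the following minimal sketch executes a program that stands in for model output and reads back its result. Everything here is an assumption made for illustration: the word problem, the hand-written `GENERATED_PROGRAM` string (no model is called), the convention that the result is bound to a variable named `answer`, and the helper `run_generated_program`.

```python
# Hypothetical word problem: "There are 7 boxes with 12 apples each.
# 5 apples are eaten. How many apples remain?"
# The string below is hand-written and merely stands in for model output.
GENERATED_PROGRAM = """
boxes = 7
apples_per_box = 12
eaten = 5
answer = boxes * apples_per_box - eaten
"""


def run_generated_program(program: str):
    """Run the program and return the value it binds to `answer`.

    Emptying `__builtins__` only keeps this toy example simple; it is NOT
    a security sandbox, and untrusted code needs real isolation.
    """
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
```

The model only has to express the reasoning steps as code, while the interpreter carries out the arithmetic, which is the separation the task description above refers to.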
3.1.2 Code-to-Code

Code-to-code tasks take code as input, and output code.

- Code search is a task similar to code retrieval, and differs from the latter only in that the input is an existing code snippet, often in a different programming language from the target.

- Code completion aims to complete a piece of code given its prefix. This is essentially language modeling applied to code, and related technologies have been progressively introduced: n-gram, RNN, and Transformer. However, due to the structured nature of programming languages, many early works found grammar-aided statistical models to perform better (Bielik et al., 2016; Hellendoorn and Devanbu, 2017), and neural models only became dominant after 2018 (see Figure 3 for an intuitive overview).

- Code translation aims to translate a piece of code (usually a function or method) into another programming language. The relation between code translation and cross-lingual code search is similar to the one between code synthesis and text-to-code retrieval, and SMT/NMT models have also been widely applied to this task. Unlike code synthesis, which is useful in aiding programmers to write snippets of code, code translation is an important technique in migrating old projects written in obsolete languages. However, we are yet to witness such applications, as the context window of even the most powerful language models is quite limited in the face of such projects.

- Code repair, also known as bug fix, aims to fix a piece of buggy code. Like code translation, it is a traditional sequence-to-sequence generation task, and surveys are abundant on this topic (Gazzola et al., 2018; Monperrus, 2018; Zhong et al., 2022; Zhang et al., 2023; Huang et al., 2023).

- Cloze test is a recently proposed task for code processing, after the rise of BERT-style pretraining. Due to the unique semantics of programming languages, several keywords are often selected for this test, such as min and max (Feng et al., 2020).

- Code infilling is another recently proposed task, after fill-in-the-middle pretraining (Bavarian et al., 2022) became popular. It is a generalization of code completion, where not only the left context, but also the right context is given. However, it differs from cloze test in that the target of cloze test is only one token, while the target of code infilling can be an entire line or even multiple lines, which requires a decoder to generate autoregressively.

- Obfuscation refers to the process of renaming identifiers (e.g. variables, methods, and classes), for example to generic names like var_1, var_2 or x, y. It is an important technique in virus detection, intellectual property protection, and code size reduction (Collberg and Thomborson, 2002; Murad et al., 2010; Vasilescu et al., 2017). Deobfuscation refers to the reverse process, where meaningful identifier names are recovered from obfuscated programs. Obfuscation has seen few applications of language models as it can be easily achieved statically, but deobfuscation has been a subject of more interest in recent years, and has been adopted as a pretraining objective for code language models (Lachaux et al., 2021; Ding et al., 2022).

- Unit test generation aims to generate unit tests for a given program. Prior to the rise of Codex and other code LLMs, almost all works in this area employed non-neural methods (see Figure 3). In the age of LLMs, however, this task is ever more important, as research has shown that the current unit tests for evaluating LLMs' program synthesis capability may be insufficient (Liu et al., 2023).

- Assertion generation is a task closely related to unit testing. Given a program and a set of unit tests, this task aims to generate assertions (also known as oracles in software engineering) to evaluate the program using the unit tests. This task has generally gone unnoticed by the NLP community, as the program synthesis task used for evaluating LLMs often concerns standalone, competition-style methods, for which the simple assertion of the equality between program output and expected answer suffices.

- Mutant generation aims to generate mutants of a given program for the purpose of mutation testing, and relates closely to unit test generation and assertion generation. A mutant that is not detected by a given set of unit tests and assertions indicates that either additional test cases or better assertions are required (Fraser and Arcuri, 2011). Recently, masking out tokens in the source code and sampling them from the output of a masked language model has become a common method for this task. Papadakis et al. (2019) provide a survey on this topic.

- Fuzzing aims to mutate a given set of unit tests to generate new test cases, and is another task related to testing software. While many recent works on fuzzing target deep learning libraries, few have utilized language models to conduct this process (see Figure 3).

- Type prediction aims to predict types in dynamically typed programming languages such as Python and JavaScript. It has been used as a pretraining objective for code language models (Wang et al., 2022), where it is often simplified as a binary tagging task to predict which tokens in the code are identifiers (Wang et al., 2021; Wang et al., 2021).

Figure 4: Evaluation tasks for code processing, continued from Figure 3. Black: non-neural methods. Red: non-Transformer neural methods. Orange: Transformer encoder based methods. Violet: Transformer based seq2seq methods. Blue: Transformer decoder based methods. Gray: Other Transformer-based methods. We note that here “method” differs from “target”. For example, Pearce et al. (2022) examine the code generated by GitHub Copilot for vulnerabilities, but the method they use is non-neural.

3.1.3 Code-to-Text

Code-to-text tasks take code as input, and output text.

- Code summarization, also referred to as docstring generation, aims to generate a natural language description for a given piece of code (often a function or method). This is the opposite of code synthesis, and SMT/NMT techniques have been likewise applied. Zhang et al. (2022) provide a survey on this topic.

- Code review aims to automate the process of peer code review, and may come in many forms. Many early works formulated it as a binary classification task to accept or reject changes at commit time, while others utilized information retrieval technologies to recommend comments from a pool of existing reviews. As generative models became more capable, however, researchers have also studied directly generating review comments as a sequence-to-sequence learning task.

- Identifier prediction is the task of predicting identifier names in the code. As these names are deemed to contain important semantic information, this task has been utilized for code summarization (Allamanis et al., 2016), as well as pretraining code models (Wang et al., 2021; Niu et al., 2022).

3.1.4 Code-to-Patterns

Code-to-patterns tasks conduct classification on code.

- Defect detection predicts whether the input code is buggy or not, and is a standard single-sentence classification task.

- Clone detection predicts whether or not two pieces of code are clones of each other. In software engineering there exist four types of code clones, and the most challenging type to identify is semantic clones, i.e. syntactically dissimilar code that has the same functionality. As this task can be viewed as a two-sentence classification task, BERT-style language models have been widely applied to
it.

- Code classification, popularized by Mou et al. (2016), aims to predict the functionality of a piece of code within a predefined set of labels. A very similar task is author identification, which predicts the author of the input code. Both tasks are standard single-sentence classification tasks.

- Code reasoning is a recently introduced task for evaluating LLMs, and often comes as a subset of general evaluation benchmarks such as MMLU (Hendrycks et al., 2021). This task requires the model to reason about the code or algorithms, and answer related questions which are written in multiple-choice form and may range from conceptual understanding to numerical calculation and complexity analysis.

3.1.5 Text-to-Text

Text-to-text tasks take text as input, and output text.

- Document translation is the automatic translation of code-related documents. Since models, datasets, and prompting strategies for machine translation are abundant in NLP (Vaswani et al., 2017; Goyal et al., 2022; He et al., 2023), we do not go into detail about this task.

- Log parsing aims to analyze the system logs produced by software products, for example parsing logs into structured templates or finding anomalies from raw logs. Zhu et al. (2019) provide a survey on traditional methods for this task up to 2018, while Zhang et al. (2023) also cover more recent methods.

3.1.6 NLP Point-of-View

Among the previously listed tasks, code synthesis, code translation, code repair, deobfuscation, unit test generation, assertion generation, mutant generation, fuzzing, code summarization, code review, and identifier prediction are sequence-to-sequence generation tasks. Formally, each instance of these tasks has a source sequence $x$ (e.g. a piece of source code) and a target sequence $y$ (e.g. its corresponding summarization), and the language model is tasked to maximize the conditional probability given by (3), where $\theta$ can be either a decoder-only model or an encoder-decoder model. In the former case, $x$ and $y$ are concatenated. In the latter case, $x$ is processed by the encoder and $y$ is processed by the decoder.

Code completion and code infilling are also generation tasks, and correspond exactly to the two pretraining objectives given in (1) and (3), except that for code infilling only one span in the input is masked. Similarly, cloze test is an understanding task in the same form as (2).

Defect detection, clone detection, code classification, and type prediction are sequence classification tasks. In these tasks, a set of labels $Y$ is defined over the input, and each instance is assigned a label $y \in Y$ (e.g. for defect detection $Y = \{0, 1\}$, while for type prediction a possible $Y$ is {int, float, string, bool, others}). The model is then tasked to maximize

$$p_\theta(y \mid x). \qquad (9)$$

The last two tasks - code retrieval and code search - also belong to understanding tasks. In these tasks, each source sequence $x$ is paired with a positive target sequence $y$ and a set of negative targets $\bar{y} \in \{y_1, \cdots, y_k\}$. The model's task is to find a similarity metric $s$ such that $s(x, y)$ is larger than $s(x, \bar{y})$.

3.2 Evaluation Metrics

Of the tasks mentioned in §3.1, the understanding tasks are similar in form to natural language understanding tasks (Wang et al., 2018; Wang et al., 2019) and evaluated likewise by metrics such as accuracy, F1 and Mean Reciprocal Rank (MRR), while short generation tasks such as identifier prediction are also evaluated by accuracy of exact matches. Code-to-text tasks are evaluated with common metrics for text generation such as BLEU (Papineni et al., 2002).

Evaluation of tasks involving code generation, on the other hand, is more complicated. Most early works evaluated syntactical correctness, i.e. the percentage of generations that can be successfully parsed. Chen et al. (2018) argued against such metrics and suggested reference match instead, which is the percentage of generations that are exactly the same as the references. Ren et al. (2020) proposed CodeBLEU, a variant of BLEU that takes code syntax and semantics into account by evaluating the overlap of abstract syntax tree (AST) and data flow.

As code generation models became more capable over the years, however, these metrics based on content overlap have been found to be inadequate (Rozière et al., 2020; Hendrycks et al., 2021; Austin et al., 2021), since functionally equivalent snippets of code can differ dramatically in their lexical forms. Consequently, researchers have turned their attention to functional correctness. One popular example of such metrics is pass@k, proposed by Kulal et al. (2019) and refined by Chen et al. (2021), which is an unbiased estimator of the probability that at least one of k generated samples passes all unit tests of a program. This metric can be generalized to passn@k (Li et al., 2022), which limits the number of model submissions to n but allows filtering by unit tests given in the input from k samples.

3.3 Program Synthesis

As code models advanced over the years, researchers have gradually turned their attention to the practical task of program synthesis. CONCODE (Iyer et al., 2018) is one of the early datasets in this area, which includes more than 100K Java methods and is incorporated as a subset of the CodeXGLUE benchmark (Lu et al., 2021). Since 2021, the community has witnessed an abundance of datasets for this task. Most of them, including APPS (Hendrycks et al., 2021), HumanEval (Chen et al., 2021), and MBPP (Austin et al., 2021), focus on Python, but recent works have also extended HumanEval into other programming languages (Cassano et al., 2023; Zheng et al., 2023; Muennighoff et al., 2023). DS-1000 is a more realistic Python dataset that focuses on data science libraries such as NumPy and SciPy, while several math reasoning benchmarks have also been converted to programming tasks, including MathQA-Python (Amini et al., 2019; Austin et al., 2021) and GSM8K-Python (Cobbe et al., 2021; Chowdhery et al., 2022; Wang et al., 2023).

3.4 Repository-Level Evaluation

Most evaluation tasks discussed in §3.1 and Figure 3 are limited to a single file or even a single function, as cross-file code modeling poses challenges that are beyond the capability of most existing language models. Recently, however, position interpolation techniques (Chen et al., 2023; Rozière et al., 2023; Peng et al., 2023) have extended the context window of LLMs to hundreds of thousands of tokens, making it possible to contextualize the evaluation of code modeling within entire repositories. Several works (Shrivastava et al., 2023; Ding et al., 2022; Zhang et al., 2023; Shrivastava et al., 2023) have studied code completion leveraging repository-level context, and Liu et al. (2023) propose RepoBench to evaluate such systems. More recently, Bairi et al. (2023) investigate the more challenging tasks of repository-level API migration and temporal editing, and Jimenez et al. (2023) introduce a corresponding benchmark, SWE-bench.

4 General Language Models for Code

Since language models have scaled to hundreds of billions of parameters (Brown et al., 2020; Chowdhery et al., 2022), many of them have demonstrated non-trivial coding capability, even if they are not specifically designed or trained for code. Following the pioneering work of Codex, researchers have also found continual pretraining on code to significantly benefit language models' performance on code⁴.

⁴ While some works refer to this process as "finetuning on code", it is still self-supervised in nature. Thus we choose to adopt the term "extra/additional/continual pretraining" in this work to avoid confusion with supervised in-task finetuning or instruction finetuning.

4.1 Off-the-Shelf Language Models

Large language models are often pretrained on trillions of tokens following the scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022), and such an amount of text data is often a diverse composite with a non-negligible part of code. The Pile (Gao et al., 2021), for example, includes 95GB of code crawled from GitHub out of its 800GB raw dataset, while the multilingual pretraining dataset ROOTS (Laurençon et al., 2022) also contains 163GB of code spanning 13 programming languages in its 1.6TB compound. As two of the largest open-source pretraining datasets, they have supported many language models with coding ability. GPT-J (Wang and Komatsuzaki, 2021), for example, is reported by Chen et al. (2021) to demonstrate non-trivial performance on HumanEval, while Scao et al. (2022) report similar results for GPT-NeoX (Black et al., 2022) and BLOOM. LLaMA (Touvron et al., 2023), whose pretraining dataset includes 328GB of code from GitHub, achieves 23.7 pass@1 performance on HumanEval, and its successor LLaMA 2 (Touvron et al., 2023) achieves an even higher score of 29.9.

Closed-source models, on the other hand, generally perform better. LaMDA (Thoppilan et al., 2022) and PaLM (Chowdhery et al., 2022), whose pretraining datasets contain 12.5% and 5% code respectively, achieve 14.0 and 26.2 pass@1 performance on HumanEval, while GPT-4 (OpenAI, 2023) set a record of 67.0 (an early version is reported by Bubeck et al. (2023) to reach 82) that until recently has remained higher than
any specialized models pretrained or instruction-finetuned for code.

More recently, the general trend has been to train smaller models with larger datasets, following the revised scaling law (Hoffmann et al., 2022). Baichuan 2 (Yang et al., 2023), for example, is a 13B model trained on 2.6T tokens, while Qwen (Bai et al., 2023) is a 14B model trained on 3T tokens. They achieve 17.1 and 32.3 pass@1 on HumanEval, respectively. Li et al. (2023), however, demonstrate that models as small as 1.3B can acquire coding capability that is comparable to much larger models while also maintaining a reasonable performance on general text processing and even manifesting some emergent abilities (Wei et al., 2022) such as chain-of-thought reasoning (Wei et al., 2022). Their model, Phi-1.5, is trained on 21B tokens of textbook data generated by ChatGPT, and 100B tokens of filtered web data from Stack Overflow and Refined Web (Penedo et al., 2023), and attains 41.4 pass@1 performance on HumanEval. The exact performance of these models is presented in Table 1.

                    HumanEval (0)            MBPP (3)
Model               k=1                k=100   k=1     k=80
GPT-J (a)           11.6               27.7    -       -
LaMDA (b, c)        14.0               47.3    14.8    62.4
PaLM (b)            26.2               76.2    36.8    75.0
GPT-NeoX (d)        15.4               41.2    -       -
BLOOM (d)           15.5               55.5    -       -
LLaMA (e)           23.7               79.3    37.7    76.8
GPT-4               67.0 (f) / 82 (g)  -       -       -
LLaMA 2 (h)         29.9               89.0    45.0    81.5
Phi-1.5 (i)         41.4               -       43.5    -
Baichuan 2 (j)      17.1               -       30.2    -
Qwen (k)            32.3               -       40.8    -

Codex (a)           28.8               72.3    -       -
PaLM-Coder (b)      36.0               88.4    47.0    80.8
PaLM 2-S* (l)       37.6               88.4    50.0    86.8
Code LLaMA (m)      53.7               94.7    56.2    -
CodeFuse (n)        74.4               -       61.0    -

Table 1: Pass@k performance of raw language models (top) and language models with extra training on code (bottom) on HumanEval (0-shot) and MBPP (3-shot), ordered chronologically. For Phi-1.5 we consider Phi-1.5-web version, and for Code LLaMA we consider its Python version. (a) Chen et al. (2021); (b) Chowdhery et al. (2022); (c) Austin et al. (2021); (d) Scao et al. (2022); (e) Touvron et al. (2023); (f) OpenAI (2023); (g) Bubeck et al. (2023); (h) Touvron et al. (2023); (i) Li et al. (2023); (j) Yang et al. (2023); (k) Bai et al. (2023); (l) Anil et al. (2023); (m) Rozière et al. (2023); (n) Liu et al. (2023).

4.2 Language Models with Additional Pretraining on Code

Along with the seminal benchmark HumanEval, Chen et al. (2021) kick-started the age of LLMs for code with Codex, a series of GPT-3 checkpoints pretrained on 100B additional code tokens and one of the earliest multi-billion-parameter models for code. Following their work, other researchers have also specialized their LLMs on code with additional pretraining. Chowdhery et al. (2022) train PaLM on 7.8B additional code tokens to obtain PaLM-Coder, setting a new state of the art on HumanEval and MBPP (Table 1) that is only surpassed later by its successor PaLM 2-S*, the smallest version of PaLM 2 (Anil et al., 2023) further trained on an undisclosed amount of code. Similarly, Lewkowycz et al. (2022) train PaLM on 38.5B tokens of arXiv papers and mathematical content, while Rozière et al. (2023) train LLaMA 2 (Touvron et al., 2023) on more than 500B code tokens to acquire Code LLaMA, whose performance on HumanEval surpasses all previous LMs except GPT-4 (Table 1). Liu et al. (2023) further train Code LLaMA with multi-task finetuning (MFT) to introduce CodeFuse-CodeLLaMA, achieving 74.4 pass@1 on HumanEval and surpassing even the performance of GPT-4 published in OpenAI (2023).

While almost all of these models are Transformer decoders pretrained with CLM, several architectural modifications have been introduced along the way, as we noted in §2.5. All these models use pre-norm, and GPT-J introduces parallel attention, which is later adopted by PaLM, GPT-NeoX, and Phi-1.5. PaLM introduces MQA and RoPE into LLMs, and RoPE is now employed by most language models, including GPT-NeoX, two generations of LLaMA, Qwen, and the 7B version of Baichuan 2. BLOOM and the 13B version of Baichuan 2, however, use ALiBi for position embeddings, while LLaMA 2 and Code LLaMA adopt GQA instead of MHA or MQA. In §5, we show that specialized models pretrained exclusively on code have also followed these advancements closely.

5 Specialized Language Models for Code

As pretrained Transformers such as GPT and BERT achieved remarkable success in natural language processing, such model architectures, learning paradigms, and training objectives were soon
adopted by the software engineering community to produce specialized models for code understanding and generation. In this section, we first review common datasets used for pretraining code language models (§5.1), and then dive into the complex family of code LMs by their model architecture: encoder-only models (§5.2), encoder-decoder models (§5.3), decoder-only models (§5.4), UniLM (§5.5), and diffusion models (§5.6). Lastly, in §5.7 we also illustrate the current trend of applying more recent techniques in NLP, such as instruction tuning (Wei et al., 2022; Sanh et al., 2022; Chung et al., 2022) and reinforcement learning (Ouyang et al., 2022) to code processing. An overview of these pretrained models is provided in Table 3.

5.1 Training Dataset for Code

While text data for pretraining language models are often crawled from the web and must undergo meticulous and often aggressive preprocessing (Raffel et al., 2020), code data come naturally as whole documents from public GitHub repositories. Even better, they come with readily available quality indicators such as the count of stars or forks (although Allal et al. (2023) suggest that star count correlates poorly with downstream performance). As a result, many large-scale code pretraining datasets have been introduced, including CodeSearchNet (Husain et al., 2019), CodeParrot (Tunstall et al., 2022), and the Stack (Kocetkov et al., 2022), totaling 20GB, 50GB and 3TB of code documents respectively (Table 2).

Dataset              Size (GB)   Files (M)   # PL
CodeSearchNet (a)    20          6.5         6
The Pile (b, c)      95          19          -
CodeParrot (d)       1K          115         30
The Stack (e)        3136        317         30
ROOTS (f)            163         15          13

Table 2: Statistics of several pretraining datasets for code models: size in bytes, number of files, and number of programming languages. In CodeSearchNet each file is a function. For Pile and ROOTS we only consider their code composite. (a) Husain et al. (2019); (b) Gao et al. (2021); (c) Biderman et al. (2022); (d) https://huggingface.co/datasets/codeparrot/github-code; (e) Kocetkov et al. (2022); (f) Laurençon et al. (2022).

While these datasets are meant for training code models, it should be noted that code is ultimately a special form of natural language, as the vocabulary of most programming languages is a small subset of English. Besides, high-quality code is often interleaved with natural language comments or documentation, which also enables models to acquire certain knowledge of general text representation. In fact, of the 6.5M functions in CodeSearchNet, 2.3M are paired with natural language documentation, allowing models to train explicitly on such bimodal data.

Compared with natural language, another byproduct of scraping code from GitHub is commit histories, which consist of code before commit, code after commit, and a short message describing the commit, which can loosely serve as an instruction for language models. Muennighoff et al. (2023) utilize this feature and construct a 2GB dataset CommitPackFT containing 742K samples of instruction data for code, obviating the need for the extensive human labor required to construct natural language instructions (Sanh et al., 2022; Wang et al., 2022).

Apart from bimodal training and instruction finetuning, another recent trend in constructing code datasets is synthesizing data with powerful models such as ChatGPT. While this method was originally proposed for generating instruction data in natural language (Wang et al., 2023; Honovich et al., 2023), Gunasekar et al. (2023) take one step further and synthesize 1B tokens of Python textbooks and coding exercises to pretrain a 1.3B model, achieving state-of-the-art results on HumanEval that are comparable to much larger models trained on significantly larger datasets.

5.2 Encoders

Pretrained Transformer encoders such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020) have attained impressive results on natural language understanding tasks, and these methods were soon introduced into code processing after their advent. Kanade et al. (2020) replicate the training procedure of BERT on a code corpus to produce CuBERT, showcasing its superior performance over LSTM (Hochreiter and Schmidhuber, 1997) and non-pretrained Transformers. Feng et al. (2020), on the other hand, train CodeBERT with MLM and ELECTRA's RTD on CodeSearchNet. They also utilize the explicit text-code pairs in CodeSearchNet, and use them respectively as the
Atten. Parallel Pre- Flash
Date Model Arch. Size Vocab Context PE Init. from Objectives Dataset Training PL Inst.
Type Atten. Norm Atten.
2019-12 CuBERT BERT 350M 50K 1024 absolute MHA - MLM + NSP 9.3B 93B 1 Google
2020-02 CodeBERT RoBERTa 125M 50K 512 absolute MHA RoBERTa MLM + RTD 20GB 105B 6 Microsoft
2020-09 RoBERTa 125M 50K 640 absolute MHA CodeBERT MLM + Edge Predic-
tion + Node Alignment
20GB 131B 6 Microsoft
MLM + IP + AST Edge
2021-08 SynCoBERT RoBERTa 125M 50K 512 absolute MHA CodeBERT Prediction + CL
20GB 7B 6 Huawei
MLM + Node Type Columbia
2021-10 DISCO BERT 100M 20K 512 absolute MHA - MLM + CL
1.8GB 2
GraphCode- MLM + Type Inference
2022-05 Code-MVP RoBERTa 125M 50K 512 absolute MHA + CL
2GB 39B 1 Huawei
2020-05 GPT-C GPT-2 374M 60K 1024 absolute MHA ! - CLM 11B 270B 4 Microsoft
2021-02 CodeGPT GPT-2 124M 50K 1024 absolute MHA ! GPT-2 CLM 2GB 1 Microsoft
2022-02 PolyCoder GPT-2 160M-2.7B 50K 2048 absolute MHA ! - CLM 254GB 39B 12 CMU
CodeGen- 350M- 1.6TB(1.8TB)/
2022-03 GPT-3 50K 2048 RoPE MHA ! ! - CLM 1T(1.2T) 6(1) Salesforce
Multi(Mono) 16.1B 506B(577B)
2022-04 InCoder GPT-3 6.7B 50K 2048 Cosine MHA ! - Causal Masking 204GB 52B 28 Meta
2022-06 PyCodeGPT GPT-Neo 110M 32K 1024 absolute MHA ! - CLM 96GB 100B 1 Microsoft
2022-07 PanGu-α 317M-2.6B 42K 1024 absolute MHA ! - CLM 147GB 230B 1 Huawei
2023-01 SantaCoder GPT-2 1.1B 49K 2048 absolute MQA ! - FIM 268GB 236B 3 BigCode
2023-03 CodeGeeX PanGu-α 13B 52K 2048 absolute MHA ! - CLM 158B 850B 23 Tsinghua
2023-05 StarCoder GPT-2 15.5B 49K 8192 absolute MQA ! ! - FIM 815GB 1T 86 BigCode
2023-06 Phi-1 GPT-J 1.3B 51K 2048 RoPE MHA ! ! ! - CLM 7B 53B 1 Microsoft
2023-10 CodeFuse GPT-J 350M-13B 101K 4096 RoPE MHA ! ! ! - CLM 1.6TB / 1T 40+ Ant Group
2023-10 CodeShell GPT-2 7B 70K 8192 RoPE GQA ! - CLM 500B Peking U.
2020-10 PyMT5 GPT-2 374M 50K 1024+1024 absolute MHA ! - SC 27GB 1 Microsoft
Mastropaolo della
2021-02 T5 60M 32k 512+512 T5 MHA ! - SC 1GB 1 Svizzera
et al.
2021-02 DOBF 250M 50K 512+512 absolute MHA - MLM + Deobfuscation 45GB 2 Meta
2021-03 PLBART BART 140M 50K 1024+1024 absolute MHA - DAE 655GB / 71B 210B 2 Columbia
SC + IP + Masked IP +
2021-09 CodeT5 T5 60M-220M 32K 512+256 T5 MHA ! - Text2Code + Code2Text
∼25GB 8 Salesforce
NSP + SC + Method
2022-01 SPT-Code BART 262M 80K 512+512 absolute MHA - Name Prediction
20GB 6 Nanjing U.
2022-02 AlphaCode 300M-41B 8K 1536+768 MQA - MLM + CLM 715GB 967B 13 DeepMind
Columbia &
2022-06 NatGen T5 220M 32K 512+256 T5 MHA ! CodeT5 Naturalization ∼26GB 14B 8 UC Davis
T5/GPT- CodeGen- SC + CLM + CL +
2023-05 CodeT5+ 220M-16B 50K 2048+2048 absolute MHA ! ! Text2Code + Code2Text
52B 9 Salesforce
3 mono
2020-12 CugLM BERT 51M 50K 128 absolute MHA - MLM + NSP + CLM 8M 1.2B 2 Peking U.
MLM + CLM + SC +
2022-03 UniXcoder RoBERTa 125M 51K 1024 absolute MHA - CL + Code2Text
20GB+ 839B 6 Microsoft
Table 3: An overview of pretrained code language models’ architecture and training details: their base architecture,
model size, vocabulary, context length, position embedding, attention type (Multi-Head Attention (Vaswani
et al., 2017), Multi-Query Attention (Shazeer, 2019), or Grouped-Query Attention (Ainslie et al., 2023)), layer
normalization type (post-norm or pre-norm), usage of FlashAttention (Dao et al., 2022), training initialization,
objectives, dataset size (either in disk size, measured by GB/TB, or in number of tokens, measured by B/T),
tokens seen during training, supported number of programming languages, and institute. We note that the number
of training tokens does not count the training tokens of the model used for initialization, if any. The common
training objectives are: MLM (Masked Language Modeling), NSP (Next Sentence Prediction), RTD (Replaced
Token Detection), IP (Identifier Prediction), CL (Contrastive Learning), SC (Span Corruption), DAE (Denoising
Auto-Encoding). Missing information (such as AlphaCode’s position embedding type) is left as blank.
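As a rough illustration of two of the objectives abbreviated above, the sketch below builds one MLM example and one span corruption (SC) example from a toy token sequence. The token strings `<mask>` and `<extra_id_N>`, the helper names, and the hand-picked positions are assumptions made for illustration only; none of the models in Table 3 is claimed to use exactly this preprocessing.

```python
from typing import Dict, List, Tuple

MASK = "<mask>"  # hypothetical mask token name


def mlm_example(tokens: List[str], positions: List[int]) -> Tuple[List[str], Dict[int, str]]:
    """Masked language modeling: hide the chosen tokens and predict them in place."""
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the label the model must recover
        masked[pos] = MASK
    return masked, targets


def span_corruption_example(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Span corruption: replace each [start, end) span with one sentinel token.

    The target lists every sentinel followed by the tokens it replaced.
    (The closing sentinel used by some implementations is omitted here.)
    """
    source, target = [], []
    cursor = 0
    for idx, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{idx}>"  # illustrative sentinel naming
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
masked, labels = mlm_example(tokens, positions=[1, 10])
source, target = span_corruption_example(tokens, spans=[(3, 6), (9, 12)])
print(" ".join(masked))
print(" ".join(source), "->", " ".join(target))
```

In practice the masked positions and spans are sampled at random; they are fixed here only to keep the example deterministic.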
first and second segment in BERT's input. When using CodeBERT to initialize the encoder part of a vanilla Transformer for sequence-to-sequence generation tasks such as code summarization, they observe a moderate performance gain over non-pretrained baselines.

Apart from these standard training objectives, many auxiliary objectives specifically designed for code have also been introduced. GraphCodeBERT (Guo et al., 2021) and SynCoBERT (Wang et al., 2021) both extract graphs from the source code (data flow graph and abstract syntax tree, respectively) and train the models to predict the typological relations between the nodes, while SynCoBERT and Code-MVP (Wang et al., 2022) also add type inference to their pretraining stage in the form of tagging. Another common objective is contrastive learning: SynCoBERT and Code-MVP contrast between different views of the input (such as code, comment, AST, and transformed code), while DISCO (Ding et al., 2022) constructs positive sample pairs by semantic-preserving transformations such as obfuscation, and negative pairs by injecting artificial bugs.

5.3 Encoder-Decoders

In NLP, pretrained Transformer encoder-decoders such as T5 (Raffel et al., 2020) and BART (Lewis
et al., 2020) have also left a notable mark in the past few years' advancement in language modeling. T5, for example, unifies all textual tasks into a sequence-to-sequence format and sets new records on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). Compared with encoder-only models, encoder-decoders are naturally more powerful as they can be used for conditional text generation, while their encoder part can always be taken out to perform tasks that require an encoder-only architecture, such as regression (Tay et al., 2023).

Inspired by these advantages of encoder-decoder architecture, many such models have been proposed for code processing. PyMT5 (Clement et al., 2020) and Mastropaolo et al. (2021) replicate the pretraining and multi-task finetuning process of T5 on code corpus, while Ahmad et al. (2021) introduce PLBART, a BART pretrained on 655GB combined data of Java, Python, and natural language. Lachaux et al. (2021) argue that MLM could be too easy a task for programming languages as identifier names often occur multiple times in a single context window, and propose a deobfuscation pretraining objective, where the model is trained to convert obfuscated code back to its original form. Related to this method, we note that meaningful variable names have also been found to have a positive impact on the code generation process of large language models (Chen et al., 2022).

Building on these early works, Wang et al. (2021) propose CodeT5, which is pretrained alternately with 1) T5's original span corruption; 2) identifier tagging (where each token in the code input is tagged as either identifier or non-identifier); 3) masked identifier prediction (a special form of span corruption where all identifiers are masked); and 4) text-to-code & code-to-text generation. Its successor, CodeT5+ (Wang et al., 2023), takes inspiration from UL2 (Tay et al., 2023) and introduces causal language modeling (CLM) into pretraining, along with additional contrastive objectives based on text-code matching.

AlphaCode (Li et al., 2022) is also trained with multiple objectives, where the encoder is trained with MLM and the decoder is trained with CLM, with architecture modifications such as shallow-encoder & deep-decoder, multi-query attention (Shazeer, 2019), and being much larger than CodeT5 (up to 41B parameters). NatGen (Chakraborty et al., 2022), on the other hand, is pretrained with a "naturalization" objective similar to deobfuscation: semantically equivalent but unnatural code is generated by predefined operations such as loop transformation, dead code injection, and variable renaming, and the model is pretrained to translate this unnatural code back to its original form. We note that some of these models are built on previous works. For example, NatGen is initialized with CodeT5, while the largest version of CodeT5+ is initialized from a decoder-only model, CodeGen (Nijkamp et al., 2023).

Apart from these general pretraining objectives, several works have also trained Transformer encoder-decoders with a focus on code translation, which is a natural application of Transformer models in code as the Transformer architecture was originally proposed by Vaswani et al. (2017) for machine translation (MT). However, unlike natural languages, where parallel corpora across two or more human languages exist in abundance, there is little parallel data for code. To tackle this issue, Rozière et al. (2020) propose TransCoder, which first pretrains an encoder with XLM (Conneau and Lample, 2019), and then initializes a vanilla Transformer with this encoder and continues to pretrain it with Denoising Auto-Encoding (DAE, Lewis et al., 2020) and back translation (Sennrich et al., 2016), while its follow-up work (Szafraniec et al., 2023) also utilizes language-independent intermediate representations to enhance this process, which we discuss in more detail in §6.

Apart from training data and objectives, these models mostly keep to the original architectures proposed by the NLP community, as shown in Table 3. Models based on BART, for example, use post-normalization and learnable absolute position embeddings, while those based on T5 use its simplified relative position embeddings and pre-normalization.

5.4 Decoders

After the monumental debut of GPT-3 (Brown et al., 2020) and the discovery of in-context learning, decoder-only Transformer models have become dominant in language modeling (Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Scao et al., 2022; Touvron et al., 2023; Touvron et al., 2023, inter alia). Many models similarly pretrained with CLM have also emerged in code processing, such as GPT-C (Svyatkovskiy et al., 2020), CodeGPT (Lu et al., 2021), PolyCoder (Xu et al., 2022), CodeGen (Nijkamp et al., 2023), PyCodeGPT (Zan et al., 2022), Pangu-Coder (Christopoulou et al., 2022), CodeGeeX (Zheng et al., 2023), Phi-1 (Gunasekar et al., 2023), CodeFuse (Di et al., 2023), CodeShell [5], and DeepSeek Coder [6]. Of these models, several alternative training objectives have been experimented with, such as MLM and Masked CLM [7] in Pangu-Coder, but are found to underperform compared with CLM-only training. Zan et al. (2022) also propose continual training on sketches, where the model learns to first generate a sketch of a program and then the actual code. Notably, Gunasekar et al. (2023) present Phi-1, a small 1.3B model trained on a dataset of only 7B tokens consisting of 6B tokens from StackOverflow and 1B synthetic tokens generated by ChatGPT but achieving 50.6 pass@1 on HumanEval and 55.5 pass@1 on MBPP, comparable to much larger (both in model size and training data size) models such as Code LLaMA or PaLM 2.

Although Christopoulou et al. (2022) report denoising objectives to underperform in decoder-only models, there have been other works that successfully combine denoising or multi-task pretraining with decoder architecture. InCoder (Fried et al., 2023), SantaCoder (Allal et al., 2023), and StarCoder (Li et al., 2023) are all trained with the fill-in-the-middle (FIM) objective, also referred to as causal masking by Fried et al. (2023), which is essentially span corruption (Raffel et al., 2020) adapted to decoder-only architecture. One of the visible advantages of these infilling objectives is that they inject the models with the ability to fill in blanks in the middle of input code at inference time, while CLM allows only for autoregressive generation. As Table 4 shows, however, these objectives also lead to higher performance on downstream tasks when compared with CLM-only models such as CodeGen.

Model               Size     HumanEval   MBPP
PolyCoder           2.7B     5.6         -
CodeGen-Mono        16.1B    29.3        35.3
InCoder             6.7B     15.2        19.4
PyCodeGPT           110M     8.3         -
Pangu-Coder         2.6B     23.8        23.0
SantaCoder          1.1B     14.0        35.0
CodeGeeX            13B      22.9        24.4
StarCoder           15.5B    33.6        52.7
CodeT5+             16B      30.9        -
Phi-1               1.3B     50.6        55.5
CodeFuse            13B      24.8        -

InstructCodeT5+     16B      35.0        -
WizardCoder         15.5B    57.3        51.8
Pangu-Coder 2       15.5B    61.6        -
OctoCoder           15.5B    46.2        -
CodeFuse-SFT        13B      37.1        -

GPT-4               -        67.0/82     -
PaLM 2-S*           -        37.6        50.0
Code LLaMA          34B      53.7        56.2
Phi-1.5             1.3B     41.4        43.5

Table 4: Pass@1 performance of pretrained code models (top), instruction finetuned code models (middle), in comparison with some of the best general language models (bottom), with models in each category ordered chronologically. The sources of these figures can be found in §5.3, §5.4, and Table 1.

Observing Table 3, it is clear that decoder-only models for code have generally followed the practices in NLP more closely, when compared with other model architectures. All these models use pre-normalization, while MQA, RoPE, and parallel attention have also been adopted by several models. Notably, the three most recent models (StarCoder, Phi-1, and CodeFuse) also employ FlashAttention to improve model throughput.

5.5 UniLMs

Following UniLM (Dong et al., 2019) in NLP, several works in code processing have also pretrained this fourth family of Transformer models on code. CugLM (Liu et al., 2020) is trained with both CLM and MLM + NSP via alternating attention masks, while UniXcoder is trained with CLM, MLM, Span Corruption (in Prefix LM style) along with auxiliary objectives including contrastive learning and text-code mutual generation. Both models, however, are relatively small in size, and whether or not this architecture is suitable for code processing is yet to be explored.
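The phrase "alternating attention masks" can be made concrete with a small sketch: a single Transformer is switched between objectives by changing which positions each token is allowed to attend to. The function name `attention_mask`, the mode names, and the 0/1 list-of-lists representation below are illustrative assumptions, not the implementation used by CugLM or UniXcoder.

```python
from typing import List


def attention_mask(seq_len: int, mode: str, prefix_len: int = 0) -> List[List[int]]:
    """Build a seq_len x seq_len mask where mask[i][j] == 1 means
    position i may attend to position j.

    mode:
      "bidirectional" - every token sees every token (MLM-style encoding)
      "causal"        - token i sees only positions j <= i (CLM-style decoding)
      "prefix"        - the first prefix_len tokens are mutually visible,
                        the remaining tokens are decoded causally (prefix LM)
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if mode == "bidirectional":
                allowed = True
            elif mode == "causal":
                allowed = j <= i
            elif mode == "prefix":
                allowed = j < prefix_len or j <= i
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(int(allowed))
        mask.append(row)
    return mask


for mode in ("bidirectional", "causal", "prefix"):
    print(mode)
    for row in attention_mask(4, mode, prefix_len=2):
        print(" ", row)
```

Because only the mask changes between objectives, the same parameters can be shared across understanding-style and generation-style training steps.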
[6] https://github.com/deepseek-ai/
[7] In their paper, MLM is conducted by replacing tokens in the input with <mask> and predicting them from only the left context, while Masked CLM is performed by adding a <mask> in the input and predicting the next token from it. Neither task changes the attention mask patterns of the model.

5.6 Diffusion Models

Currently the Transformer architecture dominates text generation, but several works (Li et al., 2022; Lin et al., 2023) have also adopted Diffusion Models (Ho et al., 2020) from computer vision for text
generation. Recently CodeFusion (Singh et al., 2023) also introduces diffusion models into code modeling, and demonstrates that a 75M diffusion model can outperform StarCoder, CodeT5+, and GPT-3 on 3 code synthesis datasets.

5.7 Instruction Finetuning and Reinforcement Learning for Code

In natural language processing, training models on a diverse set of tasks with instruction prefixes, known as instruction finetuning, has been shown to unlock the ability of cross-task generalization (Ouyang et al., 2022; Chung et al., 2022; Iyer et al., 2022). At first, these instruction data samples are manually compiled or crowd-sourced (Wei et al., 2022; Sanh et al., 2022), but later research finds LLM-generated instructions to be sufficient (Wang et al., 2023; Honovich et al., 2023).

Following these works in natural language, researchers from the code community have applied instruction tuning to their models as well. Wang et al. (2023) finetune CodeT5+ with 20K instruction data generated by InstructGPT (Ouyang et al., 2022) to obtain InstructCodeT5+. WizardCoder (Luo et al., 2023) follows the methods of WizardLM (Xu et al., 2023) to evolve 20K Code Alpaca (Taori et al., 2023) samples into a 78K dataset and uses it to finetune StarCoder. Pangu-Coder 2 (Shen et al., 2023) also uses WizardLM's Evol-Instruct to generate 68K instruction samples from 20K Code Alpaca, but also introduces reinforcement learning via Rank Responses to align Test & Teacher Feedback (RRTF). OctoCoder (Muennighoff et al., 2023), on the other hand, takes a different path and uses Git commit histories as instruction data to finetune StarCoder and CodeGeeX2. More recently, CodeFuse (Di et al., 2023) also employs multitask-finetuning and explicitly introduces multiple downstream tasks into their instruction data. The performance of these instruction finetuned code models can also be found in Table 4.

In NLP, another technology closely related to instruction finetuning is reinforcement learning from human feedback (RLHF), which has played a significant role in aligning LLMs with human values (Ouyang et al., 2022; Bai et al., 2022). The merit of reinforcement learning is that it can incorporate non-differentiable reward signals into training, such as BLEU (Bahdanau et al., 2017) and human preference (Christiano et al., 2017), but the human feedback required in aligning LLMs often involves extensive labor on annotation. In comparison, applying reinforcement learning to code models has a natural advantage, as compilers can be used for automatically generating feedback for code samples produced by language models.

CodeRL (Le et al., 2022) is one such model, which defines four levels of rewards for each generated program (viz. compile error, runtime error, unit test failure, pass) as well as fine-grained token-level reward estimated by a critic model. The actor model, which is an extension of CodeT5, is then trained with the REINFORCE algorithm (Williams, 1992). Similarly, CompCoder (Wang et al., 2022) and PPOCoder (Shojaee et al., 2023) train CodeGPT and CodeT5 respectively with proximal policy optimization (Schulman et al., 2017), while RLTF (Liu et al., 2023) proposes fine-grained feedback based on the error information and location provided by the compiler, as well as adaptive feedback that takes the ratio of passed test cases into account.

6 Code Features for Language Models

A major difference between programming languages and natural languages is that the former is artificially defined to be precise and unambiguous, and needs to be compiled (or interpreted) without error before execution. This allows for a much larger flexibility in designing pretraining objectives on code, besides lexical manipulations such as CLM, MLM, and Span Corruption. A similar trend can be observed in the last years before neural networks were introduced into mainstream NLP literature (Sutskever et al., 2014; Bahdanau et al., 2015), when researchers in the MT community utilized alternative views of text such as syntactic features to improve the performance of SMT systems (Galley et al., 2006; Chiang, 2007). These features, however, are not universally applicable or even agreed upon, and often result in highly complicated systems (for example, the size of English part-of-speech tagging's label set may range from dozens to hundreds).

Programming languages, however, fare much better in these aspects. Each mainstream programming language, such as C, Python, and Java, comes with readily available compiler toolkits that allow for easy and accurate extraction of semantic information such as Abstract Syntax Tree (AST), language-independent Intermediate Representation
(IR), and auxiliary information such as type of each token and control/data flow graph (CFG/DFG). Thus, in the context of Transformer-based language modeling for code, many works have incorporated these features into their training procedure.

6.1 Abstract Syntax Tree and Intermediate Representation

AST is one of the most common intermediate results of the compiling process, where a program is parsed into a tree of operations and their operands. Before the popularization of Transformer in the code processing community, there had been works such as InferCode (Bui et al., 2021) that processes these representations with special network architectures like Tree-Based CNN and conducts self-supervised pretraining by predicting subtrees.

TreeBERT (Jiang et al., 2021) is one of the first attempts to take AST into the Transformer-based pretraining-finetuning framework. It is a Transformer encoder-decoder pretrained with Tree MLM and Node Order Prediction, where the encoder takes a set of constituent paths in the AST as input (with each token being a path, which is the concatenation of its nodes' representations) while the decoder takes the code as input. Tree MLM is then performed by masking certain nodes in a path representation and its corresponding code tokens in the decoder input, while Node Order Prediction is accomplished by swapping nodes in a path and predicting it with a [CLS] token similar to BERT.

The method used by TreeBERT, however, is complicated and does not scale well. Later works mostly opt to first process AST into a text sequence and treat it like a normal part of the input. Wang et al. (2021), for example, process AST with depth-first traversal and concatenate it with code and comment, and then train SynCoBERT (which, unlike TreeBERT, is actually a BERT-like encoder-only model) with four objectives: 1) MLM; 2) identifier tagging; 3) AST edge prediction (predicting whether there exists an edge between two AST nodes from the dot product of these nodes' representations); and 4) contrastive learning over i) code and AST pairs, as well as ii) text and code-AST pairs. Similarly, SPT-Code (Niu et al., 2022), a Transformer encoder-decoder, takes the concatenation of code, sequentialized AST, and text as input, and is pretrained with 1) span corruption; 2) code-AST prediction (NSP with one segment being code and one segment being AST); and 3) method name generation, a special form of span corruption where a method name is masked. Different from other works, however, they do not take the docstrings as the text segment in their input, but instead concatenate all method names appearing in the code as a succinct natural language description. Likewise, UniXcoder (Guo et al., 2022) takes flattened AST instead of source code as its input during training.

In the compiling pipeline, AST is usually followed by language-independent intermediate representations, such as LLVM IR (Lattner and Adve, 2004). Such features' independence from specific programming languages makes them suitable candidates for translation pivots, as is English in machine translation of low-resource natural languages (Leng et al., 2019). Szafraniec et al. (2023) take advantage of this characteristic and extend TransCoder (Rozière et al., 2020) with translation language modeling (Conneau and Lample, 2019) over code and IR, as well as IR generation from code. They also investigate other objectives such as IR decompilation (i.e. generating code from IR) and IR pivot (i.e. directly generating code in one language from the IR of another language), both showing promising results.

6.2 Control Flow and Data Flow

While AST and IR have proved to be useful information in certain tasks such as code translation, they are static by nature, just like the source code, and may fail to capture semantic properties of code that are only revealed at runtime (Wang and Su, 2020). Such semantics, however, are contained in dynamic features such as control flow and data flow. Similar to AST, specialized networks were used to process such information before the rise of pretrained Transformers, such as Message Passing Neural Network used by ProGraML (Cummins et al., 2021). Unlike AST, however, even after pretrained Transformers became dominant few works have looked in this direction.

GraphCodeBERT (Guo et al., 2021) is one such work, which creates special tokens and position embeddings for variables in the flow graph, and concatenates the variable sequence after text and source code to construct model input, with tailored attention masks on the code and variable segments: tokens from code segment and variable segment can attend to each other if and only if the variable is identified from the code token, and for tokens within the variable segment, v_i is allowed
to attend to v_j if there is a direct edge from v_j to v_i while ViperGPT (Surís et al., 2023) extends it fur-
in the dataflow. The model is then pretrained with ther by calling vision APIs to extract information
MLM in combination with edge prediction and from visual input and answer related questions.
node alignment, both of which are accomplished Apart from alleviating the burden of numeri-
by binary classification from the dot product of two cal calculation in abstract reasoning tasks, an inter-
tokens’ representations (one from code segment preter also provides feedback on the process of
and one from variable segment for node alignment, code generation itself, together with unit tests.
and both from variable segment for edge predic- CodeT (Bareiß et al., 2022) and TiCoder (Chen
tion). et al., 2023) use Codex to generate unit tests, which
are run against generated code samples to improve
6.3 Type the model’s performance on code synthesis. Sim-
Apart from AST, IR, and data flow, type informa- ilarly, TransCoder-ST (Rozière et al., 2022) aug-
tion has also been used to aid language models in ments TransCoder and DOBF with external unit
processing code. CugLM (Liu et al., 2020), for tests for code translation. In §5.7, we have also
example, uses type information during finetuning shown that the execution results on unit tests serve
to aid in the prediction of tokens for unidirectional as natural supervision signals for reinforcement
MLM (i.e. MLM with unidirectional attention learning on code.
mask): the type of a masked token is first predicted Notably, in March 2023 OpenAI also released an
from the final Transformer layer’s representation, interpreter plugin for ChatGPT8 , which can accept
and then the token itself is predicted based on both file inputs from users, generate code according to
the hidden representation and predicted type. In user instructions, and provide feedback via real-
contrast, both CodeT5 (Wang et al., 2021) and Syn- time execution. Zhou et al. (2023) show that this
CoBERT (Wang et al., 2021) include identifier tag- feature allows GPT-4 to self-debug.
ging in their pretraining objectives, which can be A topic closely related to tool using in LLM
viewed as coarse-grained type prediction. researches is planning as intelligent agents, which
Notably, Wang et al. (2022) integrate many has been shown to enhance LLMs’ capability both
of the aforementioned features into Code-MVP: theoretically and empirically (Feng et al., 2023).
source code, docstrings, AST, CFG, and trans- Ruan et al. (2023) find that LLMs can plan to solve
formed source code via identifier renaming, loop complex tasks using external SQL generators and
exchange, and dead code insertion. The model, Python generators, while CodePlan (Bairi et al.,
initialized from GraphCodeBERT, is then trained 2023) demonstrates they can perform repository-
with MLM, fine-grained type prediction, and con- level coding via adaptive planning.
trastive learning across different views, such as text Another stream of works use LLMs to create
vs. code, code vs. AST, and code vs. CFG. multi-agent systems for code generation, such
as self-collaboration (Dong et al., 2023), Chat-
7 LLMs in Software Development Dev (Qian et al., 2023), and MetaGPT (Hong et al.,
2023). In these frameworks, multiple LLMs are
As language models set new records on software en-
prompted to play distinct roles such as programmer,
gineering benchmarks, software engineering tech-
reviewer, and manager. These roles interact with
nologies are also expanding the boundaries of lan-
each other, breakdown code generation into differ-
guage models in return, and have subsequently led
ent phases (e.g. designing, coding, testing, and
them into real-world development cycles.
documenting), and collaborate to complete com-
7.1 LLMs Extended with Coding Tools plex tasks.
Researches in the NLP community have shown 7.2 LLMs Integrated into Software
that LLMs can learn to use external tools such as Development
calculators, MT systems, and search engines (Thop-
pilan et al., 2022; Schick et al., 2023). As such, With the increase in LLMs’ interactive coding ca-
interpreter has been used to augment LLMs in pability, researchers have also started to integrate
complex reasoning tasks. PAL (Gao et al., 2023) them into each and every process of software de-
and PoT (Chen et al., 2022) both extend Codex 8
with Python interpreters for numerical calculations, code-interpreter
velopment. With these in mind, we identify several chal-
Auto code completion is one of the earliest ap- lenges in the current development of code model-
plications of language models in software develop- ing.
ment, as they require only the ability to predict the - Comprehensive benchmarks to push code
next token. Even before language models scaled to LLMs to the next stage. The widely used Hu-
billions of parameters, there had been integration of manEval benchmark plays a key role in the evo-
completion systems such as Pythia (Svyatkovskiy lution of Code LLMs. However, it is relatively
et al., 2019) and IntelliCode (Svyatkovskiy et al., small and its scoreboard has been manipulated to
2020) into popular IDEs. near perfect, which does not exactly reflect real-
Recently, however, the application of code lan- world behaviors. Many other benchmarks for Code
guage models have transcended simple code com- LLMs have been proposed, but they are still not
pletion. GitHub Copilot is arguably one of the most comprehensive enough to reflect production-level
popular AI code assistants, with diverse features in- requirements. The community is eager for a new
cluding code generation, vulnerability detection, standard benchmark after HumanEval to further
and license management9 , while CodeFuse (Di boost the progress of Code LLMs to the next stage.
et al., 2023) also integrates code generation, code - Acquisition of high-quality data. With Gu-
translation, code commenting, and testcase genera- nasekar et al. (2023) achieving SOTA performance
tion into a single IDE extension. As code language with a 1.3B model trained on textbook data, we
models become larger, however, their client-side believe the selection of training data and utilization
deployment and real-time performance also raise of synthetic data will be ever more prominent in
new challenges. the near future, for both self-supervised pretraining
As LLMs continue to advance, building applica- and supervised finetuning.
tions on top of them is also evolving into a conse- - Integration of code features into language
quential task itself. Many open-source frameworks models. As we noted in §6.2, CFG and DFG are
for such applications have been released, including yet to be employed at scale in code language model-
LangChain10 , AutoGPT11 , and WorkGPT12 . These ing. The few works that do employ data flow make
frameworks provide abstractions over language changes to the models’ attention masks, which
models for developers, and are actively revolution- severely limits their cross-task generalization and
izing the entire process of software development scaling ability. We believe the seamless integra-
even as this survey is being finalized. tion of such features into textual input is worth
researching in the future.
8 Conclusion and Challenges - Application of LLMs in more code down-
stream tasks. As we have pointed out in §3, cur-
In this work, we systematically reviewed the history
rent evaluation of LLMs’ coding capability is fo-
of code processing with pretrained Transformer
cused on program synthesis, and Figure 3 clearly
language models, and highlighted their relations
shows that tasks related to software testing (viz.
and comparisons to models pretrained on general
unit test generation, assertion generation, mutant
domains. The advancement in code modeling gen-
generation, and fuzzing) and deobfuscation have
erally follows the history course of NLP, evolv-
seen few application of LLMs. Besides, since the
ing from SMT models, to NMT models, and then
context window of LLMs are currently quite lim-
to finetuning pretrained Transformers and lastly
ited, generation tasks such as program synthesis
to few-shot application of LLMs and even au-
and code translation are yet to be applied beyond
tonomous agents in real-world production. Unlike
method level. In §3.4, we have listed several works
natural languages, the nature of code makes it easy
on repository-level code completion and temporal
to extract auxiliary information from alternative
editing, and we believe the application of LLMs
views, and to utilize interpreter and unit tests for
in more repository-level tasks will become a hot
automatic feedback.
research top in the future.
https://github.com/features/copilot - Alternative model architectures and training
11 objectives. In Table 3, we have shown that many
AutoGPT code language models are pretrained with auxiliary
https://github.com/team-openpm/workgpt objectives specific to code, but these models all be-
Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 11783–11790. AAAI Press.

[500] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, pages 341–353.

[501] Deqing Zou, Sujuan Wang, Shouhuai Xu, Zhen Li, and Hai Jin. 2021. µVulDeePecker: A deep learning-based system for multiclass vulnerability detection. IEEE Trans. Dependable Secur. Comput.,

A Benchmarks for Downstream Tasks

Tables 5, 6, 7, and 8 list benchmark datasets for code downstream tasks.
| Task | Date | Benchmark | Source | Size | Language |
|---|---|---|---|---|---|
| Text-to-SQL | 1990 | ATIS | Hemphill et al. (1990); Dahl et al. (1994) | 11508 | |
| Text-to-SQL | 1996 | GeoQuery | Zelle and Mooney (1996) | 877 | |
| Text-to-SQL | 2000 | Restaurants | Tang and Mooney (2000) | 378 | |
| Text-to-SQL | 2014-09 | MAS | Li and Jagadish (2014) | 196 | |
| Text-to-SQL | 2017-02 | Yelp | Yaghmazadeh et al. (2017) | 128 | |
| Text-to-SQL | 2017-02 | IMDb | Yaghmazadeh et al. (2017) | 131 | |
| Text-to-SQL | 2017-04 | Scholar | Iyer et al. (2017) | 816 | |
| Text-to-SQL | 2017-08 | WikiSQL | Zhong et al. (2017) | 80654 | |
| Text-to-SQL | 2018-06 | Advising | Finegan-Dollak et al. (2018) | 4570 | |
| Text-to-SQL | 2018-09 | Spider | Yu et al. (2018) | 10181 | |
| Text-to-SQL | 2019-06 | SParC | Yu et al. (2019) | 12726 | |
| Text-to-SQL | 2019-07 | MIMICSQL | Wang et al. (2020) | 10000 | |
| Text-to-SQL | 2019-09 | CoSQL | Yu et al. (2019) | 15598 | |
| Text-to-SQL | 2020-10 | Squall | Shi et al. (2020) | 11276 | |
| Text-to-SQL | 2021-06 | SEDE | Hazoom et al. (2021) | 12023 | |
| Text-to-SQL | 2021-06 | KaggleDBQA | Lee et al. (2021) | 400 | |
| Program Synthesis | 2018-08 | CONCODE | Iyer et al. (2018) | 104K | Java |
| Program Synthesis | 2021-05 | APPS | Hendrycks et al. (2021) | 10000 | Python |
| Program Synthesis | 2021-07 | HumanEval | Chen et al. (2021) | 164 | Python |
| Program Synthesis | 2021-08 | MBPP | Austin et al. (2021) | 974 | Python |
| Program Synthesis | 2021-08 | MathQA-Python | Austin et al. (2021) | 23914 | Python |
| Program Synthesis | 2022-06 | AixBench | Hao et al. (2022) | 336 | Java |
| Program Synthesis | 2022-11 | DS-1000 | Lai et al. (2023) | 1000 | Python |
| Program Synthesis | 2023-02 | CoderEval | Yu et al. (2023) | 460 | Python, Java |
| Program Synthesis | 2023-03 | HumanEval-X | Zheng et al. (2023) | 820 | Python, C++, Java, JS, Go |
| Program Synthesis | 2023-09 | CodeApex | Fu et al. (2023) | 476 | C++ |
| Code Translation | 2020-06 | GeeksforGeeks | Rozière et al. (2020) | 1.4K | C++, Java, |
| Code Translation | 2021-02 | CodeTrans | Lu et al. (2021) | 11.8K | Java, C# |
| Code Translation | 2021-08 | Avatar | Ahmad et al. (2023) | 9515 | Java, Python |
| Code Translation | 2022-06 | CoST | Zhu et al. (2022) | ∗ 132K | C++, Java, Python, C#, JS, |
| Code Translation | 2022-06 | XLCoST | Zhu et al. (2022) | ∗ 567K | C++, Java, Python, C#, JS, |
| Code Translation | 2023-03 | HumanEval-X | Zheng et al. (2023) | ∗ 1640 | Python, C++, Java, JS, Go |
| Code Translation | 2023-08 | G-TransEval | Jiao et al. (2023) | ∗ 4000 | C++, Java, C#, JS, Python |

Table 5: Benchmarks for text-to-SQL generation, program synthesis, and code translation. JS is short for JavaScript. ∗ These are pairwise sample counts. For example, HumanEval-X includes 164 programs, each implemented in 5 languages, totaling 164 × (5 × 4 / 2) = 1640 translation pairs.
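The pairwise count in the Table 5 caption can be checked with a short calculation. The following is a minimal sketch, not part of the benchmark's own tooling: the function name `translation_pairs` is hypothetical, and it assumes that every program is implemented in every language and that each unordered language pair is counted once, as in the HumanEval-X example.

```python
from itertools import combinations
from math import comb


def translation_pairs(num_programs: int, languages: list[str]) -> int:
    """Count unordered language pairs summed over all programs.

    Assumption: each program exists in every listed language, and a
    pair (A, B) is counted once, not once per translation direction.
    """
    return num_programs * comb(len(languages), 2)


# Languages listed for HumanEval-X in Table 5.
languages = ["Python", "C++", "Java", "JS", "Go"]

# Enumerate the unordered language pairs explicitly: 5 * 4 / 2 = 10.
pairs = list(combinations(languages, 2))
print(len(pairs))                          # 10
print(translation_pairs(164, languages))   # 1640
```

Counting each translation direction separately would double the figure, so the convention matters when comparing sizes across code translation benchmarks.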
| Task | Date | Benchmark | Source | Size | Language |
|---|---|---|---|---|---|
| Program Repair | 2014-07 | Defects4J | Just et al. (2014) | 357 | Java |
| Program Repair | 2015-12 | ManyBugs | Goues et al. (2015) | 185 | C |
| Program Repair | 2015-12 | IntroClass | Goues et al. (2015) | 998 | C |
| Program Repair | 2016-11 | BugAID | Hanam et al. (2016) | 105K | JS |
| Program Repair | 2017-02 | DeepFix | Gupta et al. (2017) | 6971 | C |
| Program Repair | 2017-05 | Codeflaws | Tan et al. (2017) | 3902 | C |
| Program Repair | 2017-10 | QuixBugs | Lin et al. (2017) | 80 | Java, Python |
| Program Repair | 2018-12 | BFP | Tufano et al. (2019) | 124K | Java |
| Program Repair | 2019-01 | unnamed | Tufano et al. (2019) | 21.8K | Java |
| Program Repair | 2019-05 | ManySStuBs4J | Karampatsis and Sutton | 154K | Java |
| Program Repair | 2019-11 | Refactory | Hu et al. (2019) | 1783 | Python |
| Program Repair | 2020-07 | CoCoNut | Lutellier et al. (2020) | 24M | Java, Python, C, |
| Program Repair | 2020-11 | BugsInPy | Widyasari et al. (2020) | 493 | Python |
| Program Repair | 2021-07 | TFix | Berabi et al. (2021) | 105K | JS |
| Program Repair | 2022-11 | TypeBugs | Oh and Oh (2022) | 93 | Python |
| Program Repair | 2023-08 | HumanEvalPack | Muennighoff et al. (2023) | 984 | Python, JS, Go, Java, C++, Rust |
| Code Summarization | 2016-08 | CODE-NN | Iyer et al. (2016) | 66K/32K | C#/SQL |
| Code Summarization | 2017-07 | unnamed | Barone and Sennrich (2017) | 150K | Python |
| Code Summarization | 2018-05 | DeepCom | Hu et al. (2018) | 588K | Java |
| Code Summarization | 2018-07 | TL-CodeSum | Hu et al. (2018) | 411K | Java |
| Code Summarization | 2019-09 | CodeSearchNet | Husain et al. (2019) | 2.3M | Go, JS, Python, PHP, Java, Ruby |
| Code Summarization | 2023-08 | HumanEvalPack | Muennighoff et al. (2023) | 984 | Python, JS, Go, Java, C++, Rust |
| Code Completion ∗ | 2013-05 | GitHub Java Corpus | Allamanis and Sutton (2013) | 2.1M | Java |
| Code Completion ∗ | 2016-10 | Py150 | Raychev et al. (2016) | 150K | Python |
| Code Completion ∗ | 2016-10 | JS150 | Raychev et al. (2016) | 150K | JS |
| Code Completion ∗ | 2023-06 | LCC | Guo et al. (2023) | 360K | Python, Java, C# |

Table 6: Benchmarks for program repair, code summarization, and code completion. JS is short for JavaScript. ∗ The task of code completion can be evaluated on any source code corpus, so we only list a few widely used benchmarks here. For cross-file code completion please refer to Table 8.
| Task | Date | Benchmark | Source | Size | Language |
|---|---|---|---|---|---|
| Code Retrieval | 2018-03 | StaQC | Yao et al. (2018) | 268K | Python, SQL |
| Code Retrieval | 2018-05 | DeepCS | Gu et al. (2018) | 16M | Java |
| Code Retrieval | 2018-05 | CoNaLa | Yin et al. (2018) | ∗ 600K/2.9K | Python |
| Code Retrieval | 2019-08 | unnamed | Li et al. (2019) | 287 | Java |
| Code Retrieval | 2019-09 | CodeSearchNet | Husain et al. (2019) | ∗ 2.3M/99 | Go, JS, Python, PHP, Java, Ruby |
| Code Retrieval | 2020-02 | CosBench | Yan et al. (2020) | 52 | Java |
| Code Retrieval | 2020-08 | SO-DS | Heyman and Cutsem (2020) | 2.2K | Python |
| Code Retrieval | 2020-10 | FB-Java | Ling et al. (2021) | 249K | Java |
| Code Retrieval | 2021-02 | AdvTest | Lu et al. (2021) | 280K | Python |
| Code Retrieval | 2021-02 | WebQueryTest | Lu et al. (2021) | 1K | Python |
| Code Retrieval | 2021-05 | CoSQA | Huang et al. (2021) | 21K | Python |
| Code Reasoning | 2020-09 | MMLU | Hendrycks et al. (2021) | † 15908 | |

Table 7: Benchmarks for code retrieval, code reasoning, type inference, clone detection/code search, and defect/vulnerability detection. JS is short for JavaScript. ∗ These benchmarks include a large number of automatically constructed samples, and a small set of human-annotated samples. † These are general-domain reasoning benchmarks, and only a subset therein concern programming, algorithms, and other topics related to computer science. These are project counts (or, in the case of Cassano et al. (2023), file counts). Yee and Guha (2023) propose to measure project-level type check rate instead of type prediction accuracy for TypeScript.
| Task | Date | Benchmark | Source | Size | Language |
|---|---|---|---|---|---|
| Log Parsing | 2018-11 | LogHub (2018) | Zhu et al. (2019); He et al. (2020) | 379M | |
| Log Parsing | 2023-08 | LogHub (2023) | Jiang et al. (2023) | ∗ 50.4M | |

Table 8: Benchmarks for log parsing and repository-level coding. ∗ LogHub (2023) is an annotated subset of LogHub (2018). † Line Completion/API Invocation Completion/Function Completion. ‡ Retrieval/Completion/Pipeline. Migration/Temporal Edit.