
A Survey on Large Language Models for Code Generation

Juyong Jiang (jjiang472@connect.hkust-gz.edu.cn), The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Fan Wang (fwang380@connect.hkust-gz.edu.cn), The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Jiasi Shen (sjs@cse.ust.hk), The Hong Kong University of Science and Technology, Hong Kong, China; Sungju Kim (sungju.kim@navercorp.com), NAVER Cloud, Seoul, South Korea; and Sunghun Kim (hunkim@cse.ust.hk), The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
(2024)
Abstract.

Large Language Models (LLMs) have achieved remarkable advances across diverse code-related tasks. LLMs specialized for such tasks are known as Code LLMs, and they are particularly effective at code generation, i.e., producing source code from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, from the perspective of natural language processing (NLP), software engineering (SE), or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLMs for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the widely recognized HumanEval and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource website (https://codellm.github.io) to continuously document and disseminate the most recent advances in the field.

Large Language Models, Code Large Language Models, Code Generation
Journal: TOSEM, 2024. CCS Concepts: General and reference → Surveys and overviews; Software and its engineering → Software development techniques; Computing methodologies → Artificial intelligence.

1. Introduction

The advent of Large Language Models (LLMs) such as ChatGPT (OpenAI, 2022; https://chat.openai.com) has profoundly transformed the landscape of automated code-related tasks (Chen et al., 2021), including code completion (Wang and Li, 2021; Lu et al., 2022; Guo et al., 2023; Wu et al., 2024), code translation (Lachaux et al., 2020; Szafraniec et al., 2022; Chen et al., 2018), and code repair (Jiang et al., 2021; Olausson et al., 2023; Parasaram et al., 2024). A particularly intriguing application of LLMs is code generation, a task that involves producing source code from natural language descriptions. Despite varying definitions across studies (Ren et al., 2020; Chen et al., 2023; Shojaee et al., 2023; Wang et al., 2023b), for the purposes of this survey, we adopt a consistent definition of code generation as the natural-language-to-code (NL2Code) task (Austin et al., 2021; Athiwaratkun et al., 2022; Zan et al., 2023). This area has garnered substantial interest from both academia and industry, as evidenced by the development of tools like GitHub Copilot (Chen et al., 2021; https://github.com/features/copilot), CodeGeeX (Zheng et al., 2023c; https://codegeex.cn/en-US), and Amazon CodeWhisperer (https://aws.amazon.com/codewhisperer), which leverage groundbreaking code LLMs to facilitate software development.

Initial investigations into code generation primarily utilized heuristic rules or expert systems, such as probabilistic grammar-based frameworks (Joshi and Rambow, 2003; Cohn et al., 2010; Allamanis and Sutton, 2014) and specialized language models (De Moura and Bjørner, 2008; Gulwani, 2010; Jha et al., 2010). These early techniques were typically rigid and difficult to scale. However, the introduction of Transformer-based LLMs has shifted the paradigm, establishing them as the preferred method due to their superior proficiency and versatility. One remarkable aspect of LLMs is their capability to follow instructions (Wei et al., 2022a; Ouyang et al., 2022; Xu et al., 2023; Muennighoff et al., 2023; Chung et al., 2024), enabling even novice programmers to write code by simply articulating their requirements. This emergent ability has democratized coding, making it accessible to a broader audience (Zan et al., 2023). The performance of LLMs on code generation tasks has seen remarkable improvements, as illustrated by the HumanEval leaderboard (https://paperswithcode.com/sota/code-generation-on-humaneval), where Pass@1 has risen from 3.6% for PaLM 8B (Chowdhery et al., 2023) to 95.1% for LDB (Zhong et al., 2024). The HumanEval benchmark (Chen et al., 2021) has thus been established as a de facto standard for evaluating the coding proficiency of LLMs.

To trace this evolution chronologically, we present an overview of the development of LLMs for code generation, as illustrated in Figure 1. The landscape of LLMs for code generation is characterized by a spectrum of models, with certain models like ChatGPT (Ouyang et al., 2022), GPT-4 (Achiam et al., 2023), LLaMA (Touvron et al., 2023a, b), and Claude 3 (Anthropic, 2024) serving general-purpose applications, while others such as StarCoder (Li et al., 2023a; Lozhkov et al., 2024), Code LLaMA (Roziere et al., 2023), DeepSeek-Coder (Guo et al., 2024), and Code Gemma (CodeGemma Team et al., 2024) are tailored specifically for code-centric tasks. The convergence of code generation with the latest LLM advancements is pivotal, especially when programming languages can be considered as distinct dialects of multilingual natural language (Athiwaratkun et al., 2022; Zheng et al., 2023c). These models are not only tested against software engineering (SE) requirements but also propel the advancement of LLMs into practical production (Zhang et al., 2023b).

While recent surveys have shed light on code LLMs through the lenses of Natural Language Processing (NLP), Software Engineering (SE), or a combination of both disciplines (Zan et al., 2023; Zheng et al., 2023a; Zhang et al., 2023b; Hou et al., 2024), they have often encompassed a broad range of code-related tasks. There remains a dearth of literature specifically reviewing advanced topics in code generation, such as meticulous data curation, instruction tuning, alignment with feedback, prompting techniques, the development of autonomous coding agents, retrieval-augmented code generation, and LLM-as-a-Judge for code generation, among others. Notably pertinent studies (Athiwaratkun et al., 2022; Zan et al., 2023) also concentrate on LLMs for text-to-code generation (NL2Code), yet they primarily examine models released from 2020 to 2022. Consequently, this noticeable temporal gap has resulted in an absence of up-to-date literature reviews that consider the latest advancements, including models like CodeQwen (Team, 2024), WizardCoder (Luo et al., 2023), and PPOCoder (Shojaee et al., 2023), as well as a comprehensive exploration of the advanced topics previously mentioned.

Recognizing the need for a dedicated and up-to-date literature review, this survey endeavors to fill that void. We provide a systematic review that will serve as a foundational reference for researchers who want to quickly explore the latest progress in LLMs for code generation. A taxonomy is introduced to categorize and examine recent advancements, encompassing data curation (Wang et al., 2023a; Luo et al., 2023; Wei et al., 2023), advanced topics (Parvez et al., 2021; Lu et al., 2022; Le et al., 2022; Muennighoff et al., 2023; Liu et al., 2023d; Chen et al., 2022b; Ni et al., 2023a; Chen et al., 2023; Huang et al., 2023; Shrivastava et al., 2023a; Zhang et al., 2023c), evaluation methods (Chen et al., 2021; Hendrycks et al., 2021; Jimenez et al., 2023; Zhuo, 2024), and practical applications (Chen et al., 2021; Zheng et al., 2023c). This taxonomy aligns with the complete lifecycle of an LLM for code generation. Furthermore, we pinpoint critical challenges and identify promising opportunities to bridge the gap between research and practice. This survey therefore equips NLP and SE researchers with a thorough understanding of LLMs for code generation, highlighting cutting-edge directions as well as current hurdles and prospects.

The remainder of the survey is organized following the structure outlined in our taxonomy in Figure 3. In Section 2, we introduce the preliminaries of LLM with Transformer architecture and formulate the task of LLM for code generation. Then, in Section 3, we propose a taxonomy, categorizing the complete process of LLMs in code generation. Section 4 delves into the specifics of LLMs for code generation within this taxonomy framework. In Section 5, we underscore the critical challenges and promising opportunities for bridging the research-practicality gap and conclude this work in Section 6.

Figure 1. A chronological overview of large language models (LLMs) for code generation in recent years. The timeline was established mainly according to the release date. The models with publicly available model checkpoints are highlighted in green color.

2. Background

2.1. Large Language Models

The effectiveness of large language models (LLMs) is fundamentally attributed to their substantial number of model parameters, large-scale and diversified datasets, and the immense computational power utilized during training (Kaplan et al., 2020; Hoffmann et al., 2022). Generally, scaling up language models consistently results in enhanced performance and sample efficiency across a broad array of downstream tasks (Wei et al., 2022a; Zhao et al., 2023b). Moreover, once the model size grows beyond a certain scale (e.g., GPT-3 (Brown et al., 2020) with 175B parameters and PaLM (Chowdhery et al., 2023) with 540B parameters), LLMs exhibit an unpredictable phenomenon known as emergent abilities, including instruction following (Ouyang et al., 2022), in-context learning (Dong et al., 2022), and step-by-step reasoning (Wei et al., 2022b; Huang and Chang, 2022), which are absent in smaller models but apparent in larger ones (Wei et al., 2022a). It should be noted that an LLM is not necessarily superior to a smaller language model, and emergent abilities may not manifest in all LLMs (Zhao et al., 2023b).

Code LLMs adopt the same Transformer architecture (Vaswani et al., 2017) as general LLMs but are specifically pre-trained on large-scale unlabeled code corpora, whereas general-purpose LLMs (e.g., ChatGPT (OpenAI, 2022)) are pre-trained on a blend of code and text data. Analogous to LLMs, Code LLMs can also be classified into three architectural categories: encoder-only models, decoder-only models, and encoder-decoder models. Encoder-only models, such as CodeBERT (Feng et al., 2020), are typically suitable for code comprehension tasks including type prediction, code retrieval, and clone detection. Decoder-only models, such as StarCoder (Li et al., 2023a), predominantly excel in generation tasks, such as code generation, code translation, and code summarization. Encoder-decoder models, such as CodeT5 (Wang et al., 2021), can accommodate both code understanding and generation tasks but do not necessarily outperform encoder-only or decoder-only models. The overall architectures of the different Code LLMs for code generation are depicted in Figure 2.

In the following subsections, we delineate the key modules of the Transformer layers in Code LLMs.

2.1.1. Multi-Head Self-Attention Modules

Each Transformer layer incorporates a multi-head self-attention (MHSA) mechanism to discern the inherent semantic relationships within a sequence of tokens across $h$ distinct latent representation spaces. Formally, the MHSA employed by the Transformer can be formulated as follows:

$$\mathbf{h}^{(l)} = \operatorname{MultiHeadSelfAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{Concat}\left\{\mathrm{Head}_i\right\}_{i=1}^{h} \mathbf{W}^{\mathbf{O}}, \tag{1}$$
$$\operatorname{Head}_i = \operatorname{Attention}(\underbrace{\mathbf{H}^{(l-1)} \mathbf{W}_i^{\mathbf{Q}}}_{\mathbf{Q}}, \underbrace{\mathbf{H}^{(l-1)} \mathbf{W}_i^{\mathbf{K}}}_{\mathbf{K}}, \underbrace{\mathbf{H}^{(l-1)} \mathbf{W}_i^{\mathbf{V}}}_{\mathbf{V}}), \tag{2}$$
$$\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{model}/h}}\right) \mathbf{V}, \tag{3}$$

where $\mathbf{H}^{(l-1)} \in \mathbb{R}^{n \times d_{model}}$ denotes the input to the $l$-th Transformer layer, while $\mathbf{h}^{(l)} \in \mathbb{R}^{n \times d_{model}}$ represents the output of the MHSA sub-layer. The number of distinct attention heads is $h$, and $d_{model}$ refers to the model dimension. The projections $\mathbf{W}_i^{\mathbf{Q}}, \mathbf{W}_i^{\mathbf{K}}, \mathbf{W}_i^{\mathbf{V}} \in \mathbb{R}^{d_{model} \times d_{model}/h}$ are the transformation parameters of each attention head $\operatorname{Head}_i$ and produce the Query $\mathbf{Q}$, Key $\mathbf{K}$, and Value $\mathbf{V}$, while $\mathbf{W}^{\mathbf{O}} \in \mathbb{R}^{d_{model} \times d_{model}}$ transforms the concatenated head outputs into the output of the attention sub-layer. The $\operatorname{softmax}$ function is applied in a row-wise manner. The dot-products of queries and keys are divided by a scaling factor $\sqrt{d_{model}/h}$ to counteract the risk of excessively large inner products and the correspondingly diminished gradients of the $\operatorname{softmax}$ function, thus encouraging a more balanced attention landscape.
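Equations 1 to 3 can be illustrated with a minimal sketch in plain Python, with matrices stored as lists of rows. The helper names (`matmul`, `softmax_rows`, `random_matrix`), the toy sizes, and the random weights are assumptions made for illustration. The sketch also assumes that $\mathbf{W}^{\mathbf{O}}$ is a single $d_{model} \times d_{model}$ matrix applied to the concatenated heads, as Equation 1 suggests.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Equation 3 for one head: softmax(Q K^T / sqrt(d_model / h)) V."""
    d_head = len(q[0])  # equals d_model / h
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def multi_head_self_attention(h_prev, w_q, w_k, w_v, w_o):
    """Equations 1-2: run every head, concatenate along features, apply W^O."""
    heads = [
        attention(matmul(h_prev, wq), matmul(h_prev, wk), matmul(h_prev, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    concat = [sum((head[i] for head in heads), []) for i in range(len(h_prev))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights stand in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, num_heads = 4, 8, 2  # toy sizes chosen for illustration
d_head = d_model // num_heads
H_prev = random_matrix(n, d_model, rng)
W_q = [random_matrix(d_model, d_head, rng) for _ in range(num_heads)]
W_k = [random_matrix(d_model, d_head, rng) for _ in range(num_heads)]
W_v = [random_matrix(d_model, d_head, rng) for _ in range(num_heads)]
W_o = random_matrix(d_model, d_model, rng)

h_out = multi_head_self_attention(H_prev, W_q, W_k, W_v, W_o)
print(len(h_out), len(h_out[0]))  # 4 8: same shape as the layer input
```

Each head works in a $d_{model}/h$-dimensional subspace, so concatenating the $h$ heads restores the model dimension before the output projection.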

In addition to multi-head self-attention, there are two other types of attention based on the source of queries and key-value pairs:

  • Masked Multi-Head Self-Attention. Within the decoder layers of the Transformer, the self-attention mechanism is constrained by introducing an attention mask, ensuring that the query at each position can only attend to the key-value pairs up to and including that position. To facilitate parallel training, this is typically implemented with a mask whose lower triangular part (including the diagonal) is set to 0 and whose remaining elements are set to $-\infty$. Consequently, each item attends only to its predecessors and itself. Formally, this modification of Equation 3 can be written as follows:

    $$\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{model}/h}} + \mathbf{M}_{mask}\right) \mathbf{V}, \tag{4}$$
    $$\mathbf{M}_{mask} = \left(m_{ij}\right)_{n \times n}, \quad m_{ij} = \begin{cases} 0 & \text{for } i \geq j \\ -\infty & \text{otherwise} \end{cases}, \tag{5}$$

    This form of self-attention is commonly denoted as autoregressive or causal attention (Lin et al., 2022).

  • Cross-Layer Multi-Head Self-Attention. The queries are derived from the outputs of the preceding (decoder) layer, while the keys and values are projected from the outputs of the encoder.
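The causal mask of Equations 4 and 5 can be sketched as follows. This is a minimal illustration in plain Python, in which the function names and the uniform toy scores are assumptions. Adding $-\infty$ to a score drives its softmax weight to exactly zero, so each row attends only to itself and to earlier positions.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Equation 5: 0 where i >= j (visible), -inf where i < j (future)."""
    return [[0.0 if i >= j else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax_rows(scores, mask):
    """Row-wise softmax of scores + mask; exp(-inf) evaluates to 0.0."""
    out = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        top = max(shifted)  # the diagonal is never masked, so top is finite
        exps = [math.exp(v - top) for v in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform toy scores make the effect of the mask easy to see.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax_rows(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Because the mask is a fixed matrix, all positions of a training sequence can be processed in one pass while the autoregressive constraint is preserved.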

(a) Encoder-Decoder Models (b) Decoder-only Models

Figure 2. The overview of large language models (LLMs) with encoder-decoder and decoder-only Transformer architecture for code generation, adapted from (Vaswani et al., 2017).

2.1.2. Position-wise Feed-Forward Networks

Within each Transformer layer, a Position-wise Feed-Forward Network (PFFN) is applied after the MHSA sub-layer to refine the sequence embeddings at each position $i$ in a separate and identical manner, thereby encoding more intricate feature representations. The PFFN is composed of a pair of linear transformations with a ReLU activation function in between. Formally,

$$\operatorname{PFFN}(h^{(l)}) = \left(\operatorname{Concat}\left\{\operatorname{FFN}(h^{(l)}_i)^{T}\right\}_{i=1}^{n}\right)^{T}, \tag{6}$$
$$\operatorname{FFN}(h^{(l)}_i) = \operatorname{ReLU}(h^{(l)}_i \mathbf{W}^{(1)} + b^{(1)}) \mathbf{W}^{(2)} + b^{(2)}, \tag{7}$$

where $h^{(l)} \in \mathbb{R}^{n \times d_{model}}$ is the output of the MHSA sub-layer in the $l$-th Transformer layer, and $h^{(l)}_i \in \mathbb{R}^{d_{model}}$ denotes the latent representation at each sequence position. The projection matrices $\mathbf{W}^{(1)}, (\mathbf{W}^{(2)})^{T} \in \mathbb{R}^{d_{model} \times 4d_{model}}$ and the bias vectors $b^{(1)} \in \mathbb{R}^{4d_{model}}$ and $b^{(2)} \in \mathbb{R}^{d_{model}}$ are parameters learned during training. These parameters are shared across all positions within a layer but are learned separately for each layer. In this context, $T$ represents the transpose operation on a matrix.
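Equations 6 and 7 can be sketched in a few lines of plain Python. The helper names, toy sizes, and random weights are assumptions made for illustration. The first bias is given the inner width $4d_{model}$ so that the addition inside the ReLU of Equation 7 is well defined.

```python
import random

def relu(vec):
    return [max(0.0, v) for v in vec]

def vec_mat(vec, mat):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(v * m for v, m in zip(vec, col)) for col in zip(*mat)]

def ffn(h_i, w1, b1, w2, b2):
    """Equation 7: ReLU(h_i W1 + b1) W2 + b2 for a single position."""
    hidden = relu([a + b for a, b in zip(vec_mat(h_i, w1), b1)])
    return [a + b for a, b in zip(vec_mat(hidden, w2), b2)]

def pffn(h, w1, b1, w2, b2):
    """Equation 6: apply the same FFN independently at every position."""
    return [ffn(h_i, w1, b1, w2, b2) for h_i in h]

rng = random.Random(0)
n, d_model = 3, 4      # toy sizes chosen for illustration
d_ff = 4 * d_model     # inner width of 4 * d_model, as in the text
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
b2 = [0.0] * d_model
h = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n)]

out = pffn(h, W1, b1, W2, b2)
print(len(out), len(out[0]))  # 3 4

# Position-wise: reordering the input rows only reorders the output rows.
print(pffn(h[::-1], W1, b1, W2, b2) == out[::-1])  # True
```

The final check shows what "position-wise" means in practice: no information moves between positions inside this sub-layer, which is left to the attention sub-layer.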

2.1.3. Residual Connection and Normalization

To alleviate the issue of vanishing or exploding gradients resulting from network deepening, the Transformer model incorporates a residual connection (He et al., 2016) around each of the aforementioned modules, combined with Layer Normalization (Ba et al., 2016). For the placement of the Layer Normalization operation, there are two widely used approaches: 1) Post-Norm: Layer normalization is applied after the element-wise residual addition, in accordance with the vanilla Transformer (Vaswani et al., 2017). 2) Pre-Norm: Layer normalization is applied to the input of each sub-layer, as seen in models like GPT-2 (Radford et al., 2019). Formally, the two variants can be formulated as:

$$\textbf{Post-Norm}: \quad \begin{aligned} \mathbf{H}^{(l)} &= \operatorname{LayerNorm}(\operatorname{PFFN}(\mathbf{h}^{(l)}) + \mathbf{h}^{(l)}), \\ \mathbf{h}^{(l)} &= \operatorname{LayerNorm}(\operatorname{MHSA}(\mathbf{H}^{(l-1)}) + \mathbf{H}^{(l-1)}) \end{aligned} \tag{8}$$
$$\textbf{Pre-Norm}: \quad \begin{aligned} \mathbf{H}^{(l)} &= \operatorname{PFFN}(\operatorname{LayerNorm}(\mathbf{h}^{(l)})) + \mathbf{h}^{(l)}, \\ \mathbf{h}^{(l)} &= \operatorname{MHSA}(\operatorname{LayerNorm}(\mathbf{H}^{(l-1)})) + \mathbf{H}^{(l-1)} \end{aligned} \tag{9}$$
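The difference between Equations 8 and 9 lies only in where the normalization sits relative to the residual addition, which the following sketch makes explicit. It is a simplified illustration: `toy_mhsa` and `toy_pffn` are hypothetical stand-ins for the real sub-layers, they act on a single position's vector, and the layer normalization omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def post_norm_layer(h_prev, mhsa, pffn):
    """Equation 8: normalize after each residual addition."""
    h = layer_norm(add(mhsa(h_prev), h_prev))
    return layer_norm(add(pffn(h), h))

def pre_norm_layer(h_prev, mhsa, pffn):
    """Equation 9: normalize each sub-layer input; the residual path is untouched."""
    h = add(mhsa(layer_norm(h_prev)), h_prev)
    return add(pffn(layer_norm(h)), h)

def toy_mhsa(x):
    """Hypothetical stand-in for the attention sub-layer."""
    return [0.5 * v for v in x]

def toy_pffn(x):
    """Hypothetical stand-in for the feed-forward sub-layer."""
    return [max(0.0, v) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_layer(x, toy_mhsa, toy_pffn)
pre = pre_norm_layer(x, toy_mhsa, toy_pffn)
print(abs(sum(post)) < 1e-9)  # True: the Post-Norm output is itself normalized
print(abs(sum(pre)) < 1e-9)   # False: Pre-Norm keeps an unnormalized residual stream
```

In the Pre-Norm variant the input passes through the layer along an identity path, whereas in the Post-Norm variant every layer output is renormalized.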

2.1.4. Positional Encoding

Given that self-attention alone cannot discern the positional information of each input token, the vanilla Transformer introduces an absolute positional encoding method to supplement this positional information, known as sinusoidal position embeddings (Vaswani et al., 2017). Specifically, for a token at position $pos$, the position embedding is defined as:

$$\mathbf{p}_{pos,2i} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{10}$$
$$\mathbf{p}_{pos,2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{11}$$

where $2i$ and $2i+1$ index the dimensions of the position embedding, while $d_{model}$ denotes the model dimension. Subsequently, each position embedding is added to the corresponding token embedding, and the sum is fed into the Transformer. Since the inception of this method, a variety of innovative positional encoding approaches have emerged, such as learnable embeddings (Devlin et al., 2018), relative position embeddings (Shaw et al., 2018), RoPE (Su et al., 2024a), and ALiBi (Press et al., 2021). For more detailed descriptions of each method, please consult (Lin et al., 2022; Zhao et al., 2023a).
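As a minimal sketch of Equations 10 and 11, the function below fills the even dimensions with sines and the odd dimensions with cosines. The function name, the toy dimension, and the constant token embedding are assumptions made for illustration, and $d_{model}$ is assumed to be even.

```python
import math

def sinusoidal_position_embedding(pos, d_model):
    """Equations 10-11: sine on even dimensions, cosine on odd dimensions."""
    p = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        p[2 * i] = math.sin(angle)
        p[2 * i + 1] = math.cos(angle)
    return p

d_model = 8  # toy model dimension, assumed even
table = [sinusoidal_position_embedding(pos, d_model) for pos in range(4)]
print(table[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# The position embedding is added to the token embedding before the first layer.
token_embedding = [0.1] * d_model  # hypothetical token embedding
model_input = [t + p for t, p in zip(token_embedding, table[1])]
```

Lower dimensions oscillate quickly with position and higher dimensions slowly, so each position receives a distinct pattern without any learned parameters.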

2.2. Code Generation

Large language models (LLMs) for code generation refer to the use of LLMs to generate source code from natural language descriptions, a process also known as the natural-language-to-code task. Typically, these natural language descriptions encompass programming problem statements (or docstrings) and may optionally include some programming context (e.g., function signatures, assertions, etc.). Formally, these natural language (NL) descriptions can be represented as $\mathbf{x}$. Given $\mathbf{x}$, the use of an LLM with model parameters $\theta$ to generate a code solution $\mathbf{y}$ can be denoted as $P_{\theta}(\mathbf{y} \mid \mathbf{x})$. To verify the functional correctness of the code solution, $\mathbf{y}$ is subsequently executed via a compiler or interpreter, represented as $\mathbf{Exe}(\cdot)$, on a suite of unit tests. The feedback from this execution can be denoted as $\mathbf{Feedback}(\mathbf{Exe}(\mathbf{y}))$.
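The execution step $\mathbf{Feedback}(\mathbf{Exe}(\mathbf{y}))$ can be illustrated with a small sketch that runs a candidate solution against assertion-style unit tests and returns a pass flag together with a feedback string. The function name, the feedback messages, and the toy candidates are assumptions made for illustration. The sketch uses Python's built-in `exec` without any isolation, which is unsafe for untrusted model output, so a real evaluation harness would need sandboxing and timeouts.

```python
def execute_with_tests(code, tests):
    """Run candidate code, then each assertion; return (passed, feedback)."""
    namespace = {}
    try:
        exec(code, namespace)  # Exe(y): load the candidate solution
    except Exception as err:
        return False, f"error while loading solution: {type(err).__name__}: {err}"
    for test in tests:
        try:
            exec(test, namespace)  # each unit test is a single assert statement
        except AssertionError:
            return False, f"failed test: {test}"
        except Exception as err:
            return False, f"error in test {test!r}: {type(err).__name__}: {err}"
    return True, "all tests passed"

# Two hypothetical candidates for the description "add two numbers".
candidate_ok = "def add(a, b):\n    return a + b\n"
candidate_bad = "def add(a, b):\n    return a - b\n"
unit_tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

print(execute_with_tests(candidate_ok, unit_tests))
# (True, 'all tests passed')
print(execute_with_tests(candidate_bad, unit_tests))
# (False, 'failed test: assert add(1, 2) == 3')
```

The returned string is the kind of signal that can be passed back to a model or used to filter candidates, while the boolean is what execution-based metrics aggregate.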

The advent of in-context learning abilities in LLMs (Wei et al., 2022a) has led to the appending of exemplars to the natural language description $\mathbf{x}$ as demonstrations to enhance code generation performance or constrain the generation format (Li et al., 2023d; Patel et al., 2023). A fixed set of $M$ exemplars is denoted as $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{M}$. Consequently, following (Ni et al., 2023a), a more general formulation of LLMs for code generation with few-shot (or zero-shot) exemplars can be revised as:

$$P_{\theta}(\mathbf{y} \mid \mathbf{x}) = P_{\theta}(\mathbf{y} \mid \operatorname{prompt}(\mathbf{x}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k})), \quad k \in \{0, 1, \dots, M\} \tag{12}$$

where $\operatorname{prompt}(\mathbf{x},\{(\mathbf{x_{i}},\mathbf{y_{i}})\}_{i=1}^{k})$ is a string representation of the overall input, and $\{(\mathbf{x_{i}},\mathbf{y_{i}})\}_{i=1}^{k}$ denotes a set of $k$ exemplars randomly selected from $\{(\mathbf{x_{i}},\mathbf{y_{i}})\}_{i=1}^{M}$. In particular, $k=0$ denotes zero-shot code generation, which is equivalent to vanilla generation without in-context learning. Subsequently, a variety of decoding strategies can be applied for code generation, including deterministic strategies (e.g., greedy search and beam search) and sampling-based strategies (e.g., temperature sampling, top-k sampling, and top-p (nucleus) sampling). For more detailed descriptions of each decoding strategy, see Holtzman et al. (2019).

$$\textbf{Greedy Search}:\ \mathbf{y^{*}}=\mathop{\mathrm{argmax}}_{\mathbf{y}}P_{\theta}\big(\mathbf{y}\mid\operatorname{prompt}(\mathbf{x},\{(\mathbf{x_{i}},\mathbf{y_{i}})\}_{i=1}^{k})\big),\quad k\in\{0,1,\dots,M\} \tag{13}$$

$$\textbf{Sampling}:\ \mathbf{y}\sim P_{\theta}\big(\mathbf{y}\mid\operatorname{prompt}(\mathbf{x},\{(\mathbf{x_{i}},\mathbf{y_{i}})\}_{i=1}^{k})\big),\quad k\in\{0,1,\dots,M\} \tag{14}$$
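The formulation above can be made concrete with a small Python sketch. It builds the prompt string from $k$ randomly selected exemplars and then contrasts a greedy decoding step with a sampling step. The prompt layout (`# Task:` headers), the helper names, and the hand-written next-token scores that stand in for a real model are all assumptions made for illustration. The sketch also works one token at a time, which is how greedy search and sampling are usually applied in practice.

```python
import math
import random
from typing import Dict, Optional, Sequence, Tuple

Exemplar = Tuple[str, str]  # one (x_i, y_i) pair


def build_prompt(x: str, pool: Sequence[Exemplar], k: int, rng: random.Random) -> str:
    """String form of prompt(x, {(x_i, y_i)}_{i=1}^k); k = 0 gives the zero-shot prompt."""
    shots = rng.sample(list(pool), k)  # k exemplars chosen at random from the M available
    parts = [f"# Task: {xi}\n{yi}\n" for xi, yi in shots]
    parts.append(f"# Task: {x}\n")
    return "\n".join(parts)


def softmax(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn raw scores into probabilities, optionally sharpened or flattened by temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(value - top) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy_step(logits: Dict[str, float]) -> str:
    """Deterministic choice: take the highest-scoring next token."""
    return max(logits, key=logits.get)


def sample_step(
    logits: Dict[str, float],
    rng: random.Random,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> str:
    """Stochastic choice with optional top-k and top-p (nucleus) truncation."""
    probs = sorted(softmax(logits, temperature).items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        probs = probs[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for tok, p in probs:  # keep the smallest prefix whose mass reaches top_p
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    tokens, weights = zip(*probs)
    return rng.choices(tokens, weights=weights, k=1)[0]  # weights are renormalised internally


# Hypothetical exemplar pool (M = 2) and toy next-token scores.
pool = [
    ("add two numbers", "def add(a, b):\n    return a + b"),
    ("negate a number", "def neg(a):\n    return -a"),
]
rng = random.Random(0)
zero_shot = build_prompt("square a number", pool, k=0, rng=rng)
one_shot = build_prompt("square a number", pool, k=1, rng=rng)
toy_logits = {"return": 2.0, "pass": 0.5, "print": 0.1}
greedy_token = greedy_step(toy_logits)
sampled_token = sample_step(toy_logits, rng, temperature=0.8, top_k=2)
```

With `k=0` the prompt contains only the target task, matching the zero-shot case. Greedy decoding returns the same token on every call, whereas the sampled token varies with the random state and the truncation settings.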

3. Taxonomy

The recent surge in the development of Large Language Models (LLMs) has led to a significant number of these models being repurposed for the code generation task through continued pre-training or fine-tuning. This trend is particularly observable in the realm of open-source models. For instance, Meta AI initially made the LLaMA (Touvron et al., 2023a) model publicly available, which was followed by the release of Code LLaMA (Roziere et al., 2023), designed specifically for code generation. Similarly, DeepSeek LLM (Bi et al., 2024a), developed and released by DeepSeek, has been extended to create DeepSeek Coder (Guo et al., 2024), a variant tailored for code generation. The Qwen team has developed and released Code Qwen (Team, 2024), building on their original Qwen (Bai et al., 2023) model. Microsoft, for its part, has unveiled WizardLM (Xu et al., 2023) and is exploring its coding-oriented counterpart, WizardCoder (Luo et al., 2023). Google has joined the fray by releasing Gemma (Team et al., 2024), subsequently followed by Code Gemma (CodeGemma Team et al., 2024). Beyond simply adapting general-purpose LLMs for code-related tasks, there has been a proliferation of models specifically engineered for code generation. Notable examples include StarCoder (Li et al., 2023a), OctoCoder (Muennighoff et al., 2023), and CodeGen (Nijkamp et al., 2022). These models underscore the trend of LLMs being developed with a focus on code generation.

Recognizing the importance of these developments, we propose a taxonomy that categorizes and evaluates the latest advances in LLMs for code generation. This taxonomy, depicted in Figure 3, serves as a comprehensive reference for researchers seeking to quickly familiarize themselves with the state-of-the-art in this dynamic field.

In the subsequent sections, we will provide an in-depth analysis of each category related to code generation. This will encompass a definition of the problem, the challenges to be addressed, and a comparison of the most prominent models and their performance evaluation.

LLMs for Code Generation

  • Data Curation
      • Pre-training: CodeSearchNet (Husain et al., 2019), Google BigQuery (Hoffa, 2016), The Pile (Gao et al., 2020), CodeParrot (Tunstall et al., 2022), GitHub Code (Tunstall et al., 2022), ROOTS (Laurençon et al., 2022), The Stack (Kocetkov et al., 2022), The Stack v2 (Lozhkov et al., 2024)
      • Instruction Tuning: CommitPackFT (Muennighoff et al., 2023), Code Alpaca (Chaudhary, 2023), OA-Leet (Computations, 2023), OSS-Instruct (Wei et al., 2023), Evol-instruction (Roshdieh, 2023), Self-OSS-Instruct-SC2-Exec-Filter (Yuxiang Wei, 2024)
      • Benchmarks
          • General: HumanEval (Chen et al., 2021), HumanEval+ (Liu et al., 2024b), HumanEvalPack (Muennighoff et al., 2023), MBPP (Austin et al., 2021), MBPP+ (Liu et al., 2024b), CoNaLa (Yin et al., 2018), Spider (Yu et al., 2018), CONCODE (Iyer et al., 2018), ODEX (Wang et al., 2022d), CoderEval (Yu et al., 2024), ReCode (Wang et al., 2022c), StudentEval (Babe et al., 2023)
          • Competitions: APPS (Hendrycks et al., 2021), CodeContests (Li et al., 2022a)
          • Data Science: DSP (Chandel et al., 2022), DS-1000 (Lai et al., 2023), ExeDS (Huang et al., 2022)
          • Multilingual: MBXP (Athiwaratkun et al., 2022), Multilingual HumanEval (Athiwaratkun et al., 2022), HumanEval-X (Zheng et al., 2023c), MultiPL-E (Cassano et al., 2022), xCodeEval (Khan et al., 2023)
          • Reasoning: MathQA-X (Athiwaratkun et al., 2022), MathQA-Python (Austin et al., 2021), GSM8K (Cobbe et al., 2021), GSM-HARD (Gao et al., 2023a)
          • Repository: RepoEval (Zhang et al., 2023c), Stack-Repo (Shrivastava et al., 2023a), RepoBench (Liu et al., 2023b), EvoCodeBench (Li et al., 2024), SWE-bench (Jimenez et al., 2023), CrossCodeEval (Ding et al., 2024), SketchEval (Zan et al., 2024)

  • Recent Advances
      • Data Synthesis: Self-Instruct (Wang et al., 2023a), Evol-Instruct (Xu et al., 2023), Phi-1 (Gunasekar et al., 2023), Code Alpaca (Chaudhary, 2023), WizardCoder (Luo et al., 2023), Magicoder (Wei et al., 2023), StarCoder2-Instruct (Yuxiang Wei, 2024)
      • Pre-training
          • Model Architectures
              • Encoder-Decoder: PyMT5 (Clement et al., 2020), PLBART (Ahmad et al., 2021), CodeT5 (Wang et al., 2021), JuPyT5 (Chandel et al., 2022), AlphaCode (Li et al., 2022a), CodeRL (Le et al., 2022), ERNIE-Code (Chai et al., 2022), PPOCoder (Shojaee et al., 2023), CodeT5+ (Wang et al., 2023b), CodeFusion (Singh et al., 2023), AST-T5 (Gong et al., 2024)
              • Decoder-Only: GPT-C (Svyatkovskiy et al., 2020), GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021), Codex (Chen et al., 2021), CodeGPT (Lu et al., 2021), CodeParrot (Tunstall et al., 2022), PolyCoder (Xu et al., 2022), CodeGen (Nijkamp et al., 2022), GPT-NeoX (Black et al., 2022), PaLM-Coder (Chowdhery et al., 2023), InCoder (Fried et al., 2022), PanGu-Coder (Christopoulou et al., 2022), PyCodeGPT (Zan et al., 2022), CodeGeeX (Zheng et al., 2023c), BLOOM (Le Scao et al., 2023), ChatGPT (OpenAI, 2022), SantaCoder (Allal et al., 2023), LLaMA (Touvron et al., 2023a), GPT-4 (Achiam et al., 2023), CodeGen2 (Nijkamp et al., 2023), replit-code (Replit, 2023), StarCoder (Li et al., 2023a), WizardCoder (Luo et al., 2023), phi-1 (Gunasekar et al., 2023), ChainCoder (Zheng et al., 2023b), CodeGeeX2 (Zheng et al., 2023c), PanGu-Coder2 (Shen et al., 2023), Llama 2 (Touvron et al., 2023b), OctoPack (Muennighoff et al., 2023), Code Llama (Roziere et al., 2023), MFTCoder (Liu et al., 2023a), phi-1.5 (Li et al., 2023b), CodeShell (Xie et al., 2024), Magicoder (Wei et al., 2023), AlphaCode 2 (AlphaCode Team, 2023), StableCode (Pinnaparaju et al., 2024), WaveCoder (Yu et al., 2023), phi-2 (Mojan Javaheripi, 2023), DeepSeek-Coder (Guo et al., 2024), StepCoder (Dou et al., 2024), OpenCodeInterpreter (Zheng et al., 2024b), StarCoder 2 (Lozhkov et al., 2024), Claude 3 (Anthropic, 2024), ProCoder (Bi et al., 2024b), CodeGemma (CodeGemma Team et al., 2024), CodeQwen (Team, 2024), Llama3 (Meta, 2024), StarCoder2-Instruct (Yuxiang Wei, 2024)
          • Pre-training Tasks: CLM (Li et al., 2023a; Luo et al., 2023; Wei et al., 2023; Guo et al., 2024), DAE (Ahmad et al., 2021; Wang et al., 2021, 2023b), Auxiliary (Wang et al., 2021; Chai et al., 2022; Wang et al., 2023b)
      • Fine-tuning
          • Instruction Tuning
              • Full Parameter Fine-tuning: Code Alpaca (Chaudhary, 2023), CodeT5+ (Wang et al., 2021), WizardCoder (Luo et al., 2023), StarCoder (Li et al., 2023a), PanGu-Coder2 (Shen et al., 2023), OctoPack (Muennighoff et al., 2023), CodeGeeX2 (Zheng et al., 2023c), Magicoder (Wei et al., 2023), CodeGemma (CodeGemma Team et al., 2024), StarCoder2-Instruct (Yuxiang Wei, 2024)
              • Parameter Efficient Fine-tuning: CodeUp (Jiang and Kim, 2023), ASTRAIOS (Zhuo et al., 2024)
          • Reinforcement Learning with Feedback: CodeRL (Le et al., 2022), CompCoder (Wang et al., 2022a), PPOCoder (Shojaee et al., 2023), RLTF (Liu et al., 2023d), PanGu-Coder2 (Shen et al., 2023), StepCoder (Dou et al., 2024)
      • Prompting Engineering: Reflexion (Shinn et al., 2024), LATS (Zhou et al., 2023), Self-Debugging (Chen et al., 2023), SelfEvolve (Jiang et al., 2023), Olausson et al. (Olausson et al., 2023), CodeT (Chen et al., 2022b), LEVER (Ni et al., 2023a), AlphaCodium (Ridnik et al., 2024)
      • Repository Level & Long Context: RepoCoder (Zhang et al., 2023c), CoCoMIC (Ding et al., 2022b), RepoHyper (Phan et al., 2024), RLPG (Shrivastava et al., 2023b), Repoformer (Wu et al., 2024), RepoFusion (Shrivastava et al., 2023a), ToolGen (Wang et al., 2024c), CodePlan (Bairi et al., 2023), CodeS (Zan et al., 2024)
      • Retrieval Augmented: HGNN (Liu et al., 2020), REDCODER (Parvez et al., 2021), ReACC (Lu et al., 2022), DocPrompting (Zhou et al., 2022a), RepoCoder (Zhang et al., 2023c), Su et al. (Su et al., 2024b)
      • Autonomous Coding Agents: AgentCoder (Huang et al., 2023), MetaGPT (Hong et al., 2023), CodeAct (Wang et al., 2024a), AutoCodeRover (Zhang et al., 2024), Devin (Cognition, 2024), OpenDevin (OpenDevin, 2024), SWE-agent (John Yang, 2024), L2MAC (Holt et al., 2023), OpenDevin CodeAct 1.0 (Xingyao Wang and Neubig, 2024)

  • Evaluation
      • Metrics: Exact Match, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), CodeBLEU (Ren et al., 2020), pass@k (Chen et al., 2021), n@k (Li et al., 2022a), test case average (Hendrycks et al., 2021), execution accuracy (Rajkumar et al., 2022), pass@t (Olausson et al., 2023), perplexity (Jelinek et al., 1977)
      • Human Evaluation: CodePlan (Bairi et al., 2023), RepoFusion (Shrivastava et al., 2023a), CodeBLEU (Ren et al., 2020)
      • LLM-as-a-Judge: AlpacaEval (Li et al., 2023c), MT-bench (Zheng et al., 2024a), ICE-Score (Zhuo, 2024)

  • Application: GitHub Copilot (Chen et al., 2021), CodeGeeX (Zheng et al., 2023c), CodeWhisperer (Amazon, 2022), Codeium (Codeium, 2023), CodeArts Snap (Shen et al., 2023), TabNine (TabNine, 2018), Replit (Replit, 2016)

Figure 3. Taxonomy of large language models (LLMs) for code generation.

4. Large Language Models for Code Generation

Large language models (LLMs) with Transformer architecture have revolutionized a multitude of fields, and their application in code generation has been particularly impactful. These models follow a comprehensive process that starts with the curation and synthesis of code data, followed by a structured training approach that includes pre-training and fine-tuning, and the use of sophisticated prompt engineering techniques. Recent advancements have seen the integration of repository-level and retrieval-augmented code generation, as well as the development of autonomous coding agents. Furthermore, the evaluation of coding abilities of LLMs has become a critical component of this research area.

In the forthcoming sections, we will explore these dimensions of LLMs in the context of code generation in detail. Section 4.1 will address the data curation and processing strategies employed throughout the various stages of LLM development. Section 4.2 will discuss data synthesis methods designed to mitigate the scarcity of high-quality data. Section 4.3 will outline the prevalent model architectures used in LLMs for code generation. Moving to Section 4.4, we will examine the techniques for full parameter fine-tuning and parameter-efficient fine-tuning, which are essential for tailoring LLMs to the code generation task. Section 4.5 will shed light on enhancing code quality through reinforcement learning with feedback. Section 4.6 will delve into the strategic use of prompts to maximize the coding capabilities of LLMs. The approaches of repository-level and retrieval-augmented code generation will be elaborated in Sections 4.7 and 4.8, respectively. Additionally, Section 4.9 will discuss autonomous coding agents. Lastly, Section 4.11 will provide insights into some of the practical applications that leverage LLMs for code generation, demonstrating the real-world impact of these models. Through this comprehensive exploration, we aim to highlight the significance and potential of LLMs within the domain of automated code generation.

4.1. Data Curation & Processing

The exceptional performance of Large Language Models (LLMs) can be attributed to their training on large-scale and diverse datasets (Zan et al., 2023). Meanwhile, the extensive parameters of these models necessitate substantial data to unlock their full potential, in line with established scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). For a general-purpose LLM, amassing a large-scale corpus of natural language from a variety of sources is imperative. Such sources include webpages, conversation data, books and news, scientific data, and code (Brown et al., 2020; Chowdhery et al., 2023; Bai et al., 2023; Touvron et al., 2023a, b; Yoo et al., 2024). These data are often crawled from the web and must undergo meticulous and aggressive pre-processing (Raffel et al., 2020; Zhang et al., 2023b). Fortunately, multiple platforms and websites offer large-scale, open-source, and permissively licensed code corpora, such as GitHub (https://github.com) and Stack Overflow (https://stackoverflow.com). Notably, the number of stars or forks of GitHub repositories has emerged as a valuable metric for filtering high-quality code datasets. In a similar vein, the number of votes on Stack Overflow can serve to identify the most relevant and highest-quality answers.

Nonetheless, raw datasets are frequently laden with redundant and noisy data as well as personal information, such as the names and email addresses of repository contributors, which raises concerns regarding privacy leakage (Carlini et al., 2021; Laurençon et al., 2022; Al-Kaswan et al., 2024). Consequently, it is essential to undertake rigorous data-cleaning procedures. Typically, this process encompasses exact match deduplication, code data filtering based on average line length and a defined threshold for the fraction of alphanumeric characters, the removal of auto-generated files through keyword searches, and the removal of personal user data (Tunstall et al., 2022; Kocetkov et al., 2022). The standard data preprocessing workflow is depicted in Figure 4.
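As a rough illustration of these cleaning steps, the following Python sketch chains exact-match deduplication, the two heuristic filters, keyword-based removal of auto-generated files, and email redaction. The thresholds, keyword list, regular expression, and function names are illustrative assumptions and are not the values used by the cited works.

```python
import hashlib
import re
from typing import Iterable, Iterator

# Illustrative settings (assumptions, not values taken from the cited datasets).
MAX_AVG_LINE_LENGTH = 100
MIN_ALNUM_FRACTION = 0.25
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated", "do not edit")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def avg_line_length(text: str) -> float:
    lines = text.splitlines() or [""]
    return sum(len(line) for line in lines) / len(lines)


def alnum_fraction(text: str) -> float:
    return sum(ch.isalnum() for ch in text) / max(len(text), 1)


def is_autogenerated(text: str, head_lines: int = 5) -> bool:
    """Keyword search restricted to the first few lines of the file."""
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def clean_corpus(files: Iterable[str]) -> Iterator[str]:
    """Yield files that survive deduplication and filtering, with emails redacted."""
    seen = set()
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact-match deduplication
            continue
        seen.add(digest)
        if avg_line_length(text) > MAX_AVG_LINE_LENGTH:  # e.g. minified or data files
            continue
        if alnum_fraction(text) < MIN_ALNUM_FRACTION:  # mostly symbols
            continue
        if is_autogenerated(text):
            continue
        yield EMAIL_PATTERN.sub("<EMAIL>", text)  # remove personal data


raw_files = [
    "# author: dev@example.com\ndef f(x):\n    return x + 1\n",
    "# author: dev@example.com\ndef f(x):\n    return x + 1\n",  # exact duplicate
    "# This file is automatically generated. DO NOT EDIT.\nx = 1\n",
    "x = '" + "a" * 500 + "'\n",  # one very long line
    "#" * 10 + "\n" + "-" * 40 + "\n",  # almost no alphanumeric characters
]
cleaned = list(clean_corpus(raw_files))
```

Only the first file survives in this toy corpus, and its contributor email is replaced with a placeholder. Production pipelines often add near-duplicate detection and more thorough detection of personal data, which are outside the scope of this sketch.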

The development of a proficient LLM for code generation necessitates the utilization of various types of code data at different developmental stages. Therefore, we categorize code data into three distinct classes: pre-training datasets, instruction-tuning datasets, and benchmarks for performance evaluation. The subsequent subsections will provide a detailed illustration of code data within each classification.

Figure 4. A diagram depicting the standard data preprocessing workflow utilized in the pre-training phase of large language models (LLMs) for code generation.

4.1.1. Pre-training

The remarkable success of bidirectional pre-trained language models (PLMs) such as BERT (Devlin et al., 2018) and unidirectional PLMs like GPT (Radford et al., 2018) has firmly established the practice of pre-training on large-scale unlabeled datasets to endow models with a broad spectrum of general knowledge. Extending this principle to the realm of code generation enables Large Language Models (LLMs) to assimilate fundamental coding principles, including the understanding of code structure dependencies, the semantics of code identifiers, and the intrinsic logic of code sequences (Chen et al., 2021; Wang et al., 2021; Guo et al., 2022; Wang et al., 2023b). In light of this advancement, there has been a proliferation of large-scale unlabeled code datasets proposed to serve as the foundational training ground for LLMs to develop coding proficiency. A brief introduction of these datasets is as follows, with the statistics available in Table 1.

  • CodeSearchNet (Husain et al., 2019): the CodeSearchNet corpus is a comprehensive dataset consisting of 2 million (comment, code) pairs from open-source repositories on GitHub. It includes code and documentation in six programming languages: Go, Java, PHP, Python, JavaScript, and Ruby. The dataset was primarily compiled to promote research into the problem of code retrieval using natural language.

  • Google BigQuery (Hoffa, 2016): the Google BigQuery Public Datasets program offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery.

  • The Pile (Gao et al., 2020): the Pile is an 825 GiB diverse and open source language modeling dataset aggregating 22 smaller, high-quality datasets including GitHub, Books3, and Wikipedia (en). It aims to encompass text from as many modalities as possible, thereby facilitating the development of models with broader generalization capabilities. For code generation, the GitHub component is specifically utilized.

  • CodeParrot (Tunstall et al., 2022): the CodeParrot dataset contains the Python files used to train the code generation model in Chapter 10, "Training Transformers from Scratch", of the book "NLP with Transformers" (Tunstall et al., 2022). Created from the GitHub dataset available via Google's BigQuery, the CodeParrot dataset includes approximately 22 million Python files and totals 180 GB (50 GB compressed).

  • GitHub Code (Tunstall et al., 2022): the GitHub Code dataset comprises 115M code files derived from GitHub, spanning 32 programming languages and 60 extensions and totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

  • ROOTS (Laurençon et al., 2022): the BigScience ROOTS Corpus is a 1.6TB dataset spanning 59 languages that was used to train the 176B BigScience Large Open-science Open-access Multilingual (BLOOM) language model. For the code generation task, the code subset of the ROOTS Corpus is specifically utilized.

  • The Stack (Kocetkov et al., 2022): the Stack contains over 6TB of permissively licensed source code files that cover 358 programming languages. The dataset was compiled as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).

  • The Stack v2 (Lozhkov et al., 2024): The Stack v2, a dataset created as part of the BigCode Project, contains over 3B files across more than 600 programming and markup languages. The dataset is derived from the Software Heritage archive (https://archive.softwareheritage.org), the largest public archive of software source code and accompanying development history.

Table 1. The statistics of some commonly-used pre-training datasets for large language models (LLMs) aimed at code generation. The column labeled ‘#PL’ indicates the number of programming languages included in each dataset. It should be noted that in the CodeSearchNet (Husain et al., 2019) dataset, each file represents a function, and for the Pile (Gao et al., 2020) and ROOTS (Laurençon et al., 2022) datasets, only the code components are considered.
| Dataset | Size (GB) | Files (M) | #PL | Date | Link |
|---|---|---|---|---|---|
| CodeSearchNet (Husain et al., 2019) | 20 | 6.5 | 6 | 2022-01 | https://huggingface.co/datasets/code_search_net |
| Google BigQuery (Hoffa, 2016) | - | - | - | 2016-06 | github-on-bigquery-analyze-all-the-open-source-code |
| The Pile (Gao et al., 2020) | 95 | 19 | - | 2022-01 | https://huggingface.co/datasets/EleutherAI/pile |
| CodeParrot (Tunstall et al., 2022) | 180 | 22 | 1 | 2021-08 | https://huggingface.co/datasets/transformersbook/codeparrot |
| GitHub Code (Tunstall et al., 2022) | 1,024 | 115 | 32 | 2022-02 | https://huggingface.co/datasets/codeparrot/github-code |
| ROOTS (Laurençon et al., 2022) | 163 | 15 | 13 | 2023-03 | https://huggingface.co/bigscience-data |
| The Stack (Kocetkov et al., 2022) | 3,136 | 317 | 30 | 2022-10 | https://huggingface.co/datasets/bigcode/the-stack |
| The Stack v2 (Lozhkov et al., 2024) | 32K | 3K | 619 | 2024-04 | https://huggingface.co/datasets/bigcode/the-stack-v2 |

4.1.2. Instruction Tuning

Instruction tuning refers to the process of fine-tuning large language models (LLMs) on a collection of datasets structured as instructions. This method has demonstrated considerable improvements in model performance and in generalization to unseen tasks, as evidenced by recent studies (Ouyang et al., 2022; Chung et al., 2024). Given these benefits, instruction tuning has been extended to coding domains, especially code generation, which involves automatically generating the intended code from a natural language description. The promise of instruction tuning in this area has led numerous researchers to develop large-scale instruction-tuning datasets tailored for code generation. Below, we provide an overview of several notable datasets of this kind, with their respective statistics detailed in Table 2.

  • CodeAlpaca-20k (Chaudhary, 2023): CodeAlpaca-20k is a collection of 20K instruction-following examples generated with the Self-Instruct data synthesis technique outlined in (Wang et al., 2023a), modified to target code generation, editing, and optimization tasks instead of general tasks.

  • CommitPackFT (Muennighoff et al., 2023): CommitPackFT is a 2GB refined version of CommitPack. It is filtered to only include high-quality commit messages that resemble natural language instructions.

  • Evol-Instruct-Code-80k (Roshdieh, 2023): Evol-Instruct-Code-80k is an open-source implementation of Evol-Instruct-Code described in the WizardCoder paper (Luo et al., 2023), which enhances the fine-tuning effect of pre-trained code large models by adding complex code instructions.

  • Magicoder-OSS-Instruct-75k (Wei et al., 2023): Magicoder-OSS-Instruct-75k is a synthetic dataset of 75k examples generated through OSS-Instruct with gpt-3.5-turbo-1106 and used to train both the Magicoder and Magicoder-S series models.

  • Self-OSS-Instruct-SC2-Exec-Filter-50k (Yuxiang Wei, 2024): Self-OSS-Instruct-SC2-Exec-Filter-50k is generated by StarCoder2-15B using the OSS-Instruct (Wei et al., 2023) data synthesis approach. It was subsequently used to fine-tune StarCoder2-15B without any human annotations or data distilled from large proprietary LLMs.
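To make the notion of data "structured as instructions" concrete, the sketch below turns one record into a prompt and completion pair for supervised fine-tuning. The field names (`instruction`, `input`, `output`), the section headers in the template, and the sample record are assumptions for illustration. Each dataset above defines its own schema and template.

```python
from typing import Dict


def format_example(record: Dict[str, str]) -> Dict[str, str]:
    """Build a (prompt, completion) pair from one instruction-tuning record."""
    prompt = f"### Instruction:\n{record['instruction']}\n\n"
    if record.get("input"):  # optional extra context, such as code to edit
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": record["output"]}


# Hypothetical record in the assumed schema.
record = {
    "instruction": "Write a Python function that returns the square of a number.",
    "input": "",
    "output": "def square(x):\n    return x * x\n",
}
example = format_example(record)
```

During fine-tuning, the training loss is commonly computed only on the completion tokens, so that the model learns to produce the response rather than to reproduce the instruction.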

Table 2. The statistics of several representative datasets used in instruction-tuning large language models (LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages encompassed by each dataset.
| Dataset | Size | #PL | Date | Link |
|---|---|---|---|---|
| CodeAlpaca-20K (Chaudhary, 2023) | 20k | - | 2023-03 | https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k |
| CommitPackFT (Muennighoff et al., 2023) | 2GB | 277 | 2023-08 | https://huggingface.co/datasets/bigcode/commitpackft |
| Evol-Instruct-Code-80k (Roshdieh, 2023) | 80k | - | 2023-07 | https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1 |
| Magicoder-OSS-Instruct-75k (Wei et al., 2023) | 75k | Python, Shell, TypeScript, C++, Rust, PHP, Java, Swift, C# | 2023-12 | https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K |
| Self-OSS-Instruct-SC2-Exec-Filter-50k (Yuxiang Wei, 2024) | 50k | Python | 2024-04 | https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k |

4.1.3. Benchmarks

To rigorously assess the efficacy of Large Language Models (LLMs) for code generation, the research community has introduced a variety of high-quality benchmarks in recent years. Building on the foundational work by (Chen et al., 2021), numerous variations of the HumanEval dataset and additional benchmarks have emerged, aiming to evaluate a broader spectrum of code generation capabilities in LLMs. We roughly divide these benchmarks into six distinct categories based on their application contexts, including general-purpose, competitive programming, data science, multilingual, logical reasoning, and repository-level. The statistics for these benchmarks are presented in Table 3.

General

  • HumanEval (Chen et al., 2021): HumanEval comprises 164 manually scripted Python programming problems, each featuring a function signature, docstring, body, and multiple unit tests.

  • HumanEval+ (Liu et al., 2024b): HumanEval+ extends the original HumanEval (Chen et al., 2021) benchmark by increasing the scale of the test cases by 80 times. As the test cases increase, HumanEval+ can catch significant amounts of previously undetected incorrect code synthesized by LLMs.

  • HumanEvalPack (Muennighoff et al., 2023): HumanEvalPack expands HumanEval (Chen et al., 2021) to cover three coding tasks, namely code synthesis, code repair, and code explanation, across six programming languages.

  • MBPP (Austin et al., 2021): MBPP is a collection of 974 crowd-sourced Python programming problems designed for entry-level programmers. Each problem comes with an English task description, a code solution, and three automated test cases.

  • MBPP+ (Liu et al., 2024b): MBPP+ enhances MBPP (Austin et al., 2021) by eliminating ill-formed problems and rectifying problems with incorrect implementations. The test scale of MBPP+ is also expanded by 35 times for test augmentation.

  • CoNaLa (Yin et al., 2018): CoNaLa contains almost 597K data samples for evaluating Python code generation. The curated part of CoNaLa is crawled from Stack Overflow, automatically filtered, and then curated by annotators. The mined part of CoNaLa is automatically mined, with almost 600k examples.

  • Spider (Yu et al., 2018): Spider is a large-scale complex text-to-SQL dataset covering 138 different domains. It has over 10K questions and 5.6K complex SQL queries on 200 databases. This dataset aims to test a model’s ability to generalize to new SQL queries, database schemas, and domains.

  • CONCODE (Iyer et al., 2018): CONCODE is a dataset with over 100K samples consisting of Java classes from public GitHub repositories. It provides near zero-shot conditions that can test the model’s ability to generalize to unseen natural language tokens with unseen environments.

  • ODEX (Wang et al., 2022d): ODEX is an open-domain dataset focused on the execution-based generation of Python code from natural language. It features 945 pairs of natural language queries and their corresponding Python code, all extracted from StackOverflow forums.

  • CoderEval (Yu et al., 2024): CoderEval is a pragmatic code generation benchmark that includes 230 Python and 230 Java code generation problems. It can be used to evaluate the model performance in generating pragmatic code beyond just generating standalone functions.

  • ReCode (Wang et al., 2022c): ReCode serves as a comprehensive robustness evaluation benchmark. It applies perturbations to docstrings, function and variable names, code syntax, and code format, thereby providing a multifaceted assessment of a model’s robustness.

  • StudentEval (Babe et al., 2023): StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80 students who have only completed a one-semester Python programming class. Unlike many other benchmarks, it has multiple prompts per problem and multiple attempts by the same participant. Each problem is also accompanied by a set of instructor-written test cases.

Competitions

  • APPS (Hendrycks et al., 2021): The APPS benchmark is composed of 10K Python problems, spanning three levels of difficulty: introductory, interview, and competition. Each entry in the dataset includes a programming problem described in English, corresponding ground truth Python solutions, and test cases defined by their inputs and outputs or by function names if provided.

  • CodeContests (Li et al., 2022a): CodeContests is a competitive programming dataset consisting of samples from various sources including Aizu, AtCoder, CodeChef, Codeforces, and HackerEarth. The dataset encompasses programming problems accompanied by test cases in the form of paired inputs and outputs, along with both correct and incorrect human solutions in multiple programming languages.

Data Science

  • DSP (Chandel et al., 2022): DSP allows for model evaluation based on real data science pedagogical notebooks. It includes well-structured problems, along with unit tests to verify the correctness of solutions and a Docker environment for reproducible execution.

  • DS-1000 (Lai et al., 2023): DS-1000 has 1K data science questions covering seven Python libraries, namely NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, and Matplotlib. The DS-1000 benchmark features: (1) realistic problems with diverse contexts, (2) implementation of multi-criteria evaluation metrics, and (3) defense against memorization.

  • ExeDS (Huang et al., 2022): ExeDS is a data science code generation dataset specifically designed for execution evaluation. It contains 534 problems with execution outputs from Jupyter Notebooks, as well as 123K examples for training and validation.

Multilingual

  • MBXP (Athiwaratkun et al., 2022): MBXP is a multilingual adaptation of the original MBPP (Austin et al., 2021) dataset. It is created using a framework that translates prompts and test cases from the original Python datasets into the corresponding data in the targeted programming language.

  • Multilingual HumanEval (Athiwaratkun et al., 2022): Multilingual HumanEval is a dataset derived from HumanEval (Chen et al., 2021). It is designed to assess the performance of models in a multilingual context. It helps uncover the generalization ability of the given model on languages that are out-of-domain.

  • HumanEval-X (Zheng et al., 2023c): HumanEval-X is developed for evaluating the multilingual ability of code generation models with 820 hand-written data samples in Python, C++, Java, JavaScript, and Go.

  • MultiPL-E (Cassano et al., 2022): MultiPL-E is a dataset for evaluating LLMs for code generation across 18 programming languages. It adopts the HumanEval (Chen et al., 2021) and the MBPP (Austin et al., 2021) Python benchmarks and uses little compilers to translate them to other languages.

  • xCodeEval (Khan et al., 2023): xCodeEval is an executable multilingual multitask benchmark consisting of 25M examples covering 17 programming languages. Its tasks include code understanding, generation, translation, and retrieval.

Reasoning

  • MathQA-X (Athiwaratkun et al., 2022): MathQA-X is the multilingual version of MathQA (Amini et al., 2019). It is generated by utilizing a conversion framework that converts samples from Python datasets into the target language.

  • MathQA-Python (Austin et al., 2021): MathQA-Python is a Python version of the MathQA benchmark (Amini et al., 2019). The benchmark, containing more than 23K problems, is designed to assess the capability of models to synthesize code from complex textual descriptions.

  • GSM8K (Cobbe et al., 2021): GSM8K is a dataset of 8.5K linguistically diverse grade school math problems. The dataset is crafted to facilitate the task of question answering on basic mathematical problems that require multi-step reasoning.

  • GSM-HARD (Gao et al., 2023a): GSM-HARD is a more challenging version of the GSM8K (Cobbe et al., 2021) dataset. It replaces the numbers in the GSM8K questions with larger, less common numbers, thereby increasing the complexity and difficulty level of the problems.

Table 3. Detailed statistics of benchmarks commonly used in evaluating large language models (LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages included in each dataset. For the sake of brevity, we list the programming languages (PLs) by name for benchmarks that support five or fewer PLs. For benchmarks with six or more PLs, we provide only a numerical count of the PLs supported.
| Scenario | Benchmark | Size | #PL | Date | Link |
|---|---|---|---|---|---|
| General | HumanEval (Chen et al., 2021) | 164 | Python | 2021-07 | https://huggingface.co/datasets/openai_humaneval |
| General | HumanEval+ (Liu et al., 2024b) | 164 | Python | 2023-05 | https://huggingface.co/datasets/evalplus/humanevalplus |
| General | HumanEvalPack (Muennighoff et al., 2023) | 164 | 6 | 2023-08 | https://huggingface.co/datasets/bigcode/humanevalpack |
| General | MBPP (Austin et al., 2021) | 974 | Python | 2021-08 | https://huggingface.co/datasets/mbpp |
| General | MBPP+ (Liu et al., 2024b) | 378 | Python | 2023-05 | https://huggingface.co/datasets/evalplus/mbppplus |
| General | CoNaLa (Yin et al., 2018) | 596.88K | Python | 2018-05 | https://huggingface.co/datasets/neulab/conala |
| General | Spider (Yu et al., 2018) | 8,034 | SQL | 2018-09 | https://huggingface.co/datasets/xlangai/spider |
| General | CONCODE (Iyer et al., 2018) | 104K | Java | 2018-08 | https://huggingface.co/datasets/AhmedSSoliman/CONCOD |
| General | ODEX (Wang et al., 2022d) | 945 | Python | 2022-12 | https://huggingface.co/datasets/neulab/odex |
| General | CoderEval (Yu et al., 2024) | 460 | Python, Java | 2023-02 | https://github.com/CoderEval/CoderEval |
| General | ReCode (Wang et al., 2022c) | 1,138 | Python | 2022-12 | https://github.com/amazon-science/recode |
| General | StudentEval (Babe et al., 2023) | 1,749 | Python | 2023-06 | https://huggingface.co/datasets/wellesley-easel/StudentEval |
| Competitions | APPS (Hendrycks et al., 2021) | 10,000 | Python | 2021-05 | https://huggingface.co/datasets/codeparrot/apps |
| Competitions | CodeContests (Li et al., 2022a) | 13,610 | C++, Python, Java | 2022-02 | https://huggingface.co/datasets/deepmind/code_contests |
| Data Science | DSP (Chandel et al., 2022) | 1,119 | Python | 2022-01 | https://github.com/microsoft/DataScienceProblems |
| Data Science | DS-1000 (Lai et al., 2023) | 1,000 | Python | 2022-11 | https://huggingface.co/datasets/xlangai/DS-1000 |
| Data Science | ExeDS (Huang et al., 2022) | 534 | Python | 2022-11 | https://github.com/Jun-jie-Huang/ExeDS |
| Multilingual | MBXP (Athiwaratkun et al., 2022) | 12.4K | 13 | 2022-10 | https://huggingface.co/datasets/mxeval/mbxp |
| Multilingual | Multilingual HumanEval (Athiwaratkun et al., 2022) | 1.9K | 12 | 2022-10 | https://huggingface.co/datasets/mxeval/multi-humaneval |
| Multilingual | HumanEval-X (Zheng et al., 2023c) | 820 | Python, C++, Java, JavaScript, Go | 2023-03 | https://huggingface.co/datasets/THUDM/humaneval-x |
| Multilingual | MultiPL-E (Cassano et al., 2022) | 161 | 18 | 2022-08 | https://huggingface.co/datasets/nuprl/MultiPL-E |
| Multilingual | xCodeEval (Khan et al., 2023) | 5.5M | 11 | 2023-03 | https://github.com/ntunlp/xCodeEval |
| Reasoning | MathQA-X (Athiwaratkun et al., 2022) | 5.6K | Python, Java, JavaScript | 2022-10 | https://huggingface.co/datasets/mxeval/mathqa-x |
| Reasoning | MathQA-Python (Austin et al., 2021) | 23,914 | Python | 2021-08 | https://github.com/google-research/google-research |
| Reasoning | GSM8K (Cobbe et al., 2021) | 8.5K | Python | 2021-10 | https://huggingface.co/datasets/gsm8k |
| Reasoning | GSM-HARD (Gao et al., 2023a) | 1.32K | Python | 2022-11 | https://huggingface.co/datasets/reasoning-machines/gsm-hard |
| Repository | RepoEval (Zhang et al., 2023c) | 3,573 | Python, Java | 2023-03 | https://paperswithcode.com/dataset/repoeval |
| Repository | Stack-Repo (Shrivastava et al., 2023a) | 200 | Java | 2023-06 | https://huggingface.co/datasets/RepoFusion/Stack-Repo |
| Repository | Repobench (Liu et al., 2023b) | 27k | Python, Java | 2023-01 | https://github.com/Leolty/repobench |
| Repository | EvoCodeBench (Li et al., 2024) | 275 | Python | 2024-03 | https://huggingface.co/datasets/LJ0815/EvoCodeBench |
| Repository | SWE-bench (Jimenez et al., 2023) | 2,294 | Python | 2023-10 | https://huggingface.co/datasets/princeton-nlp/SWE-bench |
| Repository | CrossCodeEval (Ding et al., 2024) | 10K | Python, Java, TypeScript, C# | 2023-10 | https://github.com/amazon-science/cceval |
| Repository | SketchEval (Zan et al., 2024) | 20,355 | Python | 2024-03 | https://github.com/nl2code/codes |

Repository

  • RepoEval (Zhang et al., 2023c): RepoEval enables the evaluation of repository-level code completion. It can offer different levels of granularity and improved evaluation accuracy through the use of unit tests.

  • Stack-Repo (Shrivastava et al., 2023a): Stack-Repo is a dataset of 200 Java repositories from GitHub with near-deduplicated files. These files are augmented with three types of repository contexts: prompt proposal contexts, BM25 Contexts (based on BM25 similarity scores), and RandomNN Contexts (obtained using the nearest neighbors in the representation space of an embedding model).

  • Repobench (Liu et al., 2023b): Repobench is a benchmark specifically used for evaluating repository-level code auto-completion systems. Supporting both Python and Java, it consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline).

  • EvoCodeBench (Li et al., 2024): EvoCodeBench is an evolutionary code generation benchmark, constructed through a rigorous pipeline and aligned with real-world repositories. This benchmark also provides comprehensive annotations and robust evaluation metrics.

  • SWE-bench (Jimenez et al., 2023): SWE-bench is a dataset that tests a model’s ability to automatically solve GitHub issues. The dataset has 2,294 Issue-Pull Request pairs from 12 popular Python repositories.

  • CrossCodeEval (Ding et al., 2024): CrossCodeEval is a diverse and multilingual code completion dataset covering four languages: Python, Java, TypeScript, and C#. This benchmark tests the model’s ability to understand in-depth cross-file information and accurately complete the code.

  • SketchEval (Zan et al., 2024): SketchEval is a repository-oriented benchmark that encompasses data from 19 repositories, each varying in complexity. In addition to the dataset, SketchEval introduces a metric, known as SketchBLEU, to measure the similarity between two repositories based on their structures and semantics.

4.2. Data Synthesis

Numerous studies have demonstrated that high-quality datasets are integral to enhancing the performance of large language models (LLMs) in various downstream tasks (Brown et al., 2020; Meng et al., 2022; Xie et al., 2023; Zhou et al., 2024; Köpf et al., 2024; Wettig et al., 2024). For instance, the LIMA model, a 65B-parameter LLaMA language model fine-tuned with a standard supervised loss on a mere 1,000 meticulously curated prompts and responses, achieved performance on par with, or even superior to, GPT-4 in 43% of evaluated cases. This figure rose to 58% when compared to Bard and 65% against DaVinci003, all without the use of reinforcement learning or human preference modeling (Zhou et al., 2024). The QuRating initiative selects pre-training data according to four textual qualities that align with human intuition: writing style, facts & trivia, required expertise, and educational value. Training a 1.3B-parameter model on such data resulted in reduced perplexity and stronger in-context learning compared to baseline models (Wettig et al., 2024).

Despite these advancements, acquiring quality data remains a significant challenge due to issues such as data scarcity, privacy concerns, and prohibitive costs (Wang et al., 2023a; Liu et al., 2024a). Human-generated data is often labor-intensive and expensive to produce, and it may lack the necessary scope and detail to navigate complex, rare, or ambiguous scenarios. As a resolution to these challenges, synthetic data has emerged as a viable alternative. By generating artificial datasets that replicate the intricacies of real-world information, models such as GPT-3.5-turbo (OpenAI, 2022) and GPT-4 (Achiam et al., 2023) have enabled the creation of rich datasets without the need for human annotation (Wang et al., 2023a; Hämäläinen et al., 2023; Liu et al., 2024a; Laurer, 2024). This approach is particularly beneficial in enhancing the instruction-following capabilities of LLMs, with a focus on generating synthetic instruction-based data.

A notable example of this approach is the Self-Instruct (Wang et al., 2023a) framework, which employs an off-the-shelf language model to generate a suite of instructions, inputs, and outputs. This data is then refined by removing invalid or redundant entries before being used to fine-tune the model. Empirical evidence supports the efficacy of this synthetic data generation methodology. Building upon this concept, the Alpaca (Taori et al., 2023) model, obtained by fine-tuning a 7B-parameter LLaMA (Touvron et al., 2023a) model on 52k instruction-following examples, exhibits performance comparable to the text-davinci-003 model. WizardLM (Xu et al., 2023) introduced the Evol-Instruct technique, which incrementally transforms simple instructions into more complex variants. A LLaMA model fine-tuned with this technique has shown promising results in comparison to established proprietary LLMs such as ChatGPT (OpenAI, 2022) and GPT-4 (Achiam et al., 2023). Moreover, Microsoft has contributed to this field with their Phi series of models, predominantly trained on synthetic high-quality data, which includes Phi-1 (1.3B) (Gunasekar et al., 2023) for Python coding, Phi-1.5 (1.3B) (Li et al., 2023b) for common sense reasoning and language understanding, Phi-2 (2.7B) (Mojan Javaheripi, 2023) for advanced reasoning and language understanding, and Phi-3 (3.8B) (Abdin et al., 2024) for general purposes. These models have consistently outperformed larger counterparts across various benchmarks, demonstrating the efficacy of synthetic data in model training.
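The refinement step of Self-Instruct, in which invalid or redundant generations are discarded, can be sketched as a similarity filter. The version below is a simplified illustration: the ROUGE-L computation over whitespace tokens is a stand-in for a proper implementation, the function names are hypothetical, and the 0.7 threshold is an assumption based on our reading of the Self-Instruct paper.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over lowercased whitespace tokens (assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_instructions(
    generated: list[str], pool: list[str], threshold: float = 0.7
) -> list[str]:
    """Keep generated instructions that are non-empty and not near-duplicates."""
    kept: list[str] = []
    for instruction in generated:
        text = instruction.strip()
        if not text:
            continue  # invalid: empty instruction
        if any(rouge_l_f1(text, seen) >= threshold for seen in pool + kept):
            continue  # redundant: too similar to an existing instruction
        kept.append(text)
    return kept


# Hypothetical seed pool and model generations.
pool = ["Write a function that reverses a string."]
generated = [
    "Write a function that reverses a string in Python.",
    "",
    "Implement binary search over a sorted list.",
]
kept = filter_instructions(generated, pool)
```

With these inputs, only the binary search instruction survives: the first generation is a near-duplicate of the seed and the second is empty.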

Drawing on the successes of data synthesis for general-purpose LLMs, researchers have expanded the application of synthetic data to the realm of code generation. The Code Alpaca model (Chaudhary, 2023) was obtained by fine-tuning 7B and 13B LLaMA models on a dataset of 20k instruction-following examples for code generation. This dataset was generated by text-davinci-003 (https://platform.openai.com) using the Self-Instruct technique (Wang et al., 2023a). Building on this, WizardCoder 15B (Luo et al., 2023) utilizes the Evol-Instruct technique to create an enhanced dataset of 78k evolved code instruction examples. This dataset originates from the initial 20k instruction-following dataset used by Code Alpaca (Chaudhary, 2023), which was also generated by text-davinci-003. The WizardCoder model, fine-tuned on the StarCoder (Li et al., 2023a) base model, achieved a 57.3% pass@1 on the HumanEval benchmark. This performance not only surpassed all other open-source Code LLMs of the time by a significant margin but also outperformed leading closed LLMs such as Anthropic’s Claude and Google’s Bard. In a similar vein, Magicoder (Wei et al., 2023) introduces a novel data synthesis approach termed OSS-INSTRUCT, which enlightens LLMs with open-source code snippets to generate high-quality instruction data for coding tasks. It aims to address the inherent biases often present in synthetic data produced by LLMs. Building upon Code Llama (Roziere et al., 2023), the MagicoderS-CL-7B model, fine-tuned on 75k synthetic instruction examples produced with the OSS-INSTRUCT technique using gpt-3.5-turbo-1106 as the data generator, outperformed ChatGPT on the HumanEval Plus benchmark, achieving a pass@1 of 66.5% versus 65.9%. In a noteworthy development, Microsoft has introduced the phi-1 model (Gunasekar et al., 2023), a more compact LLM of only 1.3B parameters. Despite its small size, phi-1, trained on high-quality textbook data sourced from the web (6 billion tokens) and supplemented with synthetic textbooks and exercises generated with GPT-3.5 (1 billion tokens), achieved a pass@1 of 50.6% on HumanEval and 55.5% on MBPP, setting a new state of the art for Python coding performance among existing small language models (SLMs). The latest contribution to this field is from the BigCode team, which has presented StarCoder2-15B-instruct (Yuxiang Wei, 2024), the first entirely self-aligned code LLM trained with a transparent and permissive pipeline. This model aligns closely with the OSS-INSTRUCT principles established by Magicoder, generating instructions based on seed functions filtered from the Stack v1 dataset (Kocetkov et al., 2022) and producing responses through self-validation. Unlike Magicoder, StarCoder2-15B-instruct employs its base model, StarCoder2-15B, as the data generator, thus avoiding reliance on large and proprietary LLMs like GPT-3.5-turbo (OpenAI, 2022).

While synthetic data has demonstrated its potential across both small- and large-scale LMs for a variety of general and specialized tasks, including code generation, it also poses several challenges that must be addressed. These challenges include a lack of data diversity (Wettig et al., 2024), the need to ensure the factuality and fidelity of the information (Wood et al., 2021; Van Breugel et al., 2023), and the potential to amplify existing biases or introduce new ones (Barbierato et al., 2022; Gupta et al., 2021).

4.3. Pre-Training

4.3.1. Model Architectures

Since the inception of the Transformer architecture for machine translation (Vaswani et al., 2017), it has become the de facto backbone for a multitude of large language models (LLMs) that address a wide range of downstream tasks. The Transformer and its derivatives owe their prominence to their exceptional ability to parallelize computation and their powerful representational capacities (Zhao et al., 2023b; Yoo et al., 2024). Through innovative scaling techniques, such as Mixture-of-Experts (MoE) (Shazeer et al., 2017; Cai et al., 2024) and Depth-Up-Scaling (DUS) (Kim et al., 2023), the capacity of Transformer-based LLMs has expanded to encompass hundreds of billions or even trillions of parameters. These scaled-up models have exhibited a range of emergent abilities (Kaplan et al., 2020; Hoffmann et al., 2022; Wei et al., 2022a), such as instruction following (Ouyang et al., 2022), in-context learning (Dong et al., 2022), and step-by-step reasoning (Wei et al., 2022b; Huang and Chang, 2022) that were previously unforeseen.

In the domain of code generation using LLMs, the architecture of contemporary models generally falls into one of two categories: encoder-decoder models, such as CodeT5 (Wang et al., 2021), CodeT5+ (Wang et al., 2023b), and CodeRL (Le et al., 2022); or decoder-only models, such as Codex (Chen et al., 2021), StarCoder (Li et al., 2023a), Code Llama (Roziere et al., 2023), and CodeGemma (CodeGemma Team et al., 2024). These architectures are depicted in Figure 2(b) and (c), respectively. For a comprehensive overview, Table 4 details the encoder-decoder architectures, while Table 5 focuses on the decoder-only models utilized in code generation.

Table 4. The overview of large language models (LLMs) with encoder-decoder architectures for code generation.
| Architecture | Model | Institution | Size | Vocabulary | Context Window | Date | Open Source |
|---|---|---|---|---|---|---|---|
| Encoder-Decoder | PyMT5 (Clement et al., 2020) | Microsoft | 374M | 50K | 1024+1024 | 2020-10 | |
| Encoder-Decoder | PLBART (Ahmad et al., 2021) | UCLA | 140M | 50K | 1024+1024 | 2021-03 | |
| Encoder-Decoder | CodeT5 (Wang et al., 2021) | Salesforce | 60M, 220M, 770M | 32K | 512+256 | 2021-09 | |
| Encoder-Decoder | JuPyT5 (Chandel et al., 2022) | Microsoft | 350M | 50K | 1024+1024 | 2022-01 | |
| Encoder-Decoder | AlphaCode (Li et al., 2022a) | DeepMind | 284M, 1.1B, 2.8B, 8.7B, 41.1B | 8K | 1536+768 | 2022-02 | |
| Encoder-Decoder | CodeRL (Le et al., 2022) | Salesforce | 770M | 32K | 512+256 | 2022-06 | |
| Encoder-Decoder | ERNIE-Code (Chai et al., 2022) | Baidu | 560M | 250K | 1024+1024 | 2022-12 | |
| Encoder-Decoder | PPOCoder (Shojaee et al., 2023) | Virginia Tech | 770M | 32K | 512+256 | 2023-01 | |
| Encoder-Decoder | CodeT5+ (Wang et al., 2023b) | Salesforce | 220M, 770M, 2B, 6B, 16B | 50K | 2048+2048 | 2023-05 | |
| Encoder-Decoder | CodeFusion (Singh et al., 2023) | Microsoft | 75M | 32k | 128+128 | 2023-10 | |
| Encoder-Decoder | AST-T5 (Gong et al., 2024) | UC Berkeley | 226M | 32k | 512+200/300 | 2024-01 | |
Table 5. The overview of large language models (LLMs) with decoder-only architectures for code generation.
| Architecture | Model | Institution | Size | Vocabulary | Context Window | Date | Open Source |
|---|---|---|---|---|---|---|---|
| Decoder-Only | GPT-C (Svyatkovskiy et al., 2020) | Microsoft | 366M | 60K | 1024 | 2020-05 | |
| Decoder-Only | CodeGPT (Lu et al., 2021) | Microsoft | 124M | 50K | 1024 | 2021-02 | |
| Decoder-Only | GPT-Neo (Black et al., 2021) | EleutherAI | 125M, 1.3B, 2.7B | 50k | 2048 | 2021-03 | |
| Decoder-Only | GPT-J (Wang and Komatsuzaki, 2021) | EleutherAI | 6B | 50k | 2048 | 2021-05 | |
| Decoder-Only | Codex (Chen et al., 2021) | OpenAI | 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, 12B | - | 4096 | 2021-07 | |
| Decoder-Only | CodeParrot (Tunstall et al., 2022) | Hugging Face | 110M, 1.5B | 33k | 1024 | 2021-11 | |
| Decoder-Only | PolyCoder (Xu et al., 2022) | CMU | 160M, 400M, 2.7B | 50k | 2048 | 2022-02 | |
| Decoder-Only | CodeGen (Nijkamp et al., 2022) | Salesforce | 350M, 2.7B, 6.1B, 16.1B | 51k | 2048 | 2022-03 | |
| Decoder-Only | GPT-NeoX (Black et al., 2022) | EleutherAI | 20B | 50k | 2048 | 2022-04 | |
| Decoder-Only | PaLM-Coder (Chowdhery et al., 2023) | Google | 8B, 62B, 540B | 256k | 2048 | 2022-04 | |
| Decoder-Only | InCoder (Fried et al., 2022) | Meta | 1.3B, 6.7B | 50k | 2049 | 2022-04 | |
| Decoder-Only | PanGu-Coder (Christopoulou et al., 2022) | Huawei | 317M, 2.6B | 42k | 1024 | 2022-07 | |
| Decoder-Only | PyCodeGPT (Zan et al., 2022) | Microsoft | 110M | 32k | 1024 | 2022-06 | |
| Decoder-Only | CodeGeeX (Zheng et al., 2023c) | Tsinghua | 13B | 52k | 2048 | 2022-09 | |
| Decoder-Only | BLOOM (Le Scao et al., 2023) | BigScience | 176B | 251k | - | 2022-11 | |
| Decoder-Only | ChatGPT (OpenAI, 2022) | OpenAI | - | - | 16k | 2022-11 | |
| Decoder-Only | SantaCoder (Allal et al., 2023) | Hugging Face | 1.1B | 49k | 2048 | 2022-12 | |
| Decoder-Only | LLaMA (Touvron et al., 2023a) | Meta | 6.7B, 13.0B, 32.5B, 65.2B | 32K | 2048 | 2023-02 | |
| Decoder-Only | GPT-4 (Achiam et al., 2023) | OpenAI | - | - | 32K | 2023-03 | |
| Decoder-Only | CodeGen2 (Nijkamp et al., 2023) | Salesforce | 1B, 3.7B, 7B, 16B | 51k | 2048 | 2023-05 | |
| Decoder-Only | replit-code (Replit, 2023) | replit | 3B | 33k | 2048 | 2023-05 | |
| Decoder-Only | StarCoder (Li et al., 2023a) | Hugging Face | 15.5B | 49k | 8192 | 2023-05 | |
| Decoder-Only | WizardCoder (Luo et al., 2023) | Microsoft | 15B, 34B | 49k | 8192 | 2023-06 | |
| Decoder-Only | phi-1 (Gunasekar et al., 2023) | Microsoft | 1.3B | 51k | 2048 | 2023-06 | |
| Decoder-Only | CodeGeeX2 (Zheng et al., 2023c) | Tsinghua | 6B | 65k | 8192 | 2023-07 | |
| Decoder-Only | PanGu-Coder2 (Shen et al., 2023) | Huawei | 15B | 42k | 1024 | 2023-07 | |
| Decoder-Only | Llama 2 (Touvron et al., 2023b) | Meta | 7B, 13B, 70B | 32K | 4096 | 2023-07 | |
| Decoder-Only | OctoCoder (Muennighoff et al., 2023) | Hugging Face | 15.5B | 49k | 8192 | 2023-08 | |
| Decoder-Only | Code Llama (Roziere et al., 2023) | Meta | 7B, 13B, 34B | 32k | 16384 | 2023-08 | |
| Decoder-Only | CodeFuse (Liu et al., 2023a) | Ant Group | 350M, 13B, 34B | 101k | 4096 | 2023-09 | |
| Decoder-Only | phi-1.5 (Li et al., 2023b) | Microsoft | 1.3B | 51k | 2048 | 2023-09 | |
| Decoder-Only | CodeShell (Xie et al., 2024) | Peking University | 7B | 70k | 8192 | 2023-10 | |
| Decoder-Only | Magicoder (Wei et al., 2023) | UIUC | 7B | 32k | 16384 | 2023-12 | |
| Decoder-Only | AlphaCode 2 (AlphaCode Team, 2023) | Google DeepMind | - | - | - | 2023-12 | |
| Decoder-Only | StableCode (Pinnaparaju et al., 2024) | StabilityAI | 3B | 50k | 16384 | 2024-01 | |
| Decoder-Only | WaveCoder (Yu et al., 2023) | Microsoft | 6.7B | 32k | 16384 | 2023-12 | |
| Decoder-Only | phi-2 (Mojan Javaheripi, 2023) | Microsoft | 2.7B | 51k | 2048 | 2023-12 | |
| Decoder-Only | DeepSeek-Coder (Guo et al., 2024) | DeepSeek | 1.3B, 6.7B, 33B | 32k | 16384 | 2023-11 | |
| Decoder-Only | StarCoder 2 (Lozhkov et al., 2024) | Hugging Face | 15B | 49k | 16384 | 2024-02 | |
| Decoder-Only | Claude 3 (Anthropic, 2024) | Anthropic | - | - | 200K | 2024-03 | |
| Decoder-Only | CodeGemma (CodeGemma Team et al., 2024) | Google | 2B, 7B | 25.6k | 8192 | 2024-04 | |
| Decoder-Only | Code-Qwen (Team, 2024) | Qwen Group | 7B | 92K | 65536 | 2024-04 | |
| Decoder-Only | Llama3 (Meta, 2024) | Meta | 8B, 70B | 128K | 8192 | 2024-04 | |
| Decoder-Only | StarCoder2-Instruct (Yuxiang Wei, 2024) | Hugging Face | 15.5B | 49K | 16384 | 2024-04 | |

4.3.2. Pre-training Tasks

In the initial phase, language models for code generation are typically trained from scratch using datasets consisting of manually annotated pairs of natural language descriptions and corresponding code snippets, within a supervised learning framework. However, manual annotation is not only laborious and time-consuming, but the efficacy of the resulting models is also constrained by both the volume and the quality of the available annotated data. This limitation is especially pronounced for low-resource programming languages, where annotated examples are scarce (Chen et al., 2022a; Cassano et al., 2023). In light of these challenges, there has been a shift towards an alternative training strategy that involves pre-training models on extensive and unlabelled code corpora. This method is aimed at imbuing the models with a broad understanding of programming knowledge, encompassing elements like identifiers, code structure, and underlying semantics (Chen et al., 2021). In this regard, two pre-training tasks have gained prominence for their effectiveness, namely Causal Language Modeling (CLM), also known as unidirectional language modeling or next-token prediction, and Denoising Autoencoding (DAE). The CLM task can be applied to both decoder-only and encoder-decoder model architectures, while DAE tasks are specifically designed for encoder-decoder frameworks. It should also be noted that a variety of auxiliary pre-training tasks can further enhance model performance. These include Masked Identifier Prediction, Identifier Tagging, Bimodal Dual Generation (Wang et al., 2021), Text-Code Matching, and Text-Code Contrastive Learning (Wang et al., 2023b). These tasks contribute to a more nuanced and comprehensive pre-training process, equipping the models with the capabilities necessary to handle a wide range of code generation scenarios.

Causal Language Modeling. In decoder-only LLMs, given a sequence of tokens $\mathbf{x}=\{x_{1},\dots,x_{n}\}$, the CLM task refers to autoregressively predicting each target token $x_{i}$ based on the preceding tokens $\mathbf{x}_{<i}$ in the sequence. The causal language modeling objective for training decoder-only LLMs is to minimize the following negative log-likelihood:

$$\mathcal{L}_{CLM}^{Decoder-only}(\mathbf{x})=-\log\Big(\prod_{i=1}^{n}P_{\theta}(x_{i}\mid\mathbf{x}_{<i})\Big)=\sum_{i=1}^{n}-\log P_{\theta}(x_{i}\mid\mathbf{x}_{<i}) \tag{15}$$

where $\mathbf{x}_{<i}$ represents the sequence of preceding tokens $\{x_{1},\dots,x_{i-1}\}$ before $x_{i}$ in the input, and $\theta$ denotes the model parameters. The conditional probability $P_{\theta}(x_{i}\mid\mathbf{x}_{<i})$ is modeled by adding a causal attention mask to the multi-head self-attention matrix of each Transformer block. To be specific, causal attention masking is implemented by setting the lower triangular part of the mask matrix (including the diagonal) to 0 and the remaining elements to $-\infty$, ensuring that each token $x_{i}$ attends only to its predecessors and itself. In contrast, in encoder-decoder LLMs, a pivot token $x_{k}$ is randomly selected in a sequence of tokens; the context up to and including it serves as the source sequence $\mathbf{x}_{in}=\{x_{1},\dots,x_{k}\}$ of the encoder, and the sequence after it serves as the target output $\mathbf{x}_{out}=\{x_{k+1},\dots,x_{n}\}$ of the decoder. Formally, the causal language modeling objective for training encoder-decoder LLMs is to minimize the following loss function:

$$\mathcal{L}_{CLM}^{Encoder-Decoder}(\mathbf{x})=-\log\Big(\prod_{i=k+1}^{n}P_{\theta}(x_{i}\mid\mathbf{x}_{\leq k},\mathbf{x}_{<i})\Big)=\sum_{i=k+1}^{n}-\log P_{\theta}(x_{i}\mid\mathbf{x}_{\leq k},\mathbf{x}_{<i}) \tag{16}$$

where $\mathbf{x}_{\leq k}$ is the source sequence input and $\mathbf{x}_{<i}$ denotes the target sequence autoregressively generated so far. During the inference phase, LLMs pre-trained on large-scale code corpora can generate code in a zero-shot manner without the need for fine-tuning. This is achieved through the technique of prompt engineering, which guides the model to produce the desired output (Radford et al., 2019; Brown et al., 2020); for more information on prompt engineering, see https://www.promptingguide.ai. Additionally, recent studies have explored the use of few-shot learning, also referred to as in-context learning, to enhance model performance further (Li et al., 2023d; Patel et al., 2023).
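To make the decoder-only objective and the causal mask concrete, the following is a minimal pure-Python sketch. It is an illustration rather than the implementation of any particular model: the function names are hypothetical, and the per-token probabilities are supplied directly instead of being produced by a Transformer.

```python
import math


def causal_mask(n: int) -> list[list[float]]:
    """Additive attention mask: 0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; exp(-inf) evaluates to 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """Add the causal mask to raw attention scores, then normalize each row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(row, mask_row)])
        for row, mask_row in zip(scores, mask)
    ]


def clm_loss(token_probs: list[float]) -> float:
    """Negative log-likelihood: sum over positions of -log P(x_i | x_<i)."""
    return sum(-math.log(p) for p in token_probs)


# With uniform raw scores, position i spreads its attention evenly over
# positions 0..i and assigns zero weight to every later position.
weights = masked_attention_weights([[0.0] * 3 for _ in range(3)])

# Hypothetical probabilities the model assigns to the correct next tokens.
probs = [0.5, 0.25, 0.125]
loss = clm_loss(probs)
```

The sum of per-token terms equals the negative log of the product of the probabilities, matching the two forms of the decoder-only loss above.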

Denoising Autoencoding. In addition to causal language modeling (CLM), the denoising autoencoding (DAE) task has been extensively applied in pre-training encoder-decoder architectures for code generation, such as PLBART (Ahmad et al., 2021), CodeT5 (Wang et al., 2021), and its enhanced successor, CodeT5+ (Wang et al., 2023b). Following T5 (Raffel et al., 2020) and CodeT5 (Wang et al., 2021), the DAE task first perturbs the source sequence by randomly masking spans of varying lengths. This corrupted sequence serves as the input for the encoder. Subsequently, the decoder employs an autoregressive strategy to reconstruct the masked spans, integrating sentinel tokens to facilitate the generation process. This method has proven effective in improving the model’s ability to generate semantically and syntactically accurate code by learning robust contextual representations (Wang et al., 2021, 2023b). Formally, the denoising autoencoding objective for training encoder-decoder LLMs is to minimize the following negative log-likelihood:

$$\mathcal{L}_{DAE}^{Encoder-Decoder}(\mathbf{x})=\sum_{i=1}^{k}-\log P_{\theta}(\mathbf{x}_{i}^{masked\_spans}\mid\mathbf{x}^{\backslash masked\_spans},\mathbf{x}_{<i}^{masked\_spans}) \tag{17}$$

where $\theta$ denotes the model parameters, $\mathbf{x}^{\backslash masked\_spans}$ is the noisy input with masked spans, $\mathbf{x}^{masked\_spans}$ is the masked spans to be predicted by the decoder, with $k$ denoting the number of tokens in $\mathbf{x}^{masked\_spans}$, and $\mathbf{x}_{<i}^{masked\_spans}$ is the span sequence autoregressively generated so far. Compared with CLM, the DAE task presents a more challenging scenario, as it requires LLMs to capture the intrinsic semantic relationships among token sequences more deeply (Raffel et al., 2020).
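The span masking described above can be illustrated with a small sketch. It assumes T5-style sentinel tokens named `<extra_id_N>` (the naming used by common T5 tokenizers, adopted here as an assumption), takes the spans as explicit arguments instead of sampling them at random, and omits the closing sentinel that T5 appends to the target. The function name is hypothetical.

```python
def corrupt_spans(
    tokens: list[str], spans: list[tuple[int, int]]
) -> tuple[list[str], list[str]]:
    """Replace each (start, length) span with a sentinel token.

    Returns the corrupted encoder input and the decoder target, in which each
    sentinel is followed by the tokens it replaced. Spans are assumed to be
    sorted and non-overlapping; in pre-training they are sampled at random.
    """
    encoder_input: list[str] = []
    decoder_target: list[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        encoder_input.extend(tokens[cursor:start])  # keep the unmasked context
        encoder_input.append(sentinel)
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:start + length])  # span to reconstruct
        cursor = start + length
    encoder_input.extend(tokens[cursor:])
    return encoder_input, decoder_target


# Toy example: mask the function name and part of the return statement.
tokens = "def add ( a , b ) : return a + b".split()
encoder_input, decoder_target = corrupt_spans(tokens, [(1, 1), (8, 3)])
```

Here the encoder sees `def <extra_id_0> ( a , b ) : <extra_id_1> b`, and the decoder is trained to emit `<extra_id_0> add <extra_id_1> return a +`, so the loss is summed over the tokens of the target sequence only.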

4.4. Instruction Tuning

Figure 5. Two exemplars of instruction data sampled from Code Alpaca (Chaudhary, 2023), used to instruction-tune pre-trained code LLMs and improve their alignment with natural language instructions. The instruction corpus encompasses a variety of tasks, each accompanied by distinct instructions, such as prime number generation and URL extraction.

After pre-training large language models (LLMs) on large-scale datasets, the next phase typically involves augmenting the model’s ability to process and follow instructions, known as instruction tuning. Instruction tuning generally refers to the supervised fine-tuning of pre-trained LLMs using datasets composed of structured examples framed as natural language instructions (Wei et al., 2021; Ouyang et al., 2022; Iyer et al., 2022; Zhang et al., 2023d). Two exemplars of instruction data sampled from Code Alpaca (Chaudhary, 2023) are shown in Figure 5. Instruction tuning capitalizes on the heterogeneity of instruction types and can therefore be viewed as a form of multi-task prompted training that significantly enhances the model’s generalization to unseen tasks (Wei et al., 2021; Sanh et al., 2022; Ouyang et al., 2022; Chung et al., 2024).

In the realm of code generation, natural language descriptions serve as the instructions guiding the model to generate corresponding code snippets. Consequently, a line of research on instruction tuning LLMs for code generation has garnered substantial interest across academia and industry. To perform instruction tuning, instruction data are typically compiled from source code with permissive licenses (Husain et al., 2019; Kocetkov et al., 2022; Lozhkov et al., 2024) (refer to Section 4.1.2) or are constructed from synthetic code data (Luo et al., 2023; Wei et al., 2023; Yuxiang Wei, 2024) (refer to Section 4.2). These datasets are then used to fine-tune LLMs through a supervised learning paradigm. However, the substantial computational resources required for full parameter fine-tuning (FFT) of LLMs pose a notable challenge, particularly in resource-constrained scenarios (Ding et al., 2022a; Lialin et al., 2023). To mitigate this issue, parameter-efficient fine-tuning (PEFT) has emerged as a compelling alternative strategy, gaining increasing attention for its potential to reduce resource consumption (Ding et al., 2022a). In the following subsections, we categorize existing works based on their instruction-tuning strategies to provide a comprehensive and systematic review.
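As a rough illustration of the supervised paradigm described above, the sketch below serializes one instruction sample and builds a label sequence so that only the response contributes to the loss. The prompt template, the whitespace "tokenizer", and the helper `build_example` are hypothetical; masking the prompt is one common choice rather than something prescribed here, and the value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # mirrors PyTorch's default ignore_index; an assumption here

# Hypothetical template, loosely modeled on Alpaca-style prompts.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(sample: Dict[str, str]) -> Dict[str, List]:
    """Turn one instruction sample into tokens plus a loss mask."""
    prompt = TEMPLATE.format(instruction=sample["instruction"])
    prompt_tokens = prompt.split()            # toy whitespace "tokenizer"
    response_tokens = sample["output"].split()
    tokens = prompt_tokens + response_tokens
    # Token ids are faked as positions in a tiny on-the-fly vocabulary.
    vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
    input_ids = [vocab[t] for t in tokens]
    # Supervise only the response: prompt positions are masked out of the loss.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + input_ids[len(prompt_tokens):]
    return {"tokens": tokens, "input_ids": input_ids, "labels": labels}


example = build_example({
    "instruction": "Write a function that returns the square of x.",
    "output": "def square(x): return x * x",
})
```

A real pipeline would replace the toy tokenizer with the model's own tokenizer and append an end-of-sequence token to the response.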

4.4.1. Full Parameter Fine-tuning

Full parameter fine-tuning (FFT) involves updating all parameters within a pre-trained model, as shown in Figure 6(a). This approach is often preferred when ample computational resources and substantial training data are available, as it typically leads to better performance. (Wang et al., 2021) introduces an encoder-decoder pre-trained language model for code generation, named CodeT5+. They instruction-tune this model on a dataset comprising 20K instruction samples from Code Alpaca (Chaudhary, 2023), resulting in an instruction-following model called InstructCodeT5+, which exhibits improved capabilities in code generation. (Luo et al., 2023) leverages the Evol-Instruct data synthesis technique from WizardLM (Xu et al., 2023) to evolve the 20K Code Alpaca (Chaudhary, 2023) instruction samples into a 78K code instruction dataset. This enriched dataset is then used to fine-tune the StarCoder base model, resulting in WizardCoder, which showcases notable advancements in code generation. In a similar vein, inspired by the successes of WizardCoder (Luo et al., 2023) and RRHF (Yuan et al., 2023), PanGu-Coder 2 (Shen et al., 2023) applies the Evol-Instruct method to generate 68K high-quality instruction samples from the initial 20K Code Alpaca (Chaudhary, 2023) instruction samples. Additionally, they introduce a novel framework, Reinforcement Learning via Rank Responses to align Test & Teacher Feedback (RRTF), which further enhances the performance of PanGu-Coder 2 in code generation. Diverging from synthetic instruction data generation methods, OctoPack (Muennighoff et al., 2023) utilizes real-world data by curating CommitPack from the natural structure of Git commits, which inherently pair code changes with human-written instructions. This dataset, consisting of 4 terabytes of Git commits across 350 programming languages, is employed to fine-tune StarCoder (Li et al., 2023a) and CodeGeeX2 (Zheng et al., 2023c), leading to the instruction-following code models OctoCoder and OctoGeeX, respectively. The most recent innovation comes from Magicoder (Wei et al., 2023), which proposes OSS-INSTRUCT, a novel data synthesis method that leverages open-source code snippets to generate high-quality instruction data for code generation. This approach seeks to reduce the bias often present in synthetic data generated by LLMs. In line with OSS-INSTRUCT, the BigCode team introduces StarCoder2-15B-instruct (Yuxiang Wei, 2024), which they claim to be the first entirely self-aligned LLM for code generation, trained with a fully permissive and transparent pipeline. Moreover, (CodeGemma Team et al., 2024) harnesses open-source mathematics datasets, such as MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021), along with synthetically generated code following the OSS-INSTRUCT (Wei et al., 2023) paradigm, to instruction-tune CodeGemma 7B, yielding exceptional results in mathematical reasoning and code generation tasks.

4.4.2. Parameter-Efficient Fine-tuning

To mitigate the extensive computational and resource demands inherent in fine-tuning large language models (LLMs), parameter-efficient fine-tuning (PEFT) updates only a minimal subset of parameters, which may either be a selection of the model’s own parameters or a set of additional parameters introduced specifically for the tuning process (Ding et al., 2022a; Lialin et al., 2023). The categorization of these methods is depicted in Figure 6(b), (c), and (d). Many PEFT approaches have been developed, among which BitFit (Zaken et al., 2021), Adapter (Houlsby et al., 2019), Prompt tuning (Lester et al., 2021), Prefix-tuning (Li and Liang, 2021), LoRA (Hu et al., 2021), IA3 (Liu et al., 2022), QLoRA (Dettmers et al., 2024), and AdaLoRA (Zhang et al., 2023a) are particularly noteworthy. A seminal study in this field, LoRA (Hu et al., 2021), proposes a parameter update mechanism for a pre-trained weight matrix (such as the key or value projection matrices of a Transformer block’s multi-head self-attention layer) that factorizes the update into two low-rank matrices. Crucially, all original model parameters remain frozen, with only the pair of low-rank matrices being trainable. After fine-tuning, the product of these low-rank matrices can be merged into the existing weight matrix through element-wise addition. This process can be formally described as:

(18) $(\mathbf{W}_{0}+\Delta\mathbf{W})x=\mathbf{W}_{0}x+\Delta\mathbf{W}x=\mathbf{W}_{0}^{frozen}x+\underbrace{\frac{\alpha}{r}\mathbf{B}^{trainable}_{up}\mathbf{A}^{trainable}_{down}}_{\Delta\mathbf{W}}x$

where $\mathbf{W}_{0}\in\mathbb{R}^{d\times k}$ denotes a pre-trained weight matrix, and $\mathbf{B}^{trainable}_{up}\in\mathbb{R}^{d\times r}$ and $\mathbf{A}^{trainable}_{down}\in\mathbb{R}^{r\times k}$ are two trainable low-rank matrices, initialized with a zero matrix and a random Gaussian distribution $\mathcal{N}(0,\sigma^{2})$ respectively, to ensure $\Delta\mathbf{W}=0$ at the beginning of training. The rank satisfies $r\ll\min(d,k)$, and $\frac{\alpha}{r}$ is a scaling coefficient that balances the importance of the LoRA module, acting like a learning rate.
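Equation (18) can be checked numerically with a small, dependency-free sketch. The toy dimensions, the standard deviation 0.02, the value of `alpha`, and the hand-written matrix helpers are assumptions for illustration; a practical implementation would use a tensor library and learn `A` and `B` by gradient descent.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a: Matrix, s: float) -> Matrix:
    return [[s * x for x in row] for row in a]


d, k, r, alpha = 4, 3, 2, 8.0   # toy sizes; real layers have r << min(d, k)
rng = random.Random(0)

W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen, d x k
B = [[0.0] * r for _ in range(d)]                                 # zero init, d x r
A = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]    # Gaussian init, r x k


def lora_forward(x: Matrix) -> Matrix:
    """Compute W0 x + (alpha / r) * B A x for a column vector x (k x 1)."""
    return add(matmul(W0, x), scale(matmul(B, matmul(A, x)), alpha / r))


def merged_weight() -> Matrix:
    """Fold the low-rank update into W0 by element-wise addition."""
    return add(W0, scale(matmul(B, A), alpha / r))


x = [[1.0], [2.0], [3.0]]
y_init = lora_forward(x)            # equals W0 x because B starts at zero

B[0][0] = 0.5                       # stand-in for a gradient update to B
y_adapter = lora_forward(x)
y_merged = matmul(merged_weight(), x)
```

The adapter path and the merged weight give the same output up to floating-point error, which is why a merged LoRA model adds no inference latency relative to the base model. In this toy setting the trainable parameters number $r(d+k)=14$ against $dk=12$ frozen ones; the savings only appear when $r\ll\min(d,k)$.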

Despite the advancements in PEFT methods, their application in code generation remains limited. For instance, (Jiang and Kim, 2023) pioneered the use of parameter-efficient instruction-tuning on a Llama 2 (Touvron et al., 2023b) model with a single RTX 3090 GPU, leading to the development of a multilingual code generation model called CodeUp. More recently, ASTRAIOS (Zhuo et al., 2024) conducted a thorough empirical examination of parameter-efficient instruction tuning for code comprehension and generation tasks. This study yielded several perceptive observations and conclusions, contributing valuable insights to the domain.

Figure 6. An illustration of full parameter fine-tuning (FFT) and parameter-efficient fine-tuning (PEFT) methods. (a) refers to the Full Fine-tuning method, which updates all parameters of the base model during fine-tuning. (b) stands for the Specification-based PEFT method that conditionally fine-tunes a small subset of the model parameters while freezing the rest of the model, e.g. BitFit (Zaken et al., 2021). (c) represents the Addition-based PEFT method that fine-tunes the incremental parameters introduced into the base model or input, e.g. Adapter (Houlsby et al., 2019), Prefix-tuning (Li and Liang, 2021), and Prompt-tuning (Lester et al., 2021). (d) symbolizes the Reparameterization-based method which reparameterizes existing model parameters by low-rank transformation, e.g. LoRA (Hu et al., 2021), QLoRA (Dettmers et al., 2024), and AdaLoRA (Zhang et al., 2023a).

4.5. Reinforcement Learning with Feedback

Large language models (LLMs) have exhibited remarkable instruction-following capabilities through instruction tuning. However, they often produce outputs that are unexpected, toxic, biased, or hallucinated, and that do not align with users’ intentions or preferences (Ouyang et al., 2022; Wang et al., 2023c; Ji et al., 2023). Consequently, aligning LLMs with human preference has emerged as a pivotal area of research. A notable work is InstructGPT (Ouyang et al., 2022), which further fine-tunes an instruction-tuned model using reinforcement learning from human feedback (RLHF) on a dataset where labelers have ranked model outputs in order of quality, from best to worst. This method has been instrumental in the development of advanced conversational language models, such as ChatGPT (OpenAI, 2022) and Bard (Manyika and Hsiao, 2023). Despite its success, acquiring high-quality human preference ranking data is a resource-intensive process (Lee et al., 2023). To address this, Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022; Lee et al., 2023) has been proposed to leverage powerful off-the-shelf LLMs (e.g., ChatGPT (OpenAI, 2022) and GPT-4 (Achiam et al., 2023)) to simulate human annotators by generating preference data.

Building on RLHF’s success, researchers have explored reinforcement learning with feedback to enhance code generation in LLMs. Unlike RLHF, which relies on human feedback, this approach employs compilers or interpreters to automatically provide feedback on code samples by executing them on unit test cases, which has catalyzed the advancement of this research domain. CodeRL (Le et al., 2022) introduced an actor-critic reinforcement learning framework for code generation. In this setup, the language model serves as the actor network, while a token-level functional correctness reward predictor acts as the critic. Generated code is assessed through unit test signals from a compiler, which can indicate compiler errors, runtime errors, unit test failures, or passes. CompCoder (Wang et al., 2022a) enhances code compilability by employing compiler feedback, including language model fine-tuning, compilability reinforcement, and compilability discrimination strategies. Subsequently, PPOCoder (Shojaee et al., 2023) integrates the pre-trained code model CodeT5 (Wang et al., 2021) with Proximal Policy Optimization (PPO) (Schulman et al., 2017). This integration not only utilizes execution (i.e., compiler or interpreter) feedback to assess syntactic and functional correctness but also incorporates a reward function that evaluates the syntactic and semantic congruence between abstract syntax tree (AST) sub-trees and data flow graph (DFG) edges in the generated code against the ground truth. Additionally, the framework applies a KL-divergence penalty to maintain fidelity between the actively learned policy and the reference pre-trained model, enhancing the optimization process. More recently, RLTF (Liu et al., 2023d) has proposed an online reinforcement learning framework that provides fine-grained feedback based on compiler error information and location, along with adaptive feedback that considers the ratio of passed test cases.
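The four execution outcomes mentioned above (compiler error, runtime error, unit test failure, pass) can be turned into a scalar reward along the following lines. This is a minimal sketch: the reward values, the `execution_reward` helper, and the in-process `exec` are illustrative assumptions, and a real system would run generated code in an isolated sandbox with time limits.

```python
from typing import List, Tuple

Case = Tuple[tuple, object]  # (positional arguments, expected return value)

# Reward values are assumptions chosen for illustration.
REWARDS = {"compile_error": -1.0, "runtime_error": -0.6, "test_failure": -0.3, "pass": 1.0}


def execution_reward(code: str, entry_point: str, cases: List[Case]) -> Tuple[str, float]:
    """Classify a generated program by running it on unit tests and map the outcome to a reward."""
    try:
        compiled = compile(code, "<generated>", "exec")
    except SyntaxError:
        return "compile_error", REWARDS["compile_error"]
    namespace: dict = {}
    try:
        exec(compiled, namespace)        # NOTE: untrusted code needs a real sandbox
        func = namespace[entry_point]
        results = [func(*args) == expected for args, expected in cases]
    except Exception:
        return "runtime_error", REWARDS["runtime_error"]
    outcome = "pass" if all(results) else "test_failure"
    return outcome, REWARDS[outcome]


cases = [((2, 3), 5), ((0, 0), 0)]
good = execution_reward("def add(a, b):\n    return a + b\n", "add", cases)
buggy = execution_reward("def add(a, b):\n    return a - b\n", "add", cases)
broken = execution_reward("def add(a, b) return a + b", "add", cases)
```

A sequence-level reward like this one is sparse; finer-grained variants, such as weighting by the fraction of passed tests or localizing the error, follow the same pattern with a richer return value.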

Despite these successes, reinforcement learning algorithms face inherent limitations such as inefficiency, instability, extensive resource requirements, and complex hyperparameter tuning, which can impede the performance and scalability of LLMs. To overcome these challenges, recent studies have introduced variants of RL methods that do not rely on PPO, including DPO (Rafailov et al., 2024), RRHF (Yuan et al., 2023), and sDPO (Kim et al., 2024). In essence, these methods aim to increase the log conditional probability of preferred responses relative to that of rejected responses, where the responses may be produced by LLMs with varying capabilities. Inspired by RRHF (Yuan et al., 2023), PanGu-Coder 2 (Shen et al., 2023) leverages a novel framework, Reinforcement Learning via Rank Responses to align Test & Teacher Feedback (RRTF), significantly enhancing code generation capabilities, as evidenced by a pass@1 of 62.20% on the HumanEval benchmark.
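For DPO in particular, the objective for a single preference pair fits in a few lines. The sketch below assumes that sequence-level log-probabilities under the policy and under a frozen reference model are already available; the function name `dpo_loss`, the example numbers, and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); fine for the moderate margins used here
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the preferred response: loss decreases.
loss_after = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model or sampling loop appears in this computation, which is the practical appeal of such methods compared with PPO-based training.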

Taking a step forward, the integration of more non-differentiable code features, such as coding style (Markovtsev et al., 2019; Chen and Abedjan, 2023) and readability (Buse and Weimer, 2009), into the reinforcement learning feedback for LLM-based code generation, presents an exciting avenue for future research.

4.6. Prompting Engineering

Large-scale language models (LLMs) such as GPT-3 and its successors have been trained on large-scale data corpora, endowing them with substantial world knowledge (Brown et al., 2020; Wei et al., 2021; Ouyang et al., 2022). Despite this, crafting an effective prompt to harness the full potential of LLMs remains a long-standing challenge (Liu et al., 2023c). Recent advancements in prompting engineering have expanded the capabilities of LLMs, enabling more sophisticated task completion and enhancing both reliability and performance. Notable techniques include Chain-of-Thought (CoT) (Wei et al., 2022b), Self-Consistency (Wang et al., 2022b), Tree-of-Thought (ToT) (Yao et al., 2024), Reasoning via Planning (RAP) (Hao et al., 2023), ReAct (Yao et al., 2023), Self-Refine (Madaan et al., 2024), Reflexion (Shinn et al., 2024), and LATS (Zhou et al., 2023).

Prompting engineering is particularly advantageous as it bypasses the need for additional training and can significantly elevate performance. Consequently, numerous studies have leveraged this technique for iterative and self-improving (refining) code generation within proprietary LLMs such as ChatGPT and GPT-4. Figure 7 illustrates the general pipeline for self-improving code generation with LLMs. For instance, Self-Debugging (Chen et al., 2023) involves prompting an LLM to iteratively refine a predicted program by utilizing feedback composed of code explanations combined with execution results, which assists in identifying and rectifying errors. When unit tests are unavailable, this feedback can rely solely on code explanations. In parallel, SelfEvolve (Jiang et al., 2023) employs a two-stage process where LLMs first generate domain-specific knowledge for a problem, followed by a trial code. This code is then iteratively refined through interactive prompting and execution feedback. An empirical investigation by (Olausson et al., 2023) provides a comprehensive analysis of the self-repairing capabilities for code generation in models like Code Llama, GPT-3.5, and GPT-4, using problem sets from HumanEval and APPS. This study yields a series of insightful observations and findings, shedding light on the self-refinement effectiveness of these LLMs. Moreover, Reflexion (Shinn et al., 2024) introduces a general approach for code generation wherein LLM-powered agents engage in verbal self-reflection on task feedback signals, storing these reflections in an episodic memory buffer to inform and improve decision-making in subsequent interactions. LATS (Zhou et al., 2023) adopts a novel strategy, utilizing LLMs as agents, value functions, and optimizers. It enhances decision-making by meticulously constructing trajectories through Monte Carlo Tree Search (MCTS) algorithms, integrating external feedback, and learning from experience. 
This approach has demonstrated remarkable results in code generation, achieving a pass@1 of 94.4% on the HumanEval benchmark with GPT-4.
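The iterative refinement pattern shared by these methods can be sketched as a loop that generates code, executes it, and feeds the execution result back into the next prompt. Everything named here (`self_refine`, `run_tests`, the prompt wording, and `stub_llm`, which stands in for a real model call) is hypothetical and does not reproduce any specific method above.

```python
import traceback
from typing import Callable, List, Optional, Tuple

Case = Tuple[tuple, object]  # (positional arguments, expected return value)


def run_tests(code: str, entry_point: str, cases: List[Case]) -> Optional[str]:
    """Return None if all cases pass, else a textual feedback message."""
    namespace: dict = {}
    try:
        exec(code, namespace)            # NOTE: sandbox untrusted code in practice
        func = namespace[entry_point]
        for args, expected in cases:
            got = func(*args)
            if got != expected:
                return f"{entry_point}{args} returned {got!r}, expected {expected!r}"
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def self_refine(llm: Callable[[str], str], task: str, entry_point: str,
                cases: List[Case], max_rounds: int = 3) -> Tuple[str, int]:
    """Generate, execute, and re-prompt with feedback until tests pass or the budget runs out."""
    code = llm(task)
    for round_idx in range(max_rounds):
        feedback = run_tests(code, entry_point, cases)
        if feedback is None:
            return code, round_idx
        prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nFeedback:\n{feedback}\n\nFix the code."
        code = llm(prompt)
    return code, max_rounds              # last attempt is returned unverified


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: wrong first, corrected once it sees feedback."""
    if "Feedback:" in prompt:
        return "def is_even(n):\n    return n % 2 == 0\n"
    return "def is_even(n):\n    return n % 2 == 1\n"


final_code, rounds_used = self_refine(stub_llm, "Write is_even(n).", "is_even",
                                      [((2,), True), ((3,), False)])
```

When unit tests are unavailable, the `run_tests` step would be replaced by a model-generated explanation or critique of the code, at the cost of a weaker feedback signal.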

Figure 7. An illustration of the self-improving code generation pipeline using prompts for large language models (LLMs). This process incorporates iterative self-refinement by integrating execution outcomes and includes an optional self-reflection mechanism to enhance generation quality.

Distinct from the aforementioned methods, CodeT (Chen et al., 2022b) and LEVER (Ni et al., 2023a) prompt LLMs to generate numerous code samples, which are then re-ranked based on execution outcomes to select the optimal solution. Notably, these approaches do not incorporate a self-refinement step to further improve code generation.
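A simplified stand-in for execution-based re-ranking is shown below: each sampled candidate is scored by how many test cases it passes, and the candidates are sorted by that score. The helpers `passed_count` and `rerank` are hypothetical, and the actual scoring used by CodeT and LEVER is more involved than a raw pass count.

```python
from typing import List, Tuple

Case = Tuple[tuple, object]  # (positional arguments, expected return value)


def passed_count(code: str, entry_point: str, cases: List[Case]) -> int:
    """Count passed cases; a candidate that fails to load scores zero."""
    namespace: dict = {}
    try:
        exec(code, namespace)            # NOTE: sandbox untrusted code in practice
        func = namespace[entry_point]
    except Exception:
        return 0
    passed = 0
    for args, expected in cases:
        try:
            passed += int(func(*args) == expected)
        except Exception:
            pass                         # a crashing case counts as failed
    return passed


def rerank(candidates: List[str], entry_point: str, cases: List[Case]) -> List[Tuple[int, str]]:
    """Sort candidates by execution score, best first (stable for ties)."""
    scored = [(passed_count(c, entry_point, cases), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    "def absval(x):\n    return x\n",
    "def absval(x):\n    return -x if x < 0 else x\n",
    "def absval(x) return x",
]
ranked = rerank(candidates, "absval", [((3,), 3), ((-2,), 2), ((0,), 0)])
```

Because every candidate is generated independently, this strategy parallelizes well, but its quality ceiling is set by the best sample in the pool.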

4.7. Repository Level & Long Context

In contemporary software engineering practices, modifications to a code repository are widespread and encompass a range of activities, including package migration, temporary code edits, and the resolution of GitHub issues. While large language models (LLMs) showcase impressive prowess in function-level code generation, they often falter when grappling with the broader context inherent to a repository, such as import dependencies, parent classes, and files bearing similar names. These deficiencies result in suboptimal performance in repository-level code generation, as identified in recent studies (Shrivastava et al., 2023b, a). The challenges faced by LLMs in this domain are primarily due to the following factors:

  • Code repositories typically contain intricate interdependencies scattered across various files, including shared utilities, configurations, and cross-API invocations, which arise from modular design principles (Zhang et al., 2023c; Bairi et al., 2023).

  • Repositories are characterized by their unique structures, naming conventions, and coding styles, which are essential for maintaining clarity and facilitating ongoing maintenance (Chen and Abedjan, 2023).

  • The vast context of an entire repository often exceeds the context length limitations of LLMs, thus hindering their ability to integrate comprehensive contextual information (Bairi et al., 2023).

  • LLMs may not have been adequately trained on extensive sets of repository data, such as proprietary software or projects that are still in development (Shrivastava et al., 2023a).

Given that the scope of a typical software repository encompasses hundreds of thousands of tokens, it is imperative to enhance the capacity of LLMs to handle extensive contexts when they are employed for repository-level code generation. Fortunately, recent advancements in positional encoding techniques, such as ALiBi (Press et al., 2021) and RoPE (Su et al., 2024a), have shown promise in improving the Transformer’s ability to generalize from shorter training sequences to longer inference sequences (Zhao et al., 2023a). This progress addresses the third challenge mentioned above to a certain degree, thereby enabling better contextualization of coding activities within full repositories.

To further refine LLMs for repository-level code completion, several innovative approaches have been introduced. RepoCoder (Zhang et al., 2023c) leverages a similarity-based retrieval system within an iterative retrieval-generation paradigm to enrich the context and enhance code completion quality. In a similar vein, CoCoMIC (Ding et al., 2022b) employs a cross-file context finder named CCFINDER to pinpoint and retrieve the most relevant cross-file contexts within a repository. RepoHyper (Phan et al., 2024) introduces a semantic graph structure, termed RSG, to encapsulate the expansive context of code repositories and uses an “Expand and Refine” retrieval method to obtain relevant code snippets. Moreover, a framework known as RLPG (Shrivastava et al., 2023b) has been proposed to generate repository-level prompts that integrate the repository’s structure with the relevant context across all files. However, the constant reliance on retrieval mechanisms has raised concerns regarding efficiency and robustness, as some retrieved contexts may prove unhelpful or harmful. In response, Repoformer (Wu et al., 2024) introduces a selective Retrieval-Augmented Generation (RAG) framework that judiciously bypasses retrieval when it is deemed redundant. This approach incorporates a self-supervised learning strategy that equips a code LLM with the ability to perform a self-assessment on the utility of retrieval for enhancing the quality of its output, thereby effectively utilizing potentially noisy retrieved contexts.
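Similarity-based retrieval over a repository can be illustrated with a small sketch that splits files into line windows and ranks them by Jaccard similarity to the unfinished code. The window size, the identifier-level tokenizer, the toy repository, and the function names are assumptions for illustration and do not reproduce the retriever of any system above.

```python
import re
from typing import Dict, List, Tuple


def tokenize(code: str) -> set:
    """Set of identifier-like tokens in a code fragment."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def windows(source: str, size: int) -> List[str]:
    """Split a file into consecutive, non-overlapping windows of `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def retrieve(repo: Dict[str, str], query: str, top_k: int = 1,
             size: int = 3) -> List[Tuple[float, str, str]]:
    """Rank code windows from all files by Jaccard similarity to the query."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(w)), path, w)
              for path, source in repo.items() for w in windows(source, size)]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]


repo = {
    "utils/io.py": "def load_config(path):\n    with open(path) as f:\n        return json.load(f)\n",
    "utils/math.py": "def clamp(value, low, high):\n    return max(low, min(value, high))\n",
}
hits = retrieve(repo, "config = load_config(path)")
```

In an iterative retrieval-generation loop, the model's draft completion would be appended to the query for the next retrieval round, so that later rounds can match code the draft is likely to need.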

Additionally, RepoFusion (Shrivastava et al., 2023a) has been developed to train models to combine multiple relevant contexts from a repository, aiming to produce more precise and context-aware code completions. In a novel approach, Microsoft’s CodePlan (Bairi et al., 2023) frames repository-level coding tasks as a planning problem, generating a multi-step chain of edits (plan) where each step involves invoking an LLM on a specific code location, considering context from the entire repository, preceding code modifications, and task-specific instructions.

Advancing the state-of-the-art, (Zan et al., 2024) tackles the formidable challenge of NL2Repo, an endeavor that seeks to create a complete code repository from natural language requirements. To address this complex task, they introduce the CodeS framework, which strategically breaks down NL2Repo into a series of manageable sub-tasks using a multi-layer sketch approach. The CodeS framework comprises three distinct modules: 1) RepoSketcher, for creating a directory structure of the repository based on given requirements; 2) FileSketcher, for sketching out each file within that structure; and 3) SketchFiller, for fleshing out the specifics of each function within the file sketches (Zan et al., 2024).

Accordingly, a surge of benchmarks tailored for repository-level code generation has emerged, such as RepoEval (Zhang et al., 2023c), Stack-Repo (Shrivastava et al., 2023a), Repobench (Liu et al., 2023b), EvoCodeBench (Li et al., 2024), SWE-bench (Jimenez et al., 2023), CrossCodeEval (Ding et al., 2024), and SketchEval (Zan et al., 2024). The detailed statistics and comparisons of these benchmarks are presented in Table 3.

Despite the progress made by these methods in repository-level code generation, significant challenges remain to be addressed. Developers are often required to invest considerable time in editing and debugging (Vaithilingam et al., 2022; Mozannar et al., 2022; Shrivastava et al., 2023a; Barke et al., 2023; Bird et al., 2022). However, the advent of LLM-powered coding agents, such as AutoCodeRover (Zhang et al., 2024), SWE-Agent (John Yang, 2024), and OpenDevin (OpenDevin, 2024), has demonstrated their potential to tackle complex problems, paving the way for future exploration in this field (for more details, see Section 4.9).

4.8. Retrieval Augmented

Figure 8. A workflow illustration of the Retrieval-Augmented Code Generation (RACG). Upon receiving a query (instruction), the retriever selects the relevant contexts from a large-scale vector database. Subsequently, the retrieved contexts are merged with the query, and this combined input is fed into the generator (LLM) to produce the target code solution.

Large Language Models (LLMs) have exhibited impressive capabilities but are hindered by several critical issues such as hallucination (Liang et al., 2023; Zhang et al., 2023e), obsolescence of knowledge (Jang et al., 2022), and non-transparent (Bommasani et al., 2021), untraceable reasoning processes (Zhou et al., 2022b; Wei et al., 2022b; Huang and Chang, 2023; Gao et al., 2023b). While techniques like instruction-tuning (see Section 4.4) and reinforcement learning with feedback (see Section 4.5) mitigate these issues, they also introduce new challenges, such as catastrophic forgetting and the requirement for substantial computational resources during training (Ovadia et al., 2023; Gupta et al., 2024).

Recently, Retrieval-Augmented Generation (RAG) has emerged as an innovative approach to overcoming these limitations by integrating knowledge from external databases. Formally defined, RAG denotes a model that, in response to queries, initially sources relevant information from an extensive corpus of documents, and then leverages this retrieved information in conjunction with the original query to enhance the response’s quality and accuracy, especially for knowledge-intensive tasks. The RAG framework typically consists of a vector database, a retriever, a re-ranker, and a generator. It is commonly implemented using tools such as LangChain (https://www.langchain.com), which facilitates the development of LLM-powered applications, and LlamaIndex (https://www.llamaindex.ai), a data framework for building LLM applications. By continuously updating the database and incorporating domain-specific data, RAG circumvents the need for re-training LLMs from scratch (Gao et al., 2023b). Consequently, RAG has substantially advanced LLM performance across a variety of tasks (Lewis et al., 2020; Chen et al., 2024).

Due to the nature of code, code LLMs are also susceptible to the aforementioned issues that affect general-purpose LLMs. For instance, they may exhibit a hallucination phenomenon when instructions fall outside the scope of their training data or necessitate the latest programming packages. Given the dynamic nature of publicly available source-code libraries like PyTorch, which undergo frequent expansion and updates, deprecated calling methods can become a significant challenge. If Code LLMs are not updated in tandem with the latest functions and APIs, this can introduce potential errors and safety risks. Retrieval-Augmented Code Generation (RACG) stands as a promising solution to these concerns. A workflow illustration of the RACG is depicted in Figure 8.
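The RACG workflow of retrieving contexts and merging them with the query can be sketched without any external framework. The bag-of-words "embedding", the prompt layout, and the library name `toylib` with its documentation strings are all made up for illustration; a real system would use a learned encoder, a vector database, and possibly a re-ranker.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(re.findall(r"[a-z_0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]


def build_prompt(query: str, contexts: List[str]) -> str:
    """Merge retrieved contexts with the query before calling the generator."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Relevant documentation:\n{joined}\n\nTask: {query}\nCode:"


# Documentation for a hypothetical library; the entries are invented.
docs = [
    "toylib.read_table(path) loads a table from disk; read_csv is deprecated.",
    "toylib.plot_line(xs, ys) draws a line chart.",
]
query = "load a table from disk with toylib"
contexts = retrieve(query, docs)
prompt = build_prompt(query, contexts)
```

Because the retrieved entry states that `read_csv` is deprecated, the generator sees up-to-date usage in its prompt even if its training data predates the change, which is the scenario motivating RACG above.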

Despite its potential, the adoption of RAG for code generation remains limited. Drawing inspiration from the common practice among programmers of referencing related code snippets, (Liu et al., 2020) introduced a novel retrieval-augmented mechanism with graph neural networks (GNNs), termed HGNN, which unites the advantages of similar-example retrieval with the generalization capabilities of generative models for code summarization, the reverse process of code generation. (Parvez et al., 2021) pioneered a retrieval-augmented framework named REDCODER for code generation by retrieving and integrating relevant code snippets from a source-code database, thereby providing supplementary context for the generation process. Subsequently, a retrieval-augmented code completion framework termed ReACC (Lu et al., 2022) was proposed to leverage both lexical copying and semantic referencing of related code, achieving state-of-the-art performance on the CodeXGLUE benchmark (Lu et al., 2021). In the spirit of how programmers often consult textual resources such as code manuals and documentation to comprehend functionalities, DocPrompting (Zhou et al., 2022a) explicitly utilizes code documentation by retrieving the relevant documentation pieces based on a natural language query and then generating the target code by blending the query with the retrieved information.

More recently, RepoCoder (Zhang et al., 2023c), an iterative retrieval-generation framework, was proposed to enhance repository-level code completion by effectively utilizing code analogies across different files within a repository to inform and improve code suggestions. Furthermore, breaking away from reliance on a single source of retrieval, (Su et al., 2024b) developed a multi-faceted “knowledge soup” that integrates web searches, documentation, execution feedback, and evolved code snippets. It then incorporates an active retrieval strategy that iteratively refines the query and enriches the knowledge soup, expanding the scope of information available for code generation.

Despite these advancements, several limitations in retrieval-augmented code generation warrant further exploration: 1) the quality of the retrieved information significantly impacts overall performance; 2) the effective integration of retrieved code information with the query needs optimization; 3) an over-reliance on retrieved information may lead to inadequate responses that fail to address the query’s intent; 4) additional retrieved information necessitates larger context windows for the LLM, resulting in increased computational demands.

4.9. Autonomous Coding Agents

The advent of large language models (LLMs) has marked the beginning of a new era of potential pathways toward artificial general intelligence (AGI), capturing significant attention in both academia and industry (Xi et al., 2023; Weng, 2023; Wang et al., 2024b; Huang et al., 2024). A rapidly expanding array of applications for LLM-based autonomous agents, including AutoGPT (aut, 2023), AgentGPT (age, 2023), BabyAGI (bab, 2023), and AutoGen (Wu et al., 2023), underlines the promise of this technology.

LLM-powered autonomous agents are systems endowed with sophisticated reasoning abilities, leveraging an LLM as a central computational engine or controller. This allows them to formulate and execute problem-solving plans through a series of tool-enabled functions or API calls. Moreover, these agents are designed to function within a shared environment where they can communicate and engage in cooperative, competitive, or negotiating interactions (Huang et al., 2023; Wu et al., 2023; Wang et al., 2024b). The typical architecture of such an agent encompasses an LLM-based Agent, a memory module, a planning component, and a tool utilization module, as depicted in Figure 9.

Figure 9. The general architecture of an LLM-powered autonomous agent system, adapted from (Weng, 2023). Planning: The agent decomposes large tasks into smaller, manageable sub-goals or engages in self-criticism and self-reflection on past actions to learn from mistakes and improve future performance. Memory: This component enables the agent to store and retrieve past information. Tools: The agent is trained to invoke external functions or APIs. Action: The agent executes actions, with or without the use of tools, to interact with the environment. The gray dashed lines represent the data flow within the system.
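The interplay of planning, memory, tools, and action can be sketched as a small control loop. This is only an illustration under strong assumptions: the LLM planner is replaced by a deterministic stub that parses a `tool: argument; tool: argument` string, and the tool registry, function names, and task format are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; real agents would expose APIs, shells, or search here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "upper": lambda s: s.upper(),
    "reverse": lambda s: s[::-1],
}

def stub_planner(task: str) -> List[Tuple[str, str]]:
    """Stand-in for the LLM planner: split the task into (tool, argument) sub-goals."""
    steps = []
    for part in task.split(";"):
        name, arg = part.strip().split(":", 1)
        steps.append((name.strip(), arg.strip()))
    return steps

def run_agent(task: str) -> List[str]:
    """Plan, act with tools, and store every observation in memory."""
    memory: List[str] = []                      # Memory: past observations
    for tool_name, arg in stub_planner(task):   # Planning: iterate over sub-goals
        tool = TOOLS.get(tool_name)             # Tools: look up the requested function
        if tool is None:
            observation = f"error: unknown tool {tool_name!r}"
        else:
            observation = tool(arg)             # Action: invoke the tool
        memory.append(observation)
    return memory

print(run_agent("upper: abc; reverse: xyz"))
```

In a full agent, the planner would also read the memory before choosing the next step, which is what enables self-reflection on earlier mistakes.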

In the realm of automated code generation, LLM-powered autonomous agents have demonstrated remarkable proficiency. For instance, AgentCoder (Huang et al., 2023) achieved a pass@1 of 96.3% on the HumanEval benchmark, a step toward the future of automated software development (Ishibashi and Nishimura, 2024). AgentCoder is a multi-agent framework composed of three specialized agents with distinct roles: a programmer agent responsible for code generation, a test designer agent tasked with generating unit test cases, and a test executor agent that executes the code and provides feedback. This division of labor promotes more efficient and effective code generation. The meta-programming framework MetaGPT (Hong et al., 2023) integrates human workflow efficiencies into LLM-based multi-agent collaboration. CodeAct (Wang et al., 2024a) distinguishes itself by using executable Python code to consolidate LLM agent actions within a unified action space, in contrast to generating JSON or textual formats. Additionally, AutoCodeRover (Zhang et al., 2024) was proposed to autonomously resolve GitHub issues for program enhancement.
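The programmer, test designer, and test executor roles can be illustrated with a small feedback loop. The sketch below is not AgentCoder's implementation: the programmer is a hypothetical stub that returns a buggy solution first and a corrected one once it receives feedback, and the test designer's output is represented by a fixed list of assertion strings.

```python
from typing import Callable, List, Optional, Tuple

def execute_tests(code: str, tests: List[str]) -> Optional[str]:
    """Test executor: run the code, then each test; return feedback, or None if all pass.

    Note: exec on untrusted model output should be sandboxed in practice.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def refine_loop(
    programmer: Callable[[Optional[str]], str],
    tests: List[str],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate between generation and execution until the tests pass."""
    feedback: Optional[str] = None
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = programmer(feedback)            # programmer agent (stubbed)
        feedback = execute_tests(code, tests)  # test executor agent
        if feedback is None:
            return code, True, round_no
    return code, False, max_rounds

def stub_programmer(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: fixes its bug once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"

# Hypothetical output of the test designer agent.
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
code, passed, rounds = refine_loop(stub_programmer, tests)
```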

To address the complexity of tasks within software engineering, two autonomous AI software engineers, Devin (Cognition, 2024; https://www.cognition.ai/introducing-devin) and OpenDevin (OpenDevin, 2024; https://github.com/OpenDevin/OpenDevin), have been released and have rapidly garnered considerable interest within the software engineering (SE) and artificial general intelligence (AGI) communities. Subsequently, an autonomous system, SWE-agent (John Yang, 2024), leverages a language model to interact with a computer to address software engineering tasks, successfully resolving 12.5% of issues on the SWE-bench benchmark (Jimenez et al., 2023). L2MAC (Holt et al., 2023) has been introduced as the first practical, LLM-based, multi-agent, general-purpose stored-program automatic computer that utilizes a von Neumann architecture, designed specifically for the generation of long and consistent outputs. At the time of writing this survey, OpenDevin has enhanced CodeAct with bash command-based tools, leading to the release of OpenDevin CodeAct 1.0 (Xingyao Wang and Neubig, 2024), which sets a new state-of-the-art performance on the SWE-Bench Lite benchmark (Jimenez et al., 2023).

Despite these remarkable advancements, the journey toward fully realized AI software engineers employing LLM-powered autonomous agents is far from complete (Xi et al., 2023; Wang et al., 2024b). Critical aspects such as prompt design, context length, agent count, and toolsets call for further refinement and optimization, especially as problem complexities escalate (Ishibashi and Nishimura, 2024).

4.10. Evaluation

Table 6. The performance comparison of LLMs for code generation on the HumanEval (Chen et al., 2021) benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.
Model Size pass@1 pass@10 pass@100
GPT-4 (Achiam et al., 2023) - 84.1 - -
GPT-3.5-Turbo (OpenAI, 2022) - 76.2 - -
Claude-3-Opus (Anthropic, 2024) - 82.9 - -
Claude-3-Haiku (Anthropic, 2024) - 76.8 - -
Claude-3-Sonnet (Anthropic, 2024) - 70.7 - -
StarCoder2-Instruct (Yuxiang Wei, 2024) 15.5B 72.6 - -
Llama3 (Meta, 2024) 70B 81.7 - -
CodeGemma (CodeGemma Team et al., 2024) 7B 44.5 - -
StarCoder 2 (Lozhkov et al., 2024) 15B 46.3 - -
phi-2 (Mojan Javaheripi, 2023) 2.7B 49.4 - -
WaveCoder (Yu et al., 2023) 6.7B 75 - -
StableCode (Pinnaparaju et al., 2024) 3B 29.3 - -
CodeShell (Xie et al., 2024) 7B 34.32 - -
CodeQwen (Team, 2024) 14B 45.1 - -
DeepSeek-Coder (Guo et al., 2024) 33B 56.1 - -
replit-code (Replit, 2023) 3B 20.12 - -
Phi-1.5 (Li et al., 2023b) 1.3B 41.4 - -
PanGu-Coder2 (Shen et al., 2023) 15B 61.64 79.55 91.75
WizardCoder (Luo et al., 2023) 15B 57.3 73.2 90.46
CodeFuse (Liu et al., 2023a) 34B 74.4 - -
Phi-1 (Gunasekar et al., 2023) 1.3B 50.6 - -
Code Llama (Roziere et al., 2023) 34B 48.8 76.8 93.0
OctoCoder (Muennighoff et al., 2023) 15.5B 46.2 - -
PaLM-Coder (Chowdhery et al., 2023) 540B 36 - 88.4
CodeGeeX2 (Zheng et al., 2023c) 6B 35.9 62.6 88.3
InstructCodeT5+ (Wang et al., 2023b) 16B 35.0 54.5 77.9
CodeGen-NL (Nijkamp et al., 2022) 16.1B 14.24 23.46 38.33
CodeGen-Multi (Nijkamp et al., 2022) 16.1B 18.32 32.07 50.8
CodeGen-Mono (Nijkamp et al., 2022) 16.1B 29.28 49.86 75
StarCoder (Li et al., 2023a) 15B 33.60 45.78 79.82
CodeT5+ (Wang et al., 2021) 16B 30.9 51.6 76.7
LLaMA2 (Touvron et al., 2023b) 70B 30.5 59.4 87.0
Codex (Chen et al., 2021) 12B 28.81 46.81 72.31
PaLM (Chowdhery et al., 2023) 540B 26.2 - 76.2
PanGu-Coder (Christopoulou et al., 2022) 2.6B 23.78 35.36 51.24
LLaMA (Touvron et al., 2023a) 65B 23.7 - 79.3
CodeGeeX (Zheng et al., 2023c) 13B 22.89 39.57 60.92
Replit (Replit, 2016) 3B 21.9 - -
CodeGen2 (Nijkamp et al., 2023) 16B 20.46 36.5 56.71
SantaCoder (Allal et al., 2023) 1.1B 18 29 49
AlphaCode (Li et al., 2022a) 1.1B 17.1 28.2 45.3
BLOOM (Le Scao et al., 2023) 176B 15.52 32.20 55.45
GPT-NeoX (Black et al., 2022) 20B 15.4 25.6 41.2
InCoder (Fried et al., 2022) 6.7B 15.2 27.8 47.0
GPT-J (Wang and Komatsuzaki, 2021) 6B 11.62 15.74 27.74
PyCodeGPT (Zan et al., 2022) 110M 8.33 13.36 19.13
GPT-Neo (Black et al., 2021) 2.7B 6.41 11.27 21.37
PolyCoder (Xu et al., 2022) 2.7B 5.59 9.84 17.68
JuPyT5 (Chandel et al., 2022) 300M 5.4 15.46 25.60
CodeParrot (Tunstall et al., 2022) 1.5B 3.99 8.69 17.88

Despite the impressive capabilities of large language models (LLMs), they exhibit a range of behaviors that are both beneficial and potentially risky. These behaviors can enhance performance across various downstream tasks but may also introduce reliability and trustworthiness concerns in LLM deployment (Chen et al., 2021; Xu et al., 2022; Chang et al., 2024). Consequently, it is imperative to develop precise evaluation approaches to discern the qualitative and quantitative differences between models, thereby encouraging further advancements in LLM capabilities.

Evaluation strategies for LLMs in code generation mirror those for general-purpose LLMs and can be divided into three principal categories: metrics-based, human-centered, and LLM-based approaches. Detailed benchmarks for these evaluation strategies are presented in Section 4.1.3 and summarized in Table 3. Subsequent subsections will provide a thorough analysis of each approach.

4.10.1. Metrics

The pursuit of effective and reliable automatic evaluation metrics for generated content is a long-standing challenge within the field of natural language processing (NLP) (Chen et al., 1998; Papineni et al., 2002; Lin, 2004). At the early stage, most works directly leverage token-matching-based metrics, such as Exact Match, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which are prevalent in text generation of NLP, to assess the quality of code generation.
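To make token matching concrete, the sketch below implements Exact Match and a clipped n-gram precision, which is one ingredient of BLEU rather than the full metric (no brevity penalty and no averaging over n-gram orders). Whitespace tokenization and the function names are simplifying assumptions.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> bool:
    """True when the whitespace-tokenized sequences are identical."""
    return pred.split() == ref.split()

def ngram_precision(pred: str, ref: str, n: int = 2) -> float:
    """Clipped n-gram precision of pred against a single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts, ref_counts = ngrams(pred.split()), ngrams(ref.split())
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total

reference = "return a + b"
equivalent = "return b + a"   # same behavior for numbers, different token order
wrong = "return a - b"        # different behavior, similar surface form

print(ngram_precision(equivalent, reference, n=2))  # 0.0
print(ngram_precision(wrong, reference, n=2))       # one of three bigrams matches
```

In this toy case the functionally wrong candidate obtains a higher bigram precision than the functionally equivalent one, which shows how surface overlap can diverge from correctness.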

While these metrics offer a rapid and cost-effective approach for assessing the quality of generated code, they often fall short of capturing the syntactic and functional correctness, as well as the semantic features, of the code. To address this limitation, CodeBLEU (Ren et al., 2020) was introduced, enhancing the traditional BLEU metric (Papineni et al., 2002) by incorporating syntactic information through abstract syntax trees (ASTs) and semantic understanding via data-flow graphs (DFGs). Despite these improvements, the metric does not fully resolve issues pertaining to execution errors or discrepancies in the execution results of the generated code. In light of these challenges, execution-based metrics have gained prominence for evaluating code generation, including pass@k (Chen et al., 2021), n@k (Li et al., 2022a), test case average (Hendrycks et al., 2021), execution accuracy (Rajkumar et al., 2022), and pass@t (Olausson et al., 2023). In particular, pass@k, serving as a principal evaluation metric, assesses the probability that at least one out of $k$ code samples generated by a model will pass all unit tests. An unbiased estimator for pass@k introduced by (Chen et al., 2021) is defined as:

\[ \texttt{pass@k} \coloneqq \mathbb{E}_{\text{task}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \tag{19} \]

where $n$ is the total number of candidate code solutions sampled for each programming problem, $k$ is the number of solutions randomly selected from these candidates, with $n \geq k$, and $c$ is the number of correct samples among the $n$ candidates, i.e., those that pass all unit tests. The ratio $\binom{n-c}{k} / \binom{n}{k}$ is the probability that all $k$ selected samples are incorrect. Tables 6 and 7 illustrate the performance of contemporary large language models (LLMs) for code generation, measured by the pass@k metric for $k \in \{1, 10, 100\}$ on the HumanEval and MBPP benchmarks, respectively.
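A minimal sketch of this estimator follows, taking $c$ as the number of correct samples among the $n$ generated ones, as in (Chen et al., 2021). The function names and the example counts are illustrative assumptions and are not taken from any table in this survey.

```python
from math import comb
from typing import List, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average over problems, i.e. the expectation over tasks in Eq. (19)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical (n, c) pairs for three problems, each with n = 10 samples.
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
```

For $k = 1$ the estimate reduces to $c / n$, the fraction of correct samples, which is why pass@1 is often read as a plain success rate.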

Nevertheless, these execution-based methods are heavily dependent on the quality of unit tests and are limited to evaluating executable code (Zan et al., 2023). Consequently, when unit tests are unavailable, token-matching-based metrics are often employed as an alternative for evaluation. Furthermore, in scenarios lacking a ground truth label, unsupervised metrics such as perplexity (PPL) (Jelinek et al., 1977) can serve as evaluative tools. Perplexity quantifies an LLM’s uncertainty in predicting new content, thus providing an indirect measure of the model’s generalization capabilities and the quality of the generated code.
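Perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes that per-token natural-log probabilities have already been obtained from a model; the function name and the input values are illustrative.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens of a sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical model output: each of four tokens was assigned probability 0.5.
print(perplexity([math.log(0.5)] * 4))  # approximately 2.0
```

A perplexity of 2 means the model was, on average, as uncertain as if it were choosing uniformly between two tokens at every step; lower values indicate less uncertainty.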

Taken together, while the aforementioned methods primarily focus on the functional correctness of code, they do not provide a holistic evaluation that encompasses other critical dimensions such as code vulnerability (Nappa et al., 2015), maintainability (Ardito et al., 2020), readability (Buse and Weimer, 2009), complexity and efficiency (Peitek et al., 2021), stylistic consistency (Markovtsev et al., 2019), and execution stability (Raemaekers et al., 2012). A comprehensive evaluation framework that integrates these aspects remains an open area for future research and development in the field of code generation assessment.

4.10.2. Human Evaluation

Given the intrinsic characteristics of code, the aforementioned automatic evaluation metrics are inherently limited in their capacity to fully assess code quality. For instance, metrics specifically designed to measure code style consistency are challenging to develop and often fail to capture this aspect adequately (Chen and Abedjan, 2023). When it comes to repository-level code generation, the evaluation of overall code quality is substantially complicated due to the larger scale of the task, which involves cross-file designs and intricate internal as well as external dependencies, as discussed by (Bairi et al., 2023; Shrivastava et al., 2023a).

To overcome these challenges, conducting human evaluations becomes necessary, as it yields relatively robust and reliable results. Human assessments also offer greater adaptability across various tasks, enabling the simplification of complex and multi-step evaluations. Moreover, human evaluations are essential for demonstrating the effectiveness of certain token-matching-based metrics, such as CodeBLEU (Ren et al., 2020). Studies proposing such metrics typically measure the correlation coefficient between the proposed metric and quality scores assigned by actual users, in order to demonstrate its superiority over existing metrics.
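Such a correlation check can be sketched as follows. The example uses Spearman's rank correlation, which is one common choice; the source does not specify which coefficient individual studies use, and the metric and human scores below are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: automatic metric scores and human ratings for five outputs.
metric_scores = [0.21, 0.35, 0.50, 0.62, 0.90]
human_ratings = [1, 2, 2, 4, 5]
print(spearman(metric_scores, human_ratings))
```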

Table 7. The performance comparison of LLMs for code generation on the MBPP (Austin et al., 2021) benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.
Model Size pass@1 pass@10 pass@100
GPT-3.5-Turbo (OpenAI, 2022) - 52.2 - -
Claude-3-Opus (Anthropic, 2024) - 89.4 - -
Claude-3-Haiku (Anthropic, 2024) - 80.2 - -
Claude-3-Sonnet (Anthropic, 2024) - 83.6 - -
StarCoder2-Instruct (Yuxiang Wei, 2024) 15.5B 78 - -
CodeGemma (CodeGemma Team et al., 2024) 7B 65.1 - -
StarCoder 2 (Lozhkov et al., 2024) 15B 50.6 - -
phi-2 (Mojan Javaheripi, 2023) 2.7B 64 - -
WaveCoder (Yu et al., 2023) 6.7B 74.9 - -
CodeFuse (Liu et al., 2023a) 34B 61.0 - -
CodeQwen (Team, 2024) 14B 51.4 - -
DeepSeek Coder (Guo et al., 2024) 33B 66.0 - -
Phi-1.5 (Li et al., 2023b) 1.3B 43.5 - -
WizardCoder (Luo et al., 2023) 16B 51.8 - -
StarCoder (Li et al., 2023a) 5.5B 52.7 - -
SantaCoder (Allal et al., 2023) 1.1B 3.65 21.33 41.92
PyCodeGPT (Zan et al., 2022) 110M 9.39 28.37 48.71
PolyCoder (Xu et al., 2022) 2.7B 4.39 17.99 38.17
phi-1 (Gunasekar et al., 2023) 1.3B 55.5 - -
PaLM-Coder (Chowdhery et al., 2023) 540B 47 - -
PaLM (Chowdhery et al., 2023) 540B 36.8 - -
LLaMA (Touvron et al., 2023a) 65B 37.7 - -
LLaMA 2 (Touvron et al., 2023b) 70B 45.4 66.2 83.1
CodeT5+ (Wang et al., 2021) 16B 56.6 - -
InCoder (Fried et al., 2022) 6.7B 21.3 46.5 66.2
GPT-Neo (Black et al., 2021) 2.7B 5.89 23.09 44.26
GPT-J (Wang and Komatsuzaki, 2021) 6B 11.30 35.62 53.63
CodeT5 (Wang et al., 2021) 770M 15.78 38.63 50.35
CodeParrot (Tunstall et al., 2022) 1.5B 1.29 8.66 27.17
Code Llama (Roziere et al., 2023) 34B 55 76.2 86.6
CodeGen-NL (Nijkamp et al., 2022) 16.1B 10.92 38.43 62.76
CodeGen-Multi (Nijkamp et al., 2022) 16.1B 20.94 51.61 70.02
CodeGen-Mono (Nijkamp et al., 2022) 16.1B 35.28 67.32 80.09
CodeGeeX (Zheng et al., 2023c) 13B 24.4 48 -
BLOOM (Le Scao et al., 2023) 1.7B 3.16 14.23 31.38
PanGu-Coder (Christopoulou et al., 2022) 2.6B 23.0 43.60 59.64
CodeGeeX2 (Zheng et al., 2023c) 6B 24.37 47.95 -

Moreover, in an effort to better align large language models (LLMs) with human preferences and intentions, InstructGPT (Ouyang et al., 2022) employs human-written prompts and demonstrations, and model output ranking in the fine-tuning of LLMs using reinforcement learning from human feedback (RLHF). Although similar alignment learning techniques have been applied to code generation, the feedback in this domain typically comes from a compiler or interpreter, which offers execution feedback, rather than from human evaluators. Notable examples include CodeRL (Le et al., 2022), PPOCoder (Shojaee et al., 2023), RLTF (Liu et al., 2023d), and PanGu-Coder2 (Shen et al., 2023). Further information on this topic is available in Section 4.5.
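Execution feedback of this kind is commonly turned into a scalar reward. The sketch below grades a candidate by how far it gets (compilation, execution, passing tests); the specific reward values and the function name are illustrative assumptions and are not claimed to match any of the cited methods.

```python
from typing import List

def execution_reward(code: str, tests: List[str]) -> float:
    """Map compiler/interpreter feedback to a scalar reward (values are illustrative)."""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return -1.0   # does not compile
    namespace: dict = {}
    try:
        exec(compiled, namespace)   # untrusted code should be sandboxed in practice
        for test in tests:
            exec(test, namespace)
    except AssertionError:
        return -0.3   # runs, but fails a unit test
    except Exception:
        return -0.6   # runtime error
    return 1.0        # passes all unit tests

tests = ["assert inc(1) == 2"]
print(execution_reward("def inc(x):\n    return x + 1", tests))
```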

Nonetheless, human evaluations are not without drawbacks, as they can be prone to issues that compromise their accuracy and consistency. For instance, 1) personalized tastes and varying levels of expertise among human evaluators can introduce biases and inconsistencies into the evaluation process; 2) conducting comprehensive and reliable human evaluations often necessitates a substantial number of evaluators, making the process expensive and time-consuming; 3) the reproducibility of human evaluations is often limited, which presents challenges in extending previous evaluation outcomes or monitoring the progress of LLMs, as highlighted by (Zhao et al., 2023b).

4.10.3. LLM-as-a-Judge

The powerful instruction-following capabilities of large language models (LLMs) have stimulated researchers to investigate the potential of LLM-based evaluations. LLM-as-a-Judge (Zheng et al., 2024a) refers to the application of advanced proprietary LLMs (e.g., GPT-4, Gemini, and Claude 3) as proxies for human evaluators. This involves designing prompts with specific requirements to guide LLMs in conducting evaluations, as demonstrated by AlpacaEval (Li et al., 2023c) and MT-bench (Zheng et al., 2024a). This method reduces reliance on human participation, thereby facilitating more efficient and scalable evaluations. Moreover, LLMs can offer insightful explanations for the assigned rating scores, thereby augmenting the interpretability of evaluations (Zhao et al., 2023b).

Nevertheless, the use of LLM-based evaluation for code generation remains relatively underexplored compared with its use for general-purpose LLMs. A recent work (Zhuo, 2024) introduces the ICE-Score evaluation metric, which instructs an LLM to perform code assessments. This approach attains superior correlations with functional correctness and human preferences without requiring test oracles or references. As the capabilities of LLMs continue to improve, we anticipate seeing more research in this direction.

Despite their scalability and explainability, the effectiveness of LLM-based evaluation is constrained by the inherent limitations of the chosen LLM. Several studies have shown that most LLMs, including GPT-4, suffer from several issues, including position, verbosity, and self-enhancement biases, as well as restricted reasoning ability (Zheng et al., 2024a). Specifically, position bias refers to the tendency of large language models (LLMs) to disproportionately favor responses that are presented in certain positions, which can skew the perceived quality of answers based on their order of presentation. Meanwhile, verbosity bias describes the inclination of LLMs to prefer lengthier responses, even when these are not necessarily of higher quality compared to more concise ones. Self-enhancement bias, on the other hand, is observed when LLMs consistently overvalue the quality of the text they generate (Zheng et al., 2024a; Zhao et al., 2023b). Moreover, due to their inherent limitations in tackling complex reasoning challenges, LLMs may not be entirely reliable as evaluators for tasks that require intensive reasoning, such as those involving mathematical problem-solving. However, these shortcomings can be partially addressed through the application of deliberate prompt engineering and fine-tuning techniques, as suggested by (Zheng et al., 2024a).
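One simple mitigation for position bias in pairwise judging is to query the judge twice with the answer order swapped and to count a win only when both verdicts agree. The sketch below assumes a judge callable that returns "first", "second", or "tie"; the stub judges stand in for LLM calls and are hypothetical.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]

def debiased_compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie", requiring agreement across both presentation orders."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"   # inconsistent or tied verdicts are not counted as a win

def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub exhibiting pure position bias: always prefers the first answer."""
    return "first"

def content_judge(question: str, first: str, second: str) -> str:
    """Stub that prefers whichever answer contains a return statement."""
    if "return" in first and "return" not in second:
        return "first"
    if "return" in second and "return" not in first:
        return "second"
    return "tie"

print(debiased_compare(position_biased_judge, "q", "x", "y"))      # tie
print(debiased_compare(content_judge, "q", "return 1", "pass"))    # A
```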

Table 8. An overview of code assistant applications powered by large language models (LLMs). The columns labeled ‘PLs’ and ‘IDEs’ indicate programming languages and integrated development environments, respectively (Zan et al., 2023).
Institution | Product | Model | Supported Features | Supported PLs | Supported IDEs
GitHub & OpenAI | GitHub Copilot (Chen et al., 2021) | Codex | Code Completions, Code Generation, Coding Questions Answering, Code Refactoring, Code Issues Fix, Unit Test Cases Generation, Code Documentation Generation | Java, Python, JavaScript, TypeScript, Perl, R, PowerShell, Rust, SQL, CSS, Ruby, Julia, C#, PHP, Swift, C++, Go, HTML, JSON, SCSS, .NET, Less, T-SQL, Markdown | Visual Studio, VS Code, Neovim, JetBrains IDE
Zhipu AI | CodeGeeX (Zheng et al., 2023c) | CodeGeeX | Code Generation, Code Translation, Code Completion, Code Interpretation, Code Bugs Fix, Comment Generation, AI Chatbot | PHP, Go, C, C#, C++, Rust, Perl, CSS, Java, Python, JavaScript, TypeScript, Objective C++, Objective C, Pascal, HTML, SQL, Kotlin, R, Shell, Cuda, Fortran, Tex, Lean, Scala | Clion, RubyMine, AppCode, Aqua, IntelliJ IDEA, VS Code, PyCharm, Android Studio, WebStorm, Rider, GoLand, DataGrip, DataSpell
Amazon | CodeWhisperer (Amazon, 2022) | -- | Code Completion, Code Explanation, Code Translation, Code Security Identification, Code Suggestion | Java, Python, TypeScript, JavaScript, C# | JetBrains IDE, VS Code, AWS Cloud9, AWS Lambda
Codeium | Codeium (Codeium, 2023) | -- | Code Completion, Bug Detection, Code Suggestions, AI Chatbot, Test Type Generation, Test Plan Creation, Codebase Search | More than 70 languages in total, including but not limited to: C, C#, C++, Dart, CSS, Go, Elixir, HTML, Haskell, Julia, Java, JavaScript, Lisp, Kotlin, Lua, Objective-C, Perl, Pascal, PHP, Protobuf, R, Python, Ruby, Scala, Rust, Swift, SQL, TS, Vue | JetBrains, VSCode, Visual Studio, Colab, Jupyter, Deepnote, Notebooks, Databricks, Chrome, Vim, Neovim, Eclipse, Emacs, VSCode Web IDEs, Sublime Text
Huawei | CodeArts Snap (Shen et al., 2023) | PanGu-Coder | Code Generation, Code Explanation, Research and Development Knowledge Question and Answer, Code Comment, Code Debug, Unit Test Case Generation | Java, Python | PyCharm, VS Code, IntelliJ
Tabnine | TabNine (TabNine, 2018) | -- | Code Generation, Code Completion, Code Explanation, Bug Fix, Code Recommendation, Code Refactoring, Code Test Generation, Docstring Generation | Python, Javascript, Java, TypeScript, HTML, Haskell, Matlab, Kotlin, Sass, Go, PHP, Ruby, C, C#, C++, Swift, Rust, CSS, Perl, Angular, Dart, React, Objective C, NodeJS, Scala | Sublime, PyCharm, Neovim, Rider, VS Code, IntelliJ IDE, Visual Studio, PhpStorm, Vim, RubyMine, DataGrip, Android Studio, WebStorm, Emacs, Clion, Jupyter Notebook, JupyterLab, Eclipse, GoLand, AppCode
Replit | Replit (Replit, 2016) | replit-code | Code Completion, Code Editing, Code Generation, Code Explanation, Code Suggestion, Code Test Generation | C#, Bash, C, CSS, C++, Java, Go, HTML, JavaScript, Perl, PHP, Ruby, Python, R, SQL, Rust | --

4.11. Applications

Code LLMs have been integrated with development tools and platforms, such as integrated development environments (IDEs) and version control systems, improving programming efficiency substantially. In this section, we will briefly introduce several widely used applications as coding assistants. The statistics of these applications are provided in Table 8.

GitHub Copilot. GitHub Copilot, powered by OpenAI’s Codex, is an AI pair programmer that helps you write better code faster. Copilot suggests whole lines or blocks of code as you type, based on the context provided by your existing code and comments. It’s trained on a dataset that includes a significant portion of the public code available on GitHub, which enables it to understand a wide range of programming languages and coding styles. Copilot not only improves productivity but also serves as a learning tool by providing programmers with examples of how certain functions can be implemented or how specific problems can be solved.

CodeGeeX. CodeGeeX stands out as a multifaceted programming assistant, proficient in code completion, comment generation, code translation, and developer interactions. Its underlying code generation LLM has been refined with extensive training on vast amounts of code data, exhibiting superior performance on benchmarks like HumanEval, HumanEval-X, and DS1000. Renowned for supporting multilingual code generation, CodeGeeX plays a pivotal role in enhancing the efficiency of code development.

CodeWhisperer. Amazon’s CodeWhisperer is a versatile, machine learning-driven code generator that offers on-the-fly code recommendations. Tailored to your coding patterns and comments, CodeWhisperer provides personalized suggestions that range from succinct comments to complex functions, all aimed at streamlining your coding workflow.

Codeium. Codeium is an AI-accelerated coding toolkit that offers a suite of functions, including code completion, explanation, translation, search, and user chatting. Compatible with over 70 programming languages, Codeium delivers fast and cutting-edge solutions to coding challenges, simplifying the development process for its users.

CodeArts Snap. Huawei’s CodeArts Snap is capable of generating comprehensive function-level code from both Chinese and English descriptions. This tool not only reduces the monotony of manual coding but also efficiently generates test code, in addition to providing automatic code analysis and repair services.

Tabnine. Tabnine is an AI coding assistant that empowers development teams to leverage AI for streamlining the software development lifecycle while maintaining strict standards for privacy, security, and compliance. With a focus on enhancing coding efficiency, code quality, and developer satisfaction, Tabnine offers AI-driven automation that is tailored to the needs of your team. Supporting over one million developers worldwide, Tabnine is applicable across various industries.

Replit. Replit is a multifunctional platform that caters to a diverse array of software development needs. As a free online IDE, it facilitates code collaboration, provides cloud services, and fosters a thriving developer community. Replit also enables users to compile and execute code in more than 50 programming languages directly within a web browser, eliminating the need for local software installations.

5. Challenges & Opportunities

According to our investigations, LLMs have revolutionized the paradigm of code generation and achieved remarkable performance. Despite this promising progress, numerous challenges remain, arising mainly from the gap between academia and practical development. For example, in academia, the HumanEval benchmark has been established as a de facto standard for evaluating the coding proficiency of LLMs. However, many works have shown that evaluation on HumanEval cannot reflect practical development scenarios (Jimenez et al., 2023; Du et al., 2024; Liu et al., 2024b; Ding et al., 2024). At the same time, these serious challenges offer substantial opportunities for further research and applications. In this section, we pinpoint critical challenges and identify promising opportunities, aiming to bridge the gap between research and practice.

Enhancing complex code generation at repository and software scale. Practical development involves a large number of complex programming problems of varying difficulty. While LLMs have shown proficiency in generating function-level code snippets, they often struggle with more complex, unseen programming problems, including the repository- and software-level problems that are commonplace in real-world software development. Solving these requires strong problem-solving skills beyond function-level code generation. For example, AlphaCode (Li et al., 2022a) achieved an average ranking in the top 54.3% in programming competitions, where an understanding of algorithms and complex natural language is required to solve competitive programming problems. (Jimenez et al., 2023) argues that existing LLMs cannot resolve real-world GitHub issues well, since the best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. This poor performance is mainly attributed to weak reasoning capabilities (Huang and Chang, 2022), complex internal and external dependencies (Bairi et al., 2023), and the context length limitation of LLMs (Bairi et al., 2023). Therefore, the pursuit of models that can handle more complex, repository- and software-level code generation opens up new avenues for automation in software development and makes programming more productive and accessible.

Innovating model architectures tuned to code structures. Due to their scalability and effectiveness, Transformer-based LLM architectures have become dominant in solving code generation tasks. Nevertheless, they might not be optimally designed to capture the inherent structure and syntax of programming languages (PLs) (Guo et al., 2020, 2022; Ma et al., 2022; Kou et al., 2023). Code has a highly structured nature, with a syntax that is more rigid than natural language. This presents a unique challenge for LLMs, which are often derived from models that were originally designed for natural language processing (NLP). The development of novel model architectures that inherently understand and integrate the structural properties of code represents a significant opportunity to improve code generation and comprehension. Innovations such as tree-based neural networks (Mou et al., 2014), which mirror the abstract syntax tree (AST) representation of code, can offer a more natural way for models to learn and generate programming languages. Additionally, leveraging techniques from compiler theory, such as intermediate representations (IR) (Li et al., 2022b), could enable models to operate on a more abstract and generalizable level, making them effective across multiple programming languages (Paul et al., 2024). By exploring architectures beyond the traditional sequential models, researchers can unlock new potential in code generation.
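The structural view of code that such architectures aim to exploit can be inspected with Python's standard `ast` module. The sketch below is only an illustration of the AST representation, not a model architecture; the helper names are hypothetical.

```python
import ast
from collections import Counter

def node_type_counts(source: str) -> Counter:
    """Count AST node types, a structural view that ignores layout and comments."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def structurally_equal(source_a: str, source_b: str) -> bool:
    """Compare two programs by their ASTs instead of their token sequences."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))

snippet = "def f(a):\n    return a + 1"
print(node_type_counts(snippet))

# Whitespace and comments differ, but the syntax trees are identical.
print(structurally_equal("x = 1 + 2", "x=1+2  # comment"))  # True
# Operand order changes the tree even though the token multiset is the same.
print(structurally_equal("x = 1 + 2", "x = 2 + 1"))         # False
```

A purely sequential model sees the first pair as different token sequences, whereas a tree-based representation treats them as the same program.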

Curating high-quality code data for pre-training and fine-tuning of LLMs. The efficacy of LLMs largely depends on the quality and diversity of code datasets used during pre-training and fine-tuning phases (Zhou et al., 2024; Köpf et al., 2024; Wettig et al., 2024). Currently, there is a scarcity of large, high-quality datasets that encompass a wide range of programming tasks, styles, and languages. This limitation constrains the ability of LLMs to generalize across unseen programming tasks, different coding environments, and real-world software development scenarios. The development of more sophisticated data acquisition techniques, such as automated code repositories mining (Linstead et al., 2007), advanced filtering algorithms, and code data synthesis (Liu et al., 2024a) (see Section 4.2), can lead to the creation of richer datasets. Collaborations with industry partners (e.g., GitHub) could also facilitate access to proprietary codebases, thereby enhancing the practical relevance of the training material. Furthermore, the adoption of open-source models for dataset sharing can accelerate the collective effort to improve the breadth and depth of code data available for LLM research.

Developing comprehensive benchmarks and metrics for coding proficiency evaluation in LLMs. Current benchmarks like HumanEval may not capture the full spectrum of coding skills required for practical software development (Ni et al., 2023b). Additionally, metrics often focus on syntactic correctness or functional accuracy, neglecting aspects such as code efficiency (Peitek et al., 2021), style (Chen and Abedjan, 2023), readability (Buse and Weimer, 2009), or maintainability (Ardito et al., 2020). The design of comprehensive benchmarks that simulate real-world software development challenges could provide a more accurate assessment of LLMs’ coding capabilities. These benchmarks should include diverse programming tasks of varying difficulty levels, such as debugging (Zhong et al., 2024), refactoring (Shirafuji et al., 2023), and optimization (Ishida et al., 2024), and should be complemented by metrics that evaluate qualitative aspects of code. The establishment of community-driven benchmarking platforms could facilitate continuous evaluation and comparison of LLMs for code generation across the industry and academia.

Support for low-resource, low-level, and domain-specific programming languages. LLMs are predominantly trained on popular high-level programming languages, leaving low-resource, low-level, and domain-specific languages underrepresented. This imbalance restricts the applicability of LLMs in certain specialized fields and in systems programming (Thakur et al., 2023). Intensifying research on transfer learning and meta-learning approaches may enable LLMs to leverage knowledge from high-resource languages to enhance their performance on less common ones (Chen et al., 2022a; Cassano et al., 2023). Additionally, partnerships with domain experts can guide the creation of targeted datasets and fine-tuning strategies to better serve niche markets. The development of LLMs with a capacity for multilingual code generation also presents a significant opportunity for broadening the scope of applications.

Continuous learning for LLMs to keep pace with evolving coding knowledge. The software development landscape is continuously evolving, with new languages, frameworks, and best practices emerging regularly. LLMs risk becoming outdated if they cannot adapt to these changes and incorporate the latest programming knowledge (Jang et al., 2022; Wang et al., 2023d). While retrieval-augmented code generation offers a partial solution to these issues, its effectiveness is inherently constrained by the quality of the retrieved context (Lu et al., 2022; Zhou et al., 2022a; Zhang et al., 2023c). Therefore, establishing mechanisms for continuous learning and updating of LLMs can help maintain their relevance over time. This could involve real-time monitoring of code repositories to identify trends and innovations, as well as the creation of incremental learning systems that can assimilate new information without forgetting previously acquired knowledge. Engaging LLMs in active learning scenarios where they interact with human developers may also foster ongoing knowledge acquisition.
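
To make the dependence on retrieval quality concrete, the toy sketch below ranks candidate snippets by Jaccard overlap of identifier tokens and prepends the best matches to the task description. It is a deliberately simplistic stand-in for the retrievers used in the cited works; the function names, the tokenization rule, and the prompt layout are assumptions for illustration only.

```python
import re


def tokens(text: str) -> set:
    """Lowercased identifier-like tokens, e.g. 'read_json' or 'path'."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))


def retrieve(query: str, corpus, top_k: int = 2):
    """Return up to top_k snippets ranked by Jaccard token overlap with the query."""
    q = tokens(query)
    scored = []
    for doc in corpus:
        d = tokens(doc)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Snippets with no overlap at all are dropped rather than padded in.
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query: str, corpus, top_k: int = 2) -> str:
    """Assemble a prompt whose context section holds the retrieved snippets."""
    context = "\n\n".join(retrieve(query, corpus, top_k))
    return f"# Retrieved context\n{context}\n\n# Task\n{query}\n"
```

Whatever the generator does afterwards, it only sees what `retrieve` returns: if the relevant snippet is missing from the corpus or ranked below `top_k`, the prompt carries stale or irrelevant context.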

Ensuring code safety and aligning LLM outputs with human coding preferences. Ensuring the safety and security of code generated by LLMs is a paramount concern, as is their ability to align with human preferences and ethical standards. Current models may inadvertently introduce vulnerabilities or generate code that does not adhere to desired norms (Chen et al., 2021; Yang et al., 2024). Research into the integration of formal verification tools within the LLM pipeline can enhance the safety of the produced code. Additionally, developing frameworks for alignment learning that capture and reflect human ethical preferences can ensure that the code generation process aligns with societal values (Ouyang et al., 2022; Qi et al., 2023). Transparent and explainable AI methodologies can also contribute to building trust in the LLM-generated code by making the decision-making process more accessible to developers.
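
As a much weaker stand-in for the formal verification tools mentioned above, generated Python code can at least be screened statically before it is executed. The sketch below walks the syntax tree with the standard `ast` module and reports calls from a small deny-list. The deny-list and function names are illustrative assumptions, and such a check proves nothing about correctness or the absence of vulnerabilities.

```python
import ast

# Illustrative deny-list; a real policy would be broader and project-specific.
RISKY_CALLS = {"eval", "exec", "os.system", "subprocess.call"}


def call_name(node: ast.Call) -> str:
    """Return 'name' or 'module.attr' for simple call targets, else ''."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return ""


def flag_risky_calls(source: str):
    """Return sorted (line, name) pairs for deny-listed calls in generated code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [(exc.lineno or 0, "syntax error")]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = call_name(node)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

Because the check is purely syntactic, aliased imports or dynamically constructed calls slip through, which is one reason the paragraph above points to verification and alignment methods rather than pattern matching.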

6. Conclusion

In this survey, we provide a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We give a thorough introduction to and analysis of data curation, the latest advances, performance evaluation, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation in recent years and offer an empirical comparison using the widely recognized HumanEval and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities for code generation. Critical challenges and promising opportunities regarding the gap between academia and practical development are also identified for future investigation. Furthermore, we have established a dedicated resource website to continuously document and disseminate the most recent advances in the field. We hope this survey contributes a comprehensive and systematic overview of LLMs for code generation and promotes the field's thriving evolution. We are optimistic that LLMs will ultimately change all aspects of coding and automatically write safe, helpful, accurate, trustworthy, and controllable code, like professional programmers, and even solve coding problems that humans currently cannot.

References

  • age (2023) 2023. AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser. https://github.com/reworkd/AgentGPT.
  • aut (2023) 2023. AutoGPT is the vision of accessible AI for everyone, to use and to build on. https://github.com/Significant-Gravitas/AutoGPT.
  • bab (2023) 2023. BabyAGI. https://github.com/yoheinakajima/babyagi.
  • Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024).
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Ahmad et al. (2021) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
  • Al-Kaswan et al. (2024) Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. 2024. Traces of Memorisation in Large Language Models for Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12.
  • Allal et al. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 (2023).
  • Allamanis and Sutton (2014) Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd acm sigsoft international symposium on foundations of software engineering. 472–483.
  • AlphaCode Team (2023) Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf.
  • Amazon (2022) Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html.
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2357–2367.
  • Anthropic (2024) Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Ardito et al. (2020) Luca Ardito, Riccardo Coppola, Luca Barbato, and Diego Verga. 2020. A tool-based perspective on software code maintainability metrics: a systematic literature review. Scientific Programming 2020 (2020), 1–26.
  • Athiwaratkun et al. (2022) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868 (2022).
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Babe et al. (2023) Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson. 2023. StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code. arXiv:2306.04556 [cs.LG]
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022).
  • Bairi et al. (2023) Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al. 2023. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499 (2023).
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
  • Barbierato et al. (2022) Enrico Barbierato, Marco L Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. 2022. A methodology for controlling bias and fairness in synthetic data generation. Applied Sciences 12, 9 (2022), 4619.
  • Barke et al. (2023) Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
  • Bi et al. (2024a) Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024a. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024).
  • Bi et al. (2024b) Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Xuanhua Shi, and Hai Jin. 2024b. Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback. arXiv preprint arXiv:2403.16792 (2024).
  • Bird et al. (2022) Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (2022), 35–57.
  • Black et al. (2022) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745 (2022).
  • Black et al. (2021) Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
  • Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Buse and Weimer (2009) Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software engineering 36, 4 (2009), 546–558.
  • Cai et al. (2024) Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts. arXiv preprint arXiv:2404.05019 (2024).
  • Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
  • Cassano et al. (2023) Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg, Abhinav Jangda, and Arjun Guha. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. arXiv preprint arXiv:2308.09895 (2023).
  • Cassano et al. (2022) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2022. A scalable and extensible approach to benchmarking nl2code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022).
  • Chai et al. (2022) Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu. 2022. ERNIE-Code: Beyond english-centric cross-lingual pretraining for programming languages. arXiv preprint arXiv:2212.06742 (2022).
  • Chandel et al. (2022) Shubham Chandel, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2022. Training and evaluating a jupyter notebook data science assistant. arXiv preprint arXiv:2201.12901 (2022).
  • Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45.
  • Chaudhary (2023) Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
  • Chen and Abedjan (2023) Binger Chen and Ziawasch Abedjan. 2023. DUETCS: Code Style Transfer through Generation and Retrieval. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2362–2373.
  • Chen et al. (2022b) Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022b. Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
  • Chen et al. (2022a) Fuxiang Chen, Fatemeh H Fard, David Lo, and Timofey Bryksin. 2022a. On the transferability of pre-trained language models for low-resource programming languages. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 401–412.
  • Chen et al. (2024) Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17754–17762.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Chen et al. (1998) Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998).
  • Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
  • Chen et al. (2018) Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. Advances in neural information processing systems 31 (2018).
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
  • Christopoulou et al. (2022) Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li, et al. 2022. Pangu-coder: Program synthesis with function-level language modeling. arXiv preprint arXiv:2207.11280 (2022).
  • Chung et al. (2024) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, 70 (2024), 1–53.
  • Clement et al. (2020) Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150 (2020).
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
  • CodeGemma Team et al. (2024) CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey Hui, Jingyue Shen, Joe Kelley, Joshua Howland, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Nam Nguyen, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Siqi Zuo, Tris Warkentin, Zhitao Gong, et al. 2024. CodeGemma: Open Code Models Based on Gemma. (2024). https://goo.gle/codegemma
  • Codeium (2023) Codeium. 2023. Free, ultrafast Copilot alternative for Vim and Neovim. https://github.com/Exafunction/codeium.vim.
  • Cognition (2024) Cognition. 2024. Introducing Devin, the first AI software engineer. https://www.cognition.ai/introducing-devin.
  • Cohn et al. (2010) Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. The Journal of Machine Learning Research 11 (2010), 3053–3096.
  • Computations (2023) Cognitive Computations. 2023. oa_leet10k. https://huggingface.co/datasets/cognitivecomputations/oa_leet10k.
  • De Moura and Bjørner (2008) Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.
  • Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems 36 (2024).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Ding et al. (2022a) Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. 2022a. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904 (2022).
  • Ding et al. (2024) Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2024).
  • Ding et al. (2022b) Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2022b. Cocomic: Code completion by jointly modeling in-file and cross-file context. arXiv preprint arXiv:2212.10007 (2022).
  • Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
  • Dou et al. (2024) Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan, Zhiheng Xi, et al. 2024. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. arXiv preprint arXiv:2402.01391 (2024).
  • Du et al. (2024) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  • Fried et al. (2022) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
  • Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
  • Gao et al. (2023a) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. Pal: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
  • Gao et al. (2023b) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023b. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
  • Gong et al. (2024) Linyuan Gong, Mostafa Elhoushi, and Alvin Cheung. 2024. AST-T5: Structure-Aware Pretraining for Code Generation and Understanding. arXiv preprint arXiv:2401.03003 (2024).
  • Gulwani (2010) Sumit Gulwani. 2010. Dimensions in program synthesis. In Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming. 13–24.
  • Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint arXiv:2306.11644 (2023).
  • Guo et al. (2022) Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7212–7225.
  • Guo et al. (2020) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).
  • Guo et al. (2023) Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning. PMLR, 12098–12107.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
  • Gupta et al. (2021) Aman Gupta, Deepak Bhatt, and Anubha Pandey. 2021. Transitioning from Real to Synthetic data: Quantifying the bias in model. arXiv preprint arXiv:2105.04144 (2021).
  • Gupta et al. (2024) Aman Gupta, Anup Shirgaonkar, Angels de Luis Balaguer, Bruno Silva, Daniel Holstein, Dawei Li, Jennifer Marsman, Leonardo O Nunes, Mahsa Rouzbahman, Morris Sharp, et al. 2024. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arXiv preprint arXiv:2401.08406 (2024).
  • Hämäläinen et al. (2023) Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating large language models in generating synthetic hci research data: a case study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–19.
  • Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992 (2023).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021).
  • Hoffa (2016) Felipe Hoffa. 2016. GitHub on BigQuery: Analyze all the open source code. URL: https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code (2016).
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
  • Holt et al. (2023) Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. 2023. L2MAC: Large Language Model Automatic Computer for Unbounded Code Generation. In The Twelfth International Conference on Learning Representations.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).
  • Hong et al. (2023) Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2023).
  • Hou et al. (2024) Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on machine learning. PMLR, 2790–2799.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Huang et al. (2023) Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
  • Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403 (2022).
  • Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023. Association for Computational Linguistics (ACL), 1049–1065.
  • Huang et al. (2022) Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui, Jeevana Priya Inala, Colin Clement, Nan Duan, and Jianfeng Gao. 2022. Execution-based evaluation for data science code generation models. arXiv preprint arXiv:2211.09374 (2022).
  • Huang et al. (2024) Qiuyuan Huang, Naoki Wake, Bidipta Sarkar, Zane Durante, Ran Gong, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Noboru Kuno, Ade Famoti, et al. 2024. Position Paper: Agent AI Towards a Holistic Intelligence. arXiv preprint arXiv:2403.00833 (2024).
  • Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
  • Ishibashi and Nishimura (2024) Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization. arXiv preprint arXiv:2404.02183 (2024).
  • Ishida et al. (2024) Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F Henriques, and Anthony Hu. 2024. LangProp: A code optimization framework using Language Models applied to driving. arXiv preprint arXiv:2401.10314 (2024).
  • Iyer et al. (2018) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1643–1652.
  • Iyer et al. (2022) Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017 (2022).
  • Jang et al. (2022) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Jungkyu Choi, and Minjoon Seo. 2022. Towards Continual Knowledge Learning of Language Models. In 10th International Conference on Learning Representations, ICLR 2022. International Conference on Learning Representations.
  • Jelinek et al. (1977) Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America 62, S1 (1977), S63–S63.
  • Jha et al. (2010) Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. 2010. Oracle-guided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. 215–224.
  • Ji et al. (2023) Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. 2023. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852 (2023).
  • Jiang and Kim (2023) Juyong Jiang and Sunghun Kim. 2023. CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning. https://github.com/juyongjiang/CodeUp.
  • Jiang et al. (2021) Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. Cure: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173.
  • Jiang et al. (2023) Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. Selfevolve: A code evolution framework via large language models. arXiv preprint arXiv:2306.02907 (2023).
  • Jimenez et al. (2023) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations.
  • John Yang (2024) John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. (2024). https://swe-agent.com/
  • Joshi and Rambow (2003) Aravind Joshi and Owen Rambow. 2003. A formalism for dependency grammar based on tree adjoining grammar. In Proceedings of the Conference on Meaning-text Theory. MTT Paris, France, 207–216.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
  • Khan et al. (2023) Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2023. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv preprint arXiv:2303.03004 (2023).
  • Kim et al. (2024) Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. 2024. sDPO: Don’t Use Your Data All at Once. arXiv preprint arXiv:2403.19270 (2024).
  • Kim et al. (2023) Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. 2023. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166 (2023).
  • Kocetkov et al. (2022) Denis Kocetkov, Raymond Li, LI Jia, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, et al. 2022. The Stack: 3 TB of permissively licensed source code. Transactions on Machine Learning Research (2022).
  • Köpf et al. (2024) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems 36 (2024).
  • Kou et al. (2023) Bonan Kou, Shengmai Chen, Zhijie Wang, Lei Ma, and Tianyi Zhang. 2023. Is model attention aligned with human attention? an empirical study on large language models for code generation. arXiv preprint arXiv:2306.01220 (2023).
  • Lachaux et al. (2020) Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511 (2020).
  • Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319–18345.
  • Laurençon et al. (2022) Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems 35 (2022), 31809–31826.
  • Laurer (2024) Moritz Laurer. 2024. Synthetic data: save money, time and carbon with open source. https://huggingface.co/blog/synthetic-data-save-costs.
  • Le et al. (2022) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328.
  • Le Scao et al. (2023) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model. (2023).
  • Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267 (2023).
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
  • Li et al. (2024) Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
  • Li et al. (2023d) Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023d. Towards enhancing in-context learning for code generation. arXiv preprint arXiv:2303.17780 (2023).
  • Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
  • Li et al. (2023c) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023c. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
  • Li et al. (2023b) Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023b. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023).
  • Li et al. (2022a) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022a. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  • Li et al. (2022b) Zongjie Li, Pingchuan Ma, Huaijin Wang, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2022b. Unleashing the power of compiler intermediate representation to enhance neural program embeddings. In Proceedings of the 44th International Conference on Software Engineering. 2253–2265.
  • Lialin et al. (2023) Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647 (2023).
  • Liang et al. (2023) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (2023).
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81.
  • Lin et al. (2022) Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of transformers. AI open 3 (2022), 111–132.
  • Linstead et al. (2007) Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. 2007. Mining internet-scale software repositories. Advances in neural information processing systems 20 (2007).
  • Liu et al. (2023a) Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen, Hailian Zhou, et al. 2023a. Mftcoder: Boosting code llms with multitask fine-tuning. arXiv preprint arXiv:2311.02303 (2023).
  • Liu et al. (2022) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35 (2022), 1950–1965.
  • Liu et al. (2024b) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024b. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
  • Liu et al. (2023d) Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023d. Rltf: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349 (2023).
  • Liu et al. (2023c) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023c. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
  • Liu et al. (2024a) Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. 2024a. Best Practices and Lessons Learned on Synthetic Data for Language Models. arXiv preprint arXiv:2404.07503 (2024).
  • Liu et al. (2020) Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020. Retrieval-Augmented Generation for Code Summarization via Hybrid GNN. In International Conference on Learning Representations.
  • Liu et al. (2023b) Tianyang Liu, Canwen Xu, and Julian McAuley. 2023b. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091 (2023).
  • Lozhkov et al. (2024) Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173 (2024).
  • Lu et al. (2022) Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6227–6240.
  • Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In The Twelfth International Conference on Learning Representations.
  • Ma et al. (2022) Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jie Zhang, Wenhan Wang, and Yang Liu. 2022. Are Code Pre-trained Models Powerful to Learn Code Syntax and Semantics? arXiv preprint arXiv:2212.10017 (2022).
  • Madaan et al. (2024) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2024).
  • Manyika and Hsiao (2023) James Manyika and Sissie Hsiao. 2023. An overview of Bard: an early experiment with generative AI. AI. Google Static Documents 2 (2023).
  • Markovtsev et al. (2019) Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, and Egor Bulychev. 2019. STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 468–478.
  • Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards zero-shot language understanding. Advances in Neural Information Processing Systems 35 (2022), 462–477.
  • Meta (2024) Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/.
  • Mojan Javaheripi (2023) Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models.
  • Mou et al. (2014) Lili Mou, Ge Li, Zhi Jin, Lu Zhang, and Tao Wang. 2014. TBCNN: A tree-based convolutional neural network for programming language processing. arXiv preprint arXiv:1409.5718 (2014).
  • Mozannar et al. (2022) Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2022. Reading between the lines: Modeling user behavior and costs in AI-assisted programming. arXiv preprint arXiv:2210.14306 (2022).
  • Muennighoff et al. (2023) Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124 (2023).
  • Nappa et al. (2015) Antonio Nappa, Richard Johnson, Leyla Bilge, Juan Caballero, and Tudor Dumitras. 2015. The attack of the clones: A study of the impact of shared code on vulnerability patching. In 2015 IEEE symposium on security and privacy. IEEE, 692–708.
  • Ni et al. (2023a) Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023a. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106–26128.
  • Ni et al. (2023b) Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, et al. 2023b. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models. arXiv preprint arXiv:2309.17446 (2023).
  • Nijkamp et al. (2023) Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023).
  • Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
  • Olausson et al. (2023) Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is Self-Repair a Silver Bullet for Code Generation? In The Twelfth International Conference on Learning Representations.
  • OpenAI (2022) OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.
  • OpenDevin (2024) OpenDevin. 2024. OpenDevin: Code Less, Make More. https://github.com/OpenDevin/OpenDevin.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
  • Ovadia et al. (2023) Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2023. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934 (2023).
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
  • Parasaram et al. (2024) Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl Barr, and Sergey Mechtaev. 2024. The Fact Selection Problem in LLM-Based Program Repair. arXiv preprint arXiv:2404.05520 (2024).
  • Parvez et al. (2021) Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021. 2719–2734.
  • Patel et al. (2023) Arkil Patel, Siva Reddy, Dzmitry Bahdanau, and Pradeep Dasigi. 2023. Evaluating In-Context Learning of Libraries for Code Generation. arXiv preprint arXiv:2311.09635 (2023).
  • Paul et al. (2024) Indraneil Paul, Jun Luo, Goran Glavaš, and Iryna Gurevych. 2024. IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators. arXiv preprint arXiv:2403.03894 (2024).
  • Peitek et al. (2021) Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. 2021. Program comprehension and code complexity metrics: An fmri study. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 524–536.
  • Phan et al. (2024) Huy N Phan, Hoang N Phan, Tien N Nguyen, and Nghi DQ Bui. 2024. RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion. arXiv preprint arXiv:2403.06095 (2024).
  • Pinnaparaju et al. (2024) Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym Zhuravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, et al. 2024. Stable Code Technical Report. arXiv preprint arXiv:2404.01226 (2024).
  • Press et al. (2021) Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409 (2021).
  • Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693 (2023).
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Raemaekers et al. (2012) Steven Raemaekers, Arie Van Deursen, and Joost Visser. 2012. Measuring software library stability through historical version analysis. In 2012 28th IEEE international conference on software maintenance (ICSM). IEEE, 378–387.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21, 140 (2020), 1–67.
  • Rajkumar et al. (2022) Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-sql capabilities of large language models. arXiv preprint arXiv:2204.00498 (2022).
  • Ren et al. (2020) Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297 (2020).
  • Replit (2016) Replit. 2016. Idea to software, fast. https://replit.com.
  • Replit (2023) Replit. 2023. replit-code-v1-3b. https://huggingface.co/replit/replit-code-v1-3b.
  • Ridnik et al. (2024) Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024).
  • Roshdieh (2023) Nick Roshdieh. 2023. Evol-Instruct-Code-80k. https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In ICLR 2022-Tenth International Conference on Learning Representations.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155 (2018).
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
  • Shen et al. (2023) Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. 2023. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936 (2023).
  • Shinn et al. (2024) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2024).
  • Shirafuji et al. (2023) Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, and Yutaka Watanobe. 2023. Refactoring Programs Using Large Language Models with Few-Shot Examples. arXiv preprint arXiv:2311.11690 (2023).
  • Shojaee et al. (2023) Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816 (2023).
  • Shrivastava et al. (2023a) Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, and Torsten Scholak. 2023a. RepoFusion: Training Code Models to Understand Your Repository. arXiv preprint arXiv:2306.10998 (2023).
  • Shrivastava et al. (2023b) Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023b. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning. PMLR, 31693–31715.
  • Singh et al. (2023) Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023. Codefusion: A pre-trained diffusion model for code generation. arXiv preprint arXiv:2310.17680 (2023).
  • Su et al. (2024b) Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, and Tao Yu. 2024b. ARKS: Active Retrieval in Knowledge Soup for Code Generation. arXiv preprint arXiv:2402.12317 (2024).
  • Su et al. (2024a) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024a. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063.
  • Svyatkovskiy et al. (2020) Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1433–1443.
  • Szafraniec et al. (2022) Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, and Gabriel Synnaeve. 2022. Code translation with compiler representations. In Proceedings of the Eleventh International Conference on Learning Representations: ICLR.
  • TabNine (2018) TabNine. 2018. AI Code Completions. https://github.com/codota/TabNine.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 (2024).
  • Team (2024) Qwen Team. 2024. Code with CodeQwen1.5. https://qwenlm.github.io/blog/codeqwen1.5.
  • Thakur et al. (2023) Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. 2023. Benchmarking large language models for automated verilog rtl code generation. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Tunstall et al. (2022) Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. 2022. Natural language processing with transformers. O’Reilly Media, Inc.
  • Vaithilingam et al. (2022) Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Chi conference on human factors in computing systems extended abstracts. 1–7.
  • Van Breugel et al. (2023) Boris Van Breugel, Zhaozhi Qian, and Mihaela Van Der Schaar. 2023. Synthetic data, real errors: how (not) to publish and use synthetic data. In International Conference on Machine Learning. PMLR, 34793–34808.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
  • Wang et al. (2024c) Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024c. Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
  • Wang et al. (2024b) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024b. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 1–26.
  • Wang et al. (2022c) Shiqi Wang, Li Zheng, Haifeng Qian, Chenghao Yang, Zijian Wang, Varun Kumar, Mingyue Shang, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. 2022c. ReCode: Robustness Evaluation of Code Generation Models. (2022). https://doi.org/10.48550/arXiv.2212.10264
  • Wang et al. (2023d) Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023d. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218 (2023).
  • Wang et al. (2024a) Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024a. Executable code actions elicit better llm agents. arXiv preprint arXiv:2402.01030 (2024).
  • Wang et al. (2022a) Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu. 2022a. Compilable Neural Code Generation with Compiler Feedback. In Findings of the Association for Computational Linguistics: ACL 2022. 9–19.
  • Wang et al. (2022b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
  • Wang et al. (2023a) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023a. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In The 61st Annual Meeting Of The Association For Computational Linguistics.
  • Wang et al. (2023b) Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. 2023b. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1069–1088.
  • Wang and Li (2021) Yanlin Wang and Hui Li. 2021. Code completion by modeling flattened abstract syntax trees as graphs. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 14015–14023.
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708.
  • Wang et al. (2023c) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023c. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
  • Wang et al. (2022d) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022d. Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481 (2022).
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
  • Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • Wei et al. (2023) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120 (2023).
  • Weng (2023) Lilian Weng. 2023. LLM-powered Autonomous Agents. lilianweng.github.io (Jun 2023). https://lilianweng.github.io/posts/2023-06-23-agent/
  • Wettig et al. (2024) Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for Training Language Models. arXiv preprint arXiv:2402.09739 (2024).
  • Wood et al. (2021) Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. 2021. Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international conference on computer vision. 3681–3691.
  • Wu et al. (2024) Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. 2024. Repoformer: Selective Retrieval for Repository-Level Code Completion. arXiv preprint arXiv:2403.10059 (2024).
  • Wu et al. (2023) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155 (2023).
  • Xi et al. (2023) Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864 (2023).
  • Xie et al. (2024) Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, and Wei Ye. 2024. CodeShell Technical Report. arXiv preprint arXiv:2403.15747 (2024).
  • Xie et al. (2023) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2023. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
  • Xingyao Wang and Neubig (2024) Xingyao Wang, Bowen Li, and Graham Neubig. 2024. Introducing OpenDevin CodeAct 1.0, a new State-of-the-art in Coding Agents. https://www.cognition.ai/introducing-devin.
  • Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
  • Xu et al. (2022) Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1–10.
  • Yang et al. (2024) Zhou Yang, Zhensu Sun, Terry Yue Zhuo, Premkumar Devanbu, and David Lo. 2024. Robustness, security, privacy, explainability, efficiency, and usability of large language models for code. arXiv preprint arXiv:2403.07506 (2024).
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024).
  • Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
  • Yin et al. (2018) Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories. 476–486.
  • Yoo et al. (2024) Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, et al. 2024. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 (2024).
  • Yu et al. (2024) Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.
  • Yu et al. (2018) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3911–3921.
  • Yu et al. (2023) Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2023. Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187 (2023).
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302 (2023).
  • Yuxiang Wei (2024) Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Harm de Vries, Leandro von Werra, Arjun Guha, and Lingming Zhang. 2024. StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation. https://github.com/bigcode-project/starcoder2-self-align.
  • Zaken et al. (2021) Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199 (2021).
  • Zan et al. (2022) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888 (2022).
  • Zan et al. (2023) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2023. Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7443–7464.
  • Zan et al. (2024) Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan, et al. 2024. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. arXiv preprint arXiv:2403.16443 (2024).
  • Zhang et al. (2023c) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023c. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2471–2484.
  • Zhang et al. (2023a) Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023a. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations.
  • Zhang et al. (2023d) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023d. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
  • Zhang et al. (2023e) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023e. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
  • Zhang et al. (2024) Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. arXiv preprint arXiv:2404.05427 (2024).
  • Zhang et al. (2023b) Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023b. Unifying the perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989 (2023).
  • Zhao et al. (2023a) Liang Zhao, Xiaocheng Feng, Xiachong Feng, Bin Qin, and Ting Liu. 2023a. Length Extrapolation of Transformers: A Survey from the Perspective of Position Encoding. arXiv preprint arXiv:2312.17044 (2023).
  • Zhao et al. (2023b) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023b. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
  • Zheng et al. (2024a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024a. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2024).
  • Zheng et al. (2023c) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023c. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684.
  • Zheng et al. (2024b) Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024b. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658 (2024).
  • Zheng et al. (2023b) Wenqing Zheng, SP Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, and Zhangyang Wang. 2023b. Outline, then details: Syntactically guided coarse-to-fine code generation. In International Conference on Machine Learning. PMLR, 42403–42419.
  • Zheng et al. (2023a) Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023a. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
  • Zhong et al. (2024) Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step. arXiv preprint arXiv:2402.16906 (2024).
  • Zhou et al. (2023) Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406 (2023).
  • Zhou et al. (2024) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024).
  • Zhou et al. (2022b) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, et al. 2022b. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations.
  • Zhou et al. (2022a) Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2022a. DocPrompting: Generating Code by Retrieving the Docs. In The Eleventh International Conference on Learning Representations.
  • Zhuo (2024) Terry Yue Zhuo. 2024. ICE-Score: Instructing Large Language Models to Evaluate Code. In Findings of the Association for Computational Linguistics: EACL 2024. 2232–2242.
  • Zhuo et al. (2024) Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, and Niklas Muennighoff. 2024. Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2401.00788 (2024).