JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Jialun Cao, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Zhiyong Chen, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Jiarong Wu, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Shing-Chi Cheung, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Chang Xu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Abstract.

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs’ capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java, resulting in an insufficient understanding of LLMs’ capability to generate Java code. Second, imbalanced code granularity. Function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a mere handful extends to class-/project-levels, and all are limited to Python. Third, lacking advanced features. Existing benchmarks primarily assess basic coding skills (e.g., variables, operators, and control structures), while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). Considering the prevalence of these advanced features in real-world Java project development, constructing benchmarks to test LLMs on handling OOP features is necessary.

To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLM’s capability against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiment yields several interesting findings. First, we noticed that regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLMs, and at most 41.17% Pass@5 in a more relaxed evaluation). Second, using method signature as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench. We also release a leaderboard and invite model developers to participate and test their models against JavaBench at https://java-bench.github.io/leaderboard.html.

Large Language Model, Program Synthesis, Object-Oriented Programming
* Co-first authors.

1. Introduction

Large language models (LLMs) such as ChatGPT (gpt, 2023a, b) have shown advanced proficiency (Belzner et al., 2023) in various tasks such as code generation (Chen et al., 2021; Du et al., 2023; Yu et al., 2023), code reasoning (Gu et al., 2024) and code summarization (Gao et al., 2023). Emerging code generation/completion benchmarks (Chen et al., 2021; Yu et al., 2023; Austin et al., 2021; Iyer et al., 2018; Zheng et al., 2023b; Ding et al., 2023; Li et al., 2024; Zhang et al., 2023) like HumanEval (Chen et al., 2021) have been introduced to evaluate LLMs’ capabilities, providing insights into their strengths and weaknesses in various real-world scenarios, thereby guiding LLM researchers to address related issues more effectively.

Research gap – To gain a comprehensive overview of the current state of these benchmarks, we consolidated the data from the recent studies (Wang et al., 2024; Zan et al., 2022c; Zheng et al., 2023a) and incorporated the latest benchmarks, resulting in Table 1. By analyzing the statistics, we identified three significant imbalances. 1. Imbalanced Programming Languages. There is a disproportionate focus on Python, which constitutes 95.8% (23/24) of benchmarks. Java, despite being the second most popular language on GitHub (git, 2024) (Java holds 11.708% while Python holds 16.925% and is ranked first), is covered by only five function-level benchmarks. The lack of Java benchmarks limits the understanding of LLMs’ capabilities in generating Java code compared to Python. 2. Imbalanced Code Granularity. These benchmarks predominantly feature function-level granularity or lower (i.e., statement-level), accounting for 83.3% (20/24) of the total. Although these benchmarks can exercise LLMs’ ability to generate code for individual functions, a broader context (e.g., cross-function/class) is often required in real-world development scenarios, e.g., inheriting a class and overriding the interface (Li et al., 2024). Such scenarios cannot be adequately assessed by statement-/function-level benchmarks. Only a handful extend to the class or project level, and all of those are limited to Python. 3. Lacking Advanced Features. Current benchmarks comprehensively assess basic coding skills (e.g., variables, data types, operators, and control structures) while overlooking advanced Object-Oriented Programming (OOP) features (encapsulation, inheritance, and polymorphism). OOP promotes modularity and reusability of the code and is thus commonly adopted in real-world development. However, only one recent benchmark (Wang et al., 2024) claims to exercise OOP features, and it does not provide actual code context but merely mentions the OOP concept in the prompt.
In summary, there is a clear gap to fill to adequately test LLMs in handling OOP features, motivating the need for more comprehensive benchmarks.

Benchmark JavaBench – To bridge the research gap, we propose JavaBench, a project-level Java benchmark that exercises OOP features (i.e., encapsulation, inheritance, and polymorphism). It comprises four Java projects that were programming assignments in an entry-level Java course. These four projects contain 389 methods in 106 Java classes, covered by 396 tests, reaching up to 92% code coverage. In addition, JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests in JavaBench. Furthermore, we extensively evaluate five LLMs (e.g., gpt-3.5, DeepSeek, Phind) against JavaBench under a set of comprehensive settings. In particular, we design three context settings (i.e., maximum/minimum/selected context) in prompting, adopt five synthesis strategies (i.e., holistic, independent, incremental, and its two variants), and evaluate the synthesized projects at two evaluation granularities (i.e., class-wise and test-wise) using three metrics (i.e., Completion@k, Compilation@k, and Pass@k).

Our extensive experiments yield several interesting findings. First, in terms of project-level Java programming ability, LLMs are still far behind undergraduate students. The best LLMs under the best setting only reach a 41.7% Pass@5 in test-wise granularity (Section 4.1), compared with 90.93% achieved by undergraduate students under a stricter evaluation. Second, providing only method signatures in the prompt leads to the best results, while too much or too little context degrades project-level code generation.

Contributions – Our contribution is summarized as follows.

  • Significance. We proposed the first project-level Java benchmark that exercises OOP features (i.e., encapsulation, inheritance, and polymorphism). It enables observations of LLMs’ strengths and weaknesses in handling Java OOP features.

  • Novelty. Besides introducing JavaBench, we also introduce a systematic evaluation design to assess LLMs’ capabilities under three context settings at two evaluation granularities using three progressive metrics. This evaluation design provides a reference for future project-level code generation assessments.

  • Evaluation. We conduct extensive experiments that yield several instructive findings. We point out that LLM’s capability to handle OOP features is far behind that of undergraduates. We also identified an optimal context setting with only method signatures provided. Our analysis of bad cases also provides directions for future improvement.

2. Benchmark Construction

2.1. Benchmark Format

An example of a Java project in JavaBench is illustrated in Figure 1. A project comprises a natural-language description of the whole project and a code skeleton with multiple classes (Figure 1 shows only one class due to space limits). Each class includes import statements, a class description, and a class skeleton with multiple methods. Each method has a docstring and is either complete or incomplete; in the latter case, the method body is a TODO to be filled in by LLMs.
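To make the format concrete, the following is a minimal sketch of what a skeleton with complete and incomplete methods might look like. The class names (`Cell`, `Wall`, `Gem`), their methods, and the use of `UnsupportedOperationException` as the TODO placeholder are assumptions made for illustration; they are not taken from the JavaBench projects.

```java
/** A cell on a game board (hypothetical class, for illustration only). */
abstract class Cell {
    // Encapsulation: state is private and exposed only through accessors.
    private final int row;
    private final int col;

    Cell(int row, int col) {
        this.row = row;
        this.col = col;
    }

    int getRow() { return row; }

    int getCol() { return col; }

    /** Returns the single character used to render this cell. */
    abstract char toChar();

    /** Returns whether a player may step onto this cell. This method is complete. */
    boolean isWalkable() {
        return true;
    }
}

/** A wall blocks movement (inheritance and overriding). */
final class Wall extends Cell {
    Wall(int row, int col) { super(row, col); }

    @Override
    char toChar() { return '#'; }

    @Override
    boolean isWalkable() { return false; }
}

/** A gem that can be collected. */
final class Gem extends Cell {
    Gem(int row, int col) { super(row, col); }

    @Override
    char toChar() { return '*'; }

    /**
     * Returns the score awarded for collecting this gem.
     * This method is incomplete: its body is the TODO a model would fill in.
     */
    int score() {
        // TODO
        throw new UnsupportedOperationException("TODO: to be completed");
    }
}
```

Calling `isWalkable()` through a `Cell` reference dispatches to the subclass implementation, which is the kind of polymorphic behavior a completed method has to respect.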

Table 1. Summarization of 24 Existing Benchmarks plus JavaBench. The ones involving Java are highlighted in gray.
Benchmark Time Language Granularity # Funcs # Class # AvgT # Tests # AvgLOC
Concode (Iyer et al., 2018) 2018 Java Function 2,000 0 - -
CoNaLA(Yin et al., 2018) 2018 Python Statement 500 0 - - 1
APPS(Hendrycks et al., 2021) 2021 Python Function 5,000 0 13.2 66,000 21.4
HumanEval (Chen et al., 2021) 2021 Python Function 164 0 7.7 1,263 11.5
MBPP (Austin et al., 2021) 2021 Python Function 974 0 3 2,922 6.8
math-qa (Amini et al., 2019) 2021 Python Statement 2,985 0 - - 7.6
HumanEval-X (Zheng et al., 2023b) 2022 Python, Java, etc. Function 164 0 7.8 1,279 12.1
MBXP (Athiwaratkun et al., 2022) 2022 Python, Java, etc. Function 974 0 3 2,922 6.8
CodeContests 2022 Python, C++ Function 165 0 203.7 33,610 59.8
PandasEval(Zan et al., 2022b) 2022 Python Function 101 0 6.5 656 1.3
NumpyEval(Zan et al., 2022b) 2022 Python Function 101 0 3.5 354 1.1
TorchDataEval(Zan et al., 2022a) 2022 Python Function 50 0 1.1 55 1.3
DS-1000 (Lai et al., 2023) 2022 Python Statement 1,000 0 1.6 1,600 3.8
DSP(Chandel et al., 2022) 2022 Python Function 1,119 0 2.1 2,350 7.6
MultiPL-MBPP(Cassano et al., 2022) 2022 Python, Java, etc. Function 974 0 3.1 3,019 -
MTBP(Nijkamp et al., 2022) 2022 Python Function 115 0 - - -
ODEX(Wang et al., 2022) 2022 Python Function 945 0 1.8 1,701 1.9
BIG-Bench(bench authors, 2023) 2023 Python Function 32 0 4.7 150 -
CoderEval (Yu et al., 2023) 2023 Python, Java Function 230+230 0 - - ≤ 32
CrossCodeEval (Ding et al., 2023) 2023 Python, Java, etc. Statement - 3,534 0 0 96.2
RepoEval (Zhang et al., 2023) 2023 Python Project 1,973 0 - - ≤ 30
ClassEval (Du et al., 2023) 2023 Python, Java, etc. Class 412 100 33.1 3,310 45.7
DevEval (Li et al., 2024) 2024 Python Project 1,874 0 - - 392.7
OOPEval(Wang et al., 2024) 2024 Python Project 0 431 2.5 1,070 0
JavaBench 2024 Java Project 389 106 99 396 1,740
Table 2. Summary of JavaBench Projects
ID | Description | Exercised Concepts | Human Performance: # Stu | Human Performance: Min ~ Max | Human Performance: Mean ± Std
P1 | The project is a text-based version of Pipe Mania using Java. The game involves placing pipes on a grid to connect a source to a sink, utilizing ASCII and Unicode for visualization. Features include interactive controls, water flow simulation, and strategic game-play with conditions for winning and losing. | Basic Java, Interface, Encapsulation, Inheritance, Overriding, Polymorphism, File IO, Exception Handling | 62 | 52.88 ~ 100 | 95.41 ± 7.26
P2 | The project is a text-based console version of Jeson Mor, a Mongolian strategy board game. Using Java, students will implement game mechanics where two players use knights, similar to those in chess, to compete by capturing a central square on the board. | Basic Java, Streams, Encapsulation, Inheritance, Overriding, Polymorphism, Exception Handling | 64 | 20.67 ~ 100 | 91.73 ± 15.05
P3 | The project is an ASCII version of the Inertia puzzle game in Java. The game challenges players to navigate a board to collect gems while avoiding mines, with movement continuing in one direction until an obstruction is encountered. | Basic Java, Encapsulation, Inheritance, Overriding, Polymorphism, Exception Handling | 77 | 26.79 ~ 100 | 90.39 ± 16.66
P4 | The project is a modified Sokoban game featuring a text-based user interface. This enhanced version introduces multiplayer functionality, allowing several players to simultaneously navigate and manipulate designated boxes toward specific locations on the game map. | Basic Java, Encapsulation, Inheritance, Overriding, Polymorphism, Mocking, Exception Handling, Streams, Regex | 79 | 34.27 ~ 100 | 90.96 ± 14.03
Total | | | 282 | 20.67 ~ 100 | 90.93 ± 14.05

2.2. Benchmark Specification

We describe JavaBench from the following three perspectives: (1) Project Description (Section 2.2.1) describes the projects in JavaBench and the corresponding Java features they exercised. (2) Test Construction (Section 2.2.2) describes the process of constructing test cases and reports the code coverage. (3) Human Performance (Section 2.2.3) shows how first- and second-year undergraduates perform in these projects of JavaBench.

2.2.1. Project Description

A summary of the four Java projects in JavaBench is given in Table 2. The primary goal in designing these student projects is to craft exciting and engaging Java projects (e.g., chess games) that encompass a broad array of Java features, including basic Java functionalities, advanced object-oriented programming concepts (e.g., inheritance, polymorphism), and other skills such as file reading and exception handling, so that undergraduates can practice Java programming. Each project covers similar Java concepts, with slight variations highlighted by underlining. Such a design goal also fits the benchmarking of LLMs’ capability to understand and exercise various Java features. In particular, each project in JavaBench is designed to exercise OOP-related features (i.e., inheritance, encapsulation, and polymorphism), highlighted in bold in Table 2.

Besides, each project in JavaBench has a canonical solution prepared by an experienced Java programmer with more than 5 years of experience and cross-validated by other experienced programmers. Moreover, these canonical solutions are released to more than 200 undergraduates (see Section 2.2.3) for review, ensuring the solutions’ correctness. Students are required to keep the course assignments and canonical solutions confidential for academic integrity, which reduces the data contamination (Golchin and Surdeanu, 2023) threat to our benchmark.

The number of functions and classes in each project is listed in Table 3. The four projects have similar scales, with 73 to 125 functions spread across 23 to 30 Java classes. In total, there are 389 functions and 106 Java classes in JavaBench. The lines of code (LoC) of an entire project range from 2,560 to 6,926, with an average of 3,873 lines. Excluding the test suites, the remaining code ranges from 1,352 to 2,373 lines, with an average of 1,740. Compared with the existing benchmarks in Table 1, whose largest average LoC is 392.7 (DevEval), JavaBench involves a much larger context size (1,740 lines) and poses new challenges to Java code generation.

Furthermore, to get a better understanding of JavaBench, we measure the code complexity using two metrics (i.e., cyclomatic (Oman and Hagemeister, 1992; Coleman et al., 1994) and cognitive (Bieri, 1955) complexity) as evaluated in existing works (Yu et al., 2023; Cao et al., 2024). We omit the formulas due to space limits. Conceptually, these two metrics consider the number of decision points or branches, the nesting levels, or the number of logical operators. As shown in Table 3 at the “Complexity” entry, the four projects share similar code complexity values, with P3 being relatively easier than others and P1 being relatively more complex.

Figure 1. An Example of Project Skeleton in JavaBench

2.2.2. Test Construction

The test suites for each project in JavaBench are manually constructed. As with the canonical solutions, the test suites are written by experienced Java programmers, ensuring that each exercised concept in a project is covered by at least one test case. The statistics of tests for each project are tabulated in Table 3. There are 396 tests in total and 49 to 222 tests in each project, with an average of 99 tests. The test suites total 8,532 lines of code, with 2,133 lines per project on average. Test sufficiency is shown by three test coverage metrics (i.e., class coverage, function coverage, and line coverage). As shown in the Test Coverage columns of Table 3, 92% of classes, 87% of functions, and 86.75% of lines are covered by the test suites on average.

Table 3. Code and Test Statistics of JavaBench
ID | Func | Class | LoC (Total) | LoC (w/o Tests) | Complexity (Cyc) | Complexity (Cog) | # Tests | Test LoC | Class Coverage (%) | Func Coverage (%) | Line Coverage (%)
P1 | 89 | 24 | 2,560 | 1,709 | 18.70 | 19.90 | 55 | 851 | 91 | 89 | 81
P2 | 102 | 23 | 3,223 | 1,524 | 8.93 | 9.71 | 49 | 1,699 | 95 | 81 | 87
P3 | 125 | 29 | 6,926 | 2,373 | 12.50 | 9.21 | 222 | 4,553 | 100 | 85 | 87
P4 | 73 | 30 | 2,781 | 1,352 | 16.57 | 10.86 | 70 | 1,429 | 80 | 93 | 92
Total | 389 | 106 | 15,490 | 6,958 | 14.18 | 12.42 | 396 | 8,532 | 92 | 87 | 86.75

Notably, we incorporate mocking (Alshahwan et al., 2010; Galler et al., 2010; Arcuri et al., 2014, 2017; Zhu et al., 2021) into the test suite. Mocking does not affect code generation, while it helps isolate the component under test from its dependencies and increases test stability. An example of using mocking to test the termination status of a Sokoban game (i.e., P4) is shown in Listing 1. We implement it with Mockito and mock three objects (i.e., GameState, TerminalInputEngine, and TerminalRenderingEngine). The assertion then checks that the expected exception is thrown. Indeed, incorporating mocking into test suite design is worth the endeavor, as emphasized by a recent study (Schäfer et al., 2024).

import static org.junit.jupiter.api.Assertions.assertThrowsExactly;
import static org.mockito.Mockito.*;

import java.util.Arrays;
import java.util.HashSet;

import org.junit.jupiter.api.Test;

class TerminalSokobanGameTest {
    @Test
    void testMoreThanTwoPlayers() {
        // Mock the collaborators so that only TerminalSokobanGame is under test.
        final var gameState = mock(GameState.class);
        final var inputEngine = mock(TerminalInputEngine.class);
        final var renderingEngine = mock(TerminalRenderingEngine.class);
        // Stub the game state so that it reports three player positions.
        when(gameState.getAllPlayerPositions()).thenReturn(new HashSet<>(Arrays.asList(Position.of(1, 1), Position.of(1, 2), Position.of(1, 3))));

        // Constructing the game with more than two players must be rejected.
        assertThrowsExactly(IllegalArgumentException.class, () -> new TerminalSokobanGame(gameState, inputEngine, renderingEngine));
    }
}
Listing 1: An example of mocking test in JavaBench
Figure 2. Generation Pipeline for a Java Project. Given a project to be completed, for each method with TODO, there are three types of Context Settings (➊ ~ ➌). On top of method completion, there are three Synthesis Strategies to complete an entire class.

2.2.3. Human Performance

The four Java projects in JavaBench are designed for undergraduate students throughout the four academic years from 2019 to 2022. We then use students’ overall scores as indicators of difficulty levels. As shown in the last entry (Human Performance) of Table 2, 282 undergraduates are involved, and each project is finished by at least 62 students. (Footnote 1: When counting the number of participants, we omit course withdrawals, non-submissions, and blank project submissions from the count because these cases do not attempt to complete the project.)

To rate the students’ submissions, we mainly use the pass rates of the test suite (the full score is 100) as evidence. The last two columns of Table 2 report each project’s minimum and maximum scores and the mean and standard deviation. The difficulty of all projects is similar, with per-project mean scores ranging from 90.39 to 95.41.

Undergraduates can finish the projects in JavaBench with a 90.39% to 95.41% mean test pass rate per project (average 90.93%), indicating that the difficulty of JavaBench is acceptable for humans.

3. Experiment Design

Figure 3. Evaluation Design of Granularities and Metrics. To evaluate an LLM-generated project, two granularities (i.e., class-wise and test-wise) are adopted to replace the related classes to compile corresponding programs $P'_X$, where $X$ denotes a class (A-C) or a test (M-O). Then, three-fold evaluation metrics (i.e., completion, compilation, and pass) are applied to evaluate $P'_X$.

Given a Java project skeleton, we first synthesize the entire project using three context settings (Section 3.1.1) and various synthesis strategies (Section 3.1.2), as shown in Figure 2. Then, we evaluate the synthesized project in terms of three evaluation metrics (Section 3.2.2) in two granularities (Section 3.2.1), as shown in Figure 3.

3.1. Code Synthesis

The synthesis pipeline is shown in Figure 2. Given a skeleton project, we complete it class by class because a class is usually designed to be cohesive and loosely coupled with other classes. To complete a method in a class, providing only the standalone method or class as context is insufficient because it omits the dependencies between classes. Thus, we try three context settings (Section 3.1.1). Since the order in which multiple TODO methods in a class are generated also matters, we try three synthesis strategies (Section 3.1.2). When all the methods in all the classes are synthesized, the skeleton project is completed.

3.1.1. Context Settings

We design three context settings in the synthesis pipeline, as shown in the middle part of Figure 2. A straightforward context is to feed the entire project skeleton to LLMs, providing as much information as possible, i.e., the Maximum Context setting. Note that, because of limits on input length, an LLM may fail to digest the entire project skeleton. Each model has a different context window size, and there is no fixed ratio between code length and token length after tokenization. We therefore truncate the context to its first 8,192 characters (≈ 3,000 tokens) for all studied LLMs. The smallest maximum window length among the studied LLMs is set to 8,192 tokens, which is larger than the 2,048 tokens set in existing works (Du et al., 2023; cod, 2023), ensuring ample space is reserved for output. With this context size, 53.3% of the contexts are truncated, and the truncated characters account for 42.9% of the total characters.

Opposite to the maximum context, it is natural to use only the class to be completed, i.e., Minimum Context. The advantage of this setting is that it fits easily within the input token limits, while the disadvantage is the lack of necessary dependencies for synthesis. Take ClassA in Figure 2 as an example. In ClassA, a private member cb is declared as an instance of the class ClassB. The Minimum Context does not include the code of ClassB, which may make it harder for LLMs to complete methods in ClassA.

Finally, inspired by recent works (Cheng et al., 2024; Shrivastava et al., 2023) that use related contexts (e.g., the invoked methods) to strike a balance between rich information and limited input tokens, we take the third context setting, i.e., Selected Context, into consideration. Specifically, we use jdeps (https://docs.oracle.com/en/java/javase/11/tools/jdeps.html), a command-line tool in the JDK that analyzes Java class files to report on package-level and class-level dependencies, to extract the selected context automatically. In addition, to minimize the input tokens while keeping the context as informative as possible, we only include the method signatures in the related class, excluding the method bodies. For example, as shown in Figure 2 (❸ Selected Context), to generate func1 in ClassA, the methods in the related ClassB are simplified into signatures.
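As an illustration of reducing a related class to its method signatures, here is a minimal sketch. It is not the paper's tooling: it uses Java reflection on an already loaded class rather than jdeps or source-level processing, and the helper `SignatureContext` and the methods of the stand-in `ClassB` are hypothetical names introduced for this example.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: renders a class as signatures only, without method bodies. */
final class SignatureContext {
    private SignatureContext() {}

    /** Returns one "modifiers returnType name(paramTypes);" line per declared method, sorted. */
    static List<String> signaturesOf(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (m.isSynthetic()) {
                continue; // skip compiler-generated methods
            }
            String params = Arrays.stream(m.getParameterTypes())
                    .map(Class::getSimpleName)
                    .collect(Collectors.joining(", "));
            String mods = Modifier.toString(m.getModifiers());
            String prefix = mods.isEmpty() ? "" : mods + " ";
            lines.add(prefix + m.getReturnType().getSimpleName()
                    + " " + m.getName() + "(" + params + ");");
        }
        // getDeclaredMethods() guarantees no order, so sort to keep prompts stable.
        lines.sort(String::compareTo);
        return lines;
    }

    /** Joins the signatures into a class-shaped context snippet. */
    static String render(Class<?> type) {
        return "class " + type.getSimpleName() + " {\n    "
                + String.join("\n    ", signaturesOf(type)) + "\n}";
    }
}

/** Stand-in for the related ClassB of Figure 2; its methods are invented for this example. */
final class ClassB {
    public int size() { return 0; }

    public boolean contains(String key) { return false; }
}
```

`SignatureContext.render(ClassB.class)` yields a body-free view of `ClassB` that could be placed in the prompt next to the full skeleton of the class under completion.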

3.1.2. Synthesis Strategy Design

The order of synthesizing methods in each class may matter, so we consider three synthesis strategies following the practice of a prior work (Du et al., 2023) and consider two more variants. As shown in the right part in Figure 2, strategies are explained as follows:

  • Independent Synthesis: each method is synthesized independently without being affected by other generated methods.

  • Holistic Synthesis: all the methods in a class are synthesized in one pass by LLMs.

  • Incremental Synthesis: methods in a class are generated one by one in a specific order. Different from prior work (Du et al., 2023), in addition to the sequential order of synthesizing methods, we also consider the reverse and random orders.

According to existing work (Du et al., 2023), these synthesis strategies affect open- and closed-source LLMs differently. However, JavaBench differs from their benchmark, i.e., ClassEval, in scale (class-level vs. project-level) and programming language (Python vs. Java), so similar experiments are still worth exploring on JavaBench. Moreover, we further consider different orders in incremental synthesis, which may yield deeper insights.
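The three orders considered for incremental synthesis can be sketched as follows. This covers only the ordering step, not the model calls; `IncrementalOrder`, its `Order` enum, and the seed parameter are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that fixes the order in which TODO methods are synthesized. */
final class IncrementalOrder {
    enum Order { SEQUENTIAL, REVERSE, RANDOM }

    private IncrementalOrder() {}

    /**
     * Returns the TODO methods of one class in the order they would be generated.
     * The seed only matters for RANDOM and keeps runs reproducible.
     */
    static List<String> arrange(List<String> todoMethods, Order order, long seed) {
        List<String> result = new ArrayList<>(todoMethods); // never mutate the caller's list
        switch (order) {
            case REVERSE:
                Collections.reverse(result);
                break;
            case RANDOM:
                Collections.shuffle(result, new Random(seed));
                break;
            case SEQUENTIAL:
            default:
                break; // keep declaration order
        }
        return result;
    }
}
```

In an incremental run, the methods would then be generated one at a time in the returned order, with each completed method written back into the class before the next request.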

3.2. Evaluation Design

Once the entire project is completed by LLMs, we consider two granularities (see Section 3.2.1) to evaluate synthesized projects using three progressive metrics (see Section 3.2.2).

3.2.1. Evaluation Granularity

Figure 3 illustrates an example project P with three classes (A, B, and C; canonical ground-truth solutions are included) and three tests (M, N, and O). The central part shows the following two granularities:

  • Class-wise: To isolate a given generated class from the others, class-wise granularity replaces one class of the canonical solution with its generated counterpart at a time. For example, consider the first column under the I. Class-wise setting in Figure 3: Class A is generated by LLMs, while Class B and Class C are canonical. In other words, Class A (Generated), Class B (Canonical), and Class C (Canonical) form a complete project P’A. Similarly, P’B and P’C are constructed by replacing Class B and Class C, respectively. Then, these projects, each with only one generated class, are evaluated using different metrics (Section 3.2.2), and the average is taken as the result at this granularity.

  • Test-wise: Similarly, this granularity iterates over all the test cases in the test suite and takes the average result. For each test case, we replace the classes related to the test case with their generated counterparts while keeping the other classes of the canonical solution unchanged. For example, as shown in the II. Test-wise setting of Figure 3, test N relates to Class A and Class C, so we replace these two classes with their generated versions while keeping the ground-truth Class B. After enumerating all test cases in the test suite, we evaluate all generated projects using different metrics (Section 3.2.2) and take the average as the result at this granularity.

We adopt these finer granularities to capture the nuanced difference in performance. Otherwise, the successful generation of some classes could be shadowed by failures in other classes when evaluating at a large granularity. For example, a project-wise evaluation requires the entire generated project to be completed/compiled and pass the entire test suite. In contrast, class-wise granularity examines one class at a time, allowing for the isolated assessment of each class. Section 4.1 confirmed the effectiveness of such design.
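The construction of the evaluated projects at both granularities can be sketched as follows, assuming a project is represented as a map from class name to source text. `VariantBuilder` and this representation are assumptions for illustration, not the paper's evaluation harness.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that assembles the project variants used for evaluation. */
final class VariantBuilder {
    private VariantBuilder() {}

    /**
     * Starts from the canonical project and swaps in the generated source for the
     * given classes only. Keys are class names, values are source text.
     */
    static Map<String, String> replace(Map<String, String> canonical,
                                       Map<String, String> generated,
                                       Set<String> classesToReplace) {
        Map<String, String> variant = new LinkedHashMap<>(canonical); // copy, keep canonical intact
        for (String name : classesToReplace) {
            if (!generated.containsKey(name)) {
                throw new IllegalArgumentException("No generated source for " + name);
            }
            variant.put(name, generated.get(name));
        }
        return variant;
    }

    /** Class-wise: exactly one generated class per variant. */
    static Map<String, String> classWise(Map<String, String> canonical,
                                         Map<String, String> generated,
                                         String className) {
        return replace(canonical, generated, Collections.singleton(className));
    }

    /** Test-wise: every class related to the test is replaced by its generated version. */
    static Map<String, String> testWise(Map<String, String> canonical,
                                        Map<String, String> generated,
                                        Set<String> classesRelatedToTest) {
        return replace(canonical, generated, classesRelatedToTest);
    }
}
```

For the example of Figure 3, `classWise(canonical, generated, "A")` corresponds to P’A, and `testWise` with the classes A and C corresponds to the variant evaluated for test N.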

3.2.2. Evaluation Metrics

We evaluate the generated code using three progressive metrics: completion, compilation, and test-case pass rate. All are based on the widely used Pass@$k$ metric (Chen et al., 2021):

(1) $\text{Completion/Compilation/Pass}@k = \mathbb{E}\left[1 - \binom{n-c}{k} \Big/ \binom{n}{k}\right]$

where Pass@$k$, as defined in prior work (Chen et al., 2021), is the expectation of passing all the tests of a problem at least once within $k$ attempts. For each problem, $n$ solutions are sampled from an LLM, and $c$ of the $n$ solutions are correct. The larger $n$ is, the more accurate Pass@$k$ is. Considering the cost and time, we set $n$ to 5 and $k$ to 1 and 5, following the previous study (Du et al., 2023).

Similarly, we introduce Completion@$k$ and Compilation@$k$. Completion@$k$ represents the rate of the designated TODO comments being completed in the generated code. Compilation@$k$ represents the rate of the modified projects (replacing parts of the canonical solution at different granularities) being completed and successfully compiled. Pass@$k$ represents the rate of the modified projects being completed, compiled, and passing the test cases related to a specific class (class-wise granularity) or a specific test case with corresponding modified classes (test-wise granularity).
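Equation (1) can be computed per problem and then averaged, as in the minimal sketch below. The class `PassAtK` and its method names are hypothetical; the product form is a common way to evaluate the ratio of binomial coefficients without computing large factorials.

```java
/** Sketch of the Pass@k estimator of Equation (1); names are illustrative. */
final class PassAtK {
    private PassAtK() {}

    /**
     * Computes 1 - C(n-c, k) / C(n, k) for one problem, using the identity
     * C(n-c, k) / C(n, k) = prod_{i=n-c+1}^{n} (1 - k / i).
     *
     * @param n number of sampled solutions
     * @param c number of samples that succeed (completed, compiled, or passed)
     * @param k number of attempts considered
     */
    static double estimate(int n, int c, int k) {
        if (n <= 0 || c < 0 || c > n || k <= 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= c <= n and 1 <= k <= n");
        }
        if (n - c < k) {
            return 1.0; // every size-k subset contains at least one successful sample
        }
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    /** Averages the per-problem estimates, i.e., the expectation in Equation (1). */
    static double mean(int n, int[] successCounts, int k) {
        if (successCounts.length == 0) {
            throw new IllegalArgumentException("at least one problem is required");
        }
        double sum = 0.0;
        for (int c : successCounts) {
            sum += estimate(n, c, k);
        }
        return sum / successCounts.length;
    }
}
```

With the paper's setting of $n = 5$, a problem with $c = 2$ successful samples gives an estimate of 0.4 for $k = 1$ and 1.0 for $k = 5$. The same function serves Completion@$k$ and Compilation@$k$ when $c$ counts completed or compiled samples instead of passing ones.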

3.3. Prompt Design

The prompt used for LLM generation is shown in Listing 2. Following the common practice of prompting LLMs (wiz, 2023; Du et al., 2023), the prompt template consists of two parts: a system message to initialize the model, and an instruction to state the purpose of the task. In Listing 2, ${·} is a placeholder: ${context} denotes the context (built under one of the three context settings introduced in Section 3.1.1), and ${class} denotes the class to be completed.

3.4. Studied Large Language Models

The studied LLMs are listed in Table 4. We selected five state-of-the-art LLMs that have been widely explored in code generation tasks. We focused on recent LLMs (i.e., released after 2022, following the settings in (Du et al., 2023)) with more than 6B parameters to achieve sufficient efficacy. We considered the instruction-tuned versions of LLMs because we need their instruction-following ability. In particular, we selected WizardCoder (wiz, 2023) because it performs better than its base model, StarCoder (sta, 2023), in multiple coding tasks. We chose Phind (phi, 2023a) over CodeLlama (Roziere et al., 2023) for the same reason. We also chose two versions of DeepSeek because they are ranked at the top of the leaderboard. Finally, we include ChatGPT-3.5 because of its popularity and efficacy. The model size of the studied LLMs was at most 34B, limited by our computational resources.

Table 4. Studied Large Language Models
Base Model | Model | Size | Time
StarCoder (Li et al., 2023) | WizardCoder-15B-V1.0 (Luo et al., 2023) | 15B | June, 2023
DeepSeek (DeepSeek-AI, 2024) | deepseek-coder-6.7b-instruct (Dee, 2023b) | 6.7B | Sep, 2023
DeepSeek (DeepSeek-AI, 2024) | deepseek-coder-33b-instruct (Dee, 2023a) | 33B | Nov, 2023
CodeLlama (Roziere et al., 2023) | Phind-CodeLlama-34B-v2 (phi, 2023b) | 34B | Aug, 2023
- | gpt-3.5-turbo-1106 | - | Nov, 2022
## System Message
You are a helpful Java programmer who writes the project based
on the following context. Java is a high-level, class-based,
object-oriented programming language designed to have as few
implementation dependencies as possible.

## Instruction
‘‘‘java
${context}
‘‘‘

Complete the code and give the completed class
‘‘‘java
${class}
‘‘‘
Listing 2: Prompt Template used in the Experiment
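As an illustration of how the two placeholders in Listing 2 could be filled, the following sketch substitutes ${context} and ${class} in a single pass. It covers only the instruction part of the template; the class PromptBuilder, the method fill, and the example snippets are hypothetical and are not taken from the released implementation.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: filling the ${context} and ${class} placeholders of the prompt template. */
final class PromptBuilder {
    // Built from a single character so the example contains no literal code fence.
    private static final String FENCE = "`".repeat(3);

    // Instruction part of the template, following Listing 2.
    private static final String TEMPLATE = String.join("\n",
        "## Instruction",
        FENCE + "java",
        "${context}",
        FENCE,
        "",
        "Complete the code and give the completed class",
        FENCE + "java",
        "${class}",
        FENCE);

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces every ${name} in one pass, so inserted code is never re-scanned. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("no value for placeholder " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in Java source from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical context and class skeleton.
        String prompt = fill(TEMPLATE, Map.of(
            "context", "interface Shape { double area(); }",
            "class", "class Circle implements Shape { /* TODO */ }"));
        System.out.println(prompt);
    }
}
```

A single-pass substitution is used so that a context containing the text ${class} is inserted verbatim rather than being expanded a second time.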

4. Evaluation

We used nucleus sampling (Holtzman et al., 2020) in line with recent works (Du et al., 2023; Yu et al., 2023; Ouyang et al., 2023; Cao et al., 2024), where five solution samples were randomly generated with a temperature of 0.2 (Chen et al., 2021). The experiments were conducted on a server with two NVIDIA RTX 6000 Ada GPUs, each with 48GB of graphic memory.
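For readers unfamiliar with the decoding setup, the sketch below shows one step of nucleus (top-p) sampling with temperature scaling over a single logit vector. The temperature of 0.2 comes from the text; the nucleus threshold is not stated in the paper, so topP = 0.95 is an assumption, and the class NucleusSampler and the logits are hypothetical. Real inference libraries implement this step internally.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Sketch: temperature scaling followed by nucleus (top-p) sampling. */
final class NucleusSampler {
    /**
     * @param logits      unnormalized scores, one per candidate token
     * @param temperature softmax temperature (> 0)
     * @param topP        cumulative-probability cutoff in (0, 1]
     * @param u           uniform random number in [0, 1) supplied by the caller
     * @return index of the sampled token
     */
    static int sample(double[] logits, double temperature, double topP, double u) {
        int n = logits.length;
        double max = Arrays.stream(logits).max().orElseThrow();
        double[] probs = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            probs[i] = Math.exp((logits[i] - max) / temperature); // numerically stable softmax
            sum += probs[i];
        }
        for (int i = 0; i < n; i++) {
            probs[i] /= sum;
        }

        // Sort token indices by descending probability.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -probs[i]));

        // Keep the smallest prefix whose cumulative probability reaches topP.
        int kept = 0;
        double mass = 0.0;
        while (kept < n && mass < topP) {
            mass += probs[order[kept]];
            kept++;
        }

        // Sample inside the nucleus, renormalized by its total mass.
        double target = u * mass;
        double acc = 0.0;
        for (int j = 0; j < kept; j++) {
            acc += probs[order[j]];
            if (target < acc) {
                return order[j];
            }
        }
        return order[kept - 1]; // guard against floating-point round-off
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, 0.1}; // hypothetical scores for three tokens
        double u = new Random(0).nextDouble();
        System.out.println(sample(logits, 0.2, 0.95, u)); // topP is an assumed value
    }
}
```

A low temperature such as 0.2 sharpens the distribution, so the nucleus often shrinks to a handful of tokens and the five samples stay close to the most likely completion.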

The research questions (RQs) were designed as follows:

  • RQ1. Overall Performance. We first showed the overall performance of the studied LLMs on JavaBench. We used the selected context setting and exercised three synthesis strategies (Section 3.1.2) to generate the entire project. The comprehensive results were reported with three metrics at two granularities.

  • RQ2. Context Selection. The context is an important factor in LLMs’ performance, so we iterated three context settings (Section 3.1.1) and observed the corresponding impacts.

  • RQ3. Incremental Strategies. To synthesize methods in one class, we explored whether the order of synthesizing methods in the class matters.

  • RQ4. Bad Case Analysis. We analyzed five bad cases that failed to compile or pass the test cases due to various issues and identified the incapabilities of LLMs in Java code generation.

4.1. RQ1: Overall Performance

Table 5. RQ1 – Overall Results on JavaBench
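Column groups, left to right (projects P1-P4 in each group, all values in %): Completion@1; Compilation@1 class-wise; Compilation@1 test-wise; Pass@1 class-wise; Pass@1 test-wise. Each Average row reports one value per group in the same order.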
Strategy Model P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4
Holistic WizardCoder-15B-V1.0 98.00 70.00 94.29 80.00 64.00 58.57 71.43 37.14 17.50 0.00 31.76 5.71 63.99 56.87 70.19 36.14 16.51 0.00 31.40 5.03
Holistic deepseek-coder-6.7b-instruct 82.00 90.00 100.00 100.00 64.00 71.43 82.86 71.43 15.00 0.00 70.59 54.29 62.67 70.83 82.55 62.21 14.29 0.00 70.59 10.55
Holistic deepseek-coder-33b-instruct 100.00 95.71 100.00 100.00 68.00 81.43 77.14 85.71 60.00 18.18 29.41 71.43 68.00 80.49 76.78 82.04 60.00 8.74 29.41 34.34
Holistic Phind-CodeLlama-34B-v2 96.00 88.57 91.43 80.00 86.00 68.57 80.00 57.14 80.00 0.00 62.35 20.00 85.81 67.54 79.66 54.58 58.06 0.00 62.32 8.42
Holistic gpt-3.5-turbo-1106 90.00 84.29 100.00 94.29 86.00 77.14 84.29 74.29 75.00 0.00 70.59 17.14 85.87 75.98 83.90 72.24 62.29 0.00 70.59 5.40
Average (Holistic) 91.73 72.33 34.95 70.92 27.40
Independent WizardCoder-15B-V1.0 74.00 72.86 71.43 31.43 44.00 62.86 47.14 17.14 0.00 0.00 0.00 0.00 44.00 61.17 46.92 16.62 0.00 0.00 0.00 0.00
Independent deepseek-coder-6.7b-instruct 80.00 90.00 97.14 85.71 70.00 67.14 78.57 62.86 35.00 0.00 70.59 42.86 69.29 66.40 78.57 55.33 31.55 0.00 70.59 12.75
Independent deepseek-coder-33b-instruct 76.00 78.57 92.86 100.00 66.00 75.71 74.29 74.29 62.50 0.00 37.65 60.00 66.00 73.50 74.26 69.45 56.82 0.00 37.54 21.07
Independent Phind-CodeLlama-34B-v2 74.00 87.14 68.57 71.43 64.00 68.57 61.43 48.57 12.50 0.00 25.88 11.43 63.89 67.38 61.30 46.96 12.50 0.00 25.88 2.71
Independent gpt-3.5-turbo-1106 90.00 88.57 92.86 71.43 78.00 81.43 72.86 42.86 12.50 0.00 64.71 0.00 77.80 80.80 72.86 42.64 12.50 0.00 64.71 0.00
Average (Independent) 79.70 62.89 21.78 61.76 17.43
Incremental WizardCoder-15B-V1.0 70.00 74.29 48.57 28.57 46.00 51.43 35.71 17.14 0.00 0.00 0.00 0.00 45.31 50.84 35.68 16.62 0.00 0.00 0.00 0.00
Incremental deepseek-coder-6.7b-instruct 72.00 90.00 98.57 91.43 68.00 72.86 82.86 68.57 0.00 0.00 70.59 34.29 67.51 71.89 82.15 64.30 0.00 0.00 70.59 8.12
Incremental deepseek-coder-33b-instruct 68.00 77.14 92.86 85.71 54.00 74.29 71.43 54.29 55.00 0.00 14.12 0.00 54.00 73.99 71.38 49.81 49.08 0.00 14.02 0.00
Incremental Phind-CodeLlama-34B-v2 72.00 87.14 67.14 54.29 62.00 72.86 60.00 42.86 12.50 0.00 22.35 5.71 61.99 72.05 59.14 41.01 12.50 0.00 21.87 2.53
Incremental gpt-3.5-turbo-1106 72.00 85.71 90.00 91.43 62.00 80.00 71.43 68.57 10.00 0.00 0.00 25.71 61.50 79.80 71.43 64.84 10.00 0.00 0.00 8.10
Average (Incremental) 75.84 60.81 12.51 59.76 9.84
Average 82.42 65.34 23.08 64.15 18.22

The overall performance of the studied LLMs on JavaBench is shown in Table 5. We first fixed the context setting (i.e., the selected context) and exercised the three synthesis strategies (i.e., Holistic, Independent, and Incremental) shown in Figure 2. The three evaluation metrics (Completion@1, Compilation@1, and Pass@1) form the three main dimensions of Table 5. To better visualize the results, we use darker background colors to indicate larger values. From left to right, the color becomes lighter, meaning that the values get smaller as the three metrics get stricter: only completed code has a chance to be compiled (incomplete code is treated as failed when computing Compilation@k), and only compiled code has a chance to be evaluated against test cases (uncompilable code is treated as failed when computing Pass@k).

Generally, the average Completion@1 of all LLMs on JavaBench was 82.42%, with variances when different synthesis strategies were applied. Considering a finer granularity (class-wise), the average Compilation@1 and Pass@1 were around 65% (65.34% and 64.15%, respectively). The best scores were achieved using the holistic strategy with 91.73% Completion@1, 72.33% Compilation@1, and 70.92% Pass@1. Among all LLMs, DeepSeek-Coder-33b performed the best, followed by gpt-3.5-turbo and DeepSeek-Coder-6.7b.

Finding 1: The best average of 91.73%, 72.33%, and 70.92% Completion@1, Compilation@1, and Pass@1, respectively, could be achieved on JavaBench over the studied LLMs. The Top-3 performing LLMs were DeepSeek-Coder-33b, gpt-3.5-turbo, and DeepSeek-Coder-6.7b among the studied LLMs.

Comparing the synthesis strategies, holistic synthesis was generally better than independent and incremental synthesis across all LLMs, and the declines could be significant in some cases. For example, when switching from holistic to incremental synthesis on P4, WizardCoder dropped 51.43 percentage points (80.00% - 28.57%) in Completion@1, 20.00 points (37.14% - 17.14%) in class-wise Compilation@1, and 19.52 points (36.14% - 16.62%) in class-wise Pass@1. Similar drops were observed for other LLMs when switching from holistic to independent or incremental synthesis. Although incremental or independent synthesis occasionally brought improvements, these cases were sporadic and the improvements were small, e.g., Completion@1 of WizardCoder improved by 4.29 points (74.29% - 70.00%) on P2 under incremental synthesis. This observation differs from that of previous work (Du et al., 2023), where open-source LLMs performed better with independent synthesis than with holistic synthesis. This could be because of the different programming languages (Python vs. Java) and code granularities (class-level vs. project-level).

Finding 2: Among the three synthesis strategies, holistic synthesis yielded a better performance across all LLMs against three metrics at two granularities on JavaBench.

Necessity of Finer-grained Evaluation Granularity. LLMs performed similarly on the four projects (P1-P4), with P1 and P3 slightly better than P2 and P4. In particular, the class-wise scores were always better than the test-wise scores, with a gap in average Pass@1 of up to 49.92 percentage points (= 59.76% - 9.84%, under the incremental strategy). This was in line with our claim in Section 3.2.1: finer-grained granularity can capture more nuanced successful cases. We also calculated the Pass@1 under project-wise granularity and found that none of the projects could be correctly completed, yielding all-zero results under all settings.

Finding 3: The finer granularities (class-/test-wise) can capture subtle success in performance, enabling more distinguishable results. Otherwise, the subtle success can be shadowed by other failures, leading to all-0 results under all settings.

In addition, we increased k from 1 to 5 to investigate the improvements brought by more trials. The detailed experiment results are omitted due to space limits and can be found on our website (Section 8). Overall, the average Completion/Compilation/Pass@5 under the holistic strategy across all LLMs are 97.21% (+5.48%), 84.43% (+12.10%), and 84.43% (+13.51%) at the class-wise granularity, and 94.21% (+5.48%), 51.23% (+16.28%), and 48.24% (+20.84%) at the test-wise granularity. The Pass@5 at the project-wise granularity is still all zeros under all settings.

Finding 4: Increasing k yields further improvement. The best average test-wise Pass@5 in JavaBench is 48.24%. Compared with 90.93% achieved by undergraduate students in project-wise evaluation (Section 2.2.3), LLMs’ capability in Java project-level programming still has much room to improve.

4.2. RQ2: Impact of Context Selection

Table 6. RQ2 – Performance Variance Over Different Context Settings.
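Column groups, left to right (projects P1-P4 in each group, all values in %): Completion@1; Compilation@1 class-wise; Compilation@1 test-wise; Pass@1 class-wise; Pass@1 test-wise. Each Average row reports one value per group in the same order.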
Context Model P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4 P1 P2 P3 P4
Selected WizardCoder-15B-V1.0 98.00 70.00 94.29 80.00 64.00 58.57 71.43 37.14 17.50 0.00 31.76 5.71 63.99 56.87 70.19 36.14 16.51 0.00 31.40 5.03
Selected deepseek-coder-6.7b-instruct 82.00 90.00 100.00 100.00 64.00 71.43 82.86 71.43 15.00 0.00 70.59 54.29 62.67 70.83 82.55 62.21 14.29 0.00 70.59 10.55
Selected deepseek-coder-33b-instruct 100.00 95.71 100.00 100.00 68.00 81.43 77.14 85.71 60.00 18.18 29.41 71.43 68.00 80.49 76.78 82.04 60.00 8.74 29.41 34.34
Selected Phind-CodeLlama-34B-v2 96.00 88.57 91.43 80.00 86.00 68.57 80.00 57.14 80.00 0.00 62.35 20.00 85.81 67.54 79.66 54.58 58.06 0.00 62.32 8.42
Selected gpt-3.5-turbo-1106 90.00 84.29 100.00 94.29 86.00 77.14 84.29 74.29 75.00 0.00 70.59 17.14 85.87 75.98 83.90 72.24 62.29 0.00 70.59 5.40
Average (Selected) 91.73 72.33 34.95 70.92 27.40
Maximum WizardCoder-15B-V1.0 98.00 70.00 88.57 54.29 64.00 58.57 58.57 14.29 37.50 0.00 21.18 0.00 63.97 57.66 57.85 13.29 34.12 0.00 21.18 0.00
Maximum deepseek-coder-6.7b-instruct 94.00 91.43 100.00 88.57 80.00 70.00 71.43 80.00 75.00 0.00 29.41 68.57 80.00 68.38 71.24 68.06 69.12 0.00 29.41 13.39
Maximum deepseek-coder-33b-instruct 100.00 95.71 100.00 100.00 74.00 80.00 75.71 68.57 67.50 18.18 45.88 25.71 74.00 79.36 75.63 62.23 67.50 13.64 45.67 10.72
Maximum Phind-CodeLlama-34B-v2 100.00 92.86 97.14 88.57 76.00 62.86 74.29 54.29 70.00 0.00 58.82 20.00 76.00 61.62 74.09 50.62 64.02 0.00 58.82 7.69
Maximum gpt-3.5-turbo-1106 94.00 85.71 100.00 94.29 68.00 68.57 64.29 62.86 55.00 0.00 36.47 22.86 67.56 66.07 63.88 59.77 51.47 0.00 36.47 7.78
Average (Maximum) 91.66 66.31 32.60 64.56 26.55
Minimum WizardCoder-15B-V1.0 100.00 78.57 90.00 77.14 18.00 30.00 48.57 0.00 0.00 0.00 5.88 0.00 18.00 29.22 48.52 0.00 0.00 0.00 5.88 0.00
Minimum deepseek-coder-6.7b-instruct 80.00 92.86 100.00 85.71 56.00 15.71 57.14 22.86 0.00 0.00 0.00 0.00 55.89 15.69 57.14 19.14 0.00 0.00 0.00 0.00
Minimum deepseek-coder-33b-instruct 100.00 91.43 98.57 82.86 48.00 22.86 65.71 11.43 12.50 0.00 37.65 0.00 48.00 22.58 64.98 5.21 12.50 0.00 37.43 0.00
Minimum Phind-CodeLlama-34B-v2 96.00 90.00 97.14 88.57 66.00 47.14 57.14 8.57 7.50 0.00 0.00 0.00 65.69 46.33 57.14 8.02 7.50 0.00 0.00 0.00
Minimum gpt-3.5-turbo-1106 98.00 85.71 95.71 91.43 66.00 40.00 62.86 20.00 12.50 0.00 0.00 0.00 65.94 39.85 62.86 19.27 12.50 0.00 0.00 0.00
Average (Minimum) 90.99 38.20 3.80 37.47 3.79
Overall Average 91.46 58.95 23.78 57.65 19.25
Figure 4. Number of Characters of Three Context Settings (i.e., Maximum/Minimum and Selected Context, Section 3.1.1). Each color represents each project in JavaBench.

In RQ1, the context setting is fixed to the selected context, where only the context that is related to the class/function to be generated is fed into LLMs. In RQ2, we consider the other two context settings (i.e., maximum and minimum context). We fix the synthesis strategy as holistic because it performs the best in RQ1. The experiment result of RQ2 is visualized in Table 6.

From Table 6, it is clear that among the three context settings, the selected context yields the best overall results, with 70.92% (class-wise) and 27.40% (test-wise) Pass@1, which echoes the results in Table 5. Although the maximum and minimum contexts achieve similar Completion@1 (i.e., 91.66% and 90.99%, compared with 91.73%), their performance on the subsequent metrics (i.e., Compilation@1 and Pass@1) is not as good as that of the selected context.

Finding 5: The selected context is the best setting on JavaBench, resulting in 70.92% (class-wise) Pass@1.

To better understand the size of the total context used by each setting, we visualize the number of characters under each setting (i.e., maximum, minimum and selected context) in Figure 4. Four bars in each group are P1-P4 in JavaBench. Note that we calculate the number of characters instead of the tokens because LLMs utilize different tokenizers, which will affect the counts of tokens. For example, WizardCoder uses GPT2Tokenizer, Phind uses LlamaTokenizer, while DeepSeek uses LlamaTokenizerFast. From Figure 4, it is clear that the number of characters used in Maximum Context is nearly five times that of Minimum Context and more than twice that of Selected Context. Additionally, we can observe that the number of characters for each project is relatively similar, with P2 and P3 having relatively more characters and P4 having fewer.

By combining Table 6 and Figure 4, we can make two observations. First, more context is not always a benefit. For example, a dramatic 54.83% (= 69.12%-14.29%) increase in test-wise Pass@1 is achieved by DeepSeek-6.7b when switching the context from selected to maximum context in P1, while a 23.62% (= 34.34%-10.72%) drop of test-wise Pass@1 in P4 can be observed in DeepSeek-33b. Second, only providing the class to be completed is insufficient without dependencies. We can see from Table 6 that the Pass@1 in the test-wise granularity is almost all zeros under the minimum context setting, meaning that it is almost impossible to generate project-level code that can pass test cases. In addition, the selected context includes only the method signatures (as explained in Section 3.1.1), which turned out to be more effective than the maximum context setting.
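The idea of a signature-only context can be sketched as follows: a dependency is reduced to its class header and body-less method signatures, so field values and method bodies never reach the prompt. The paper does not describe how signatures are extracted; reflection is used here purely as a stand-in, and SignatureContext, signaturesOf, and the nested Piece class are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Collectors;

/** Sketch: reduce a dependency to its method signatures only. */
final class SignatureContext {
    /** Hypothetical dependency of the class to be completed. */
    abstract static class Piece {
        private final String owner = "player-1";

        public final String getOwner() {
            return owner;
        }

        public abstract int[] availableMoves(int row, int col);
    }

    /** Renders a class as its name plus body-less method signatures, sorted by name. */
    static String signaturesOf(Class<?> type) {
        String methods = Arrays.stream(type.getDeclaredMethods())
            .filter(m -> !m.isSynthetic())
            .sorted(Comparator.comparing(Method::getName))
            .map(m -> "    " + Modifier.toString(m.getModifiers()) + " "
                + m.getReturnType().getSimpleName() + " " + m.getName() + "("
                + Arrays.stream(m.getParameterTypes())
                        .map(Class::getSimpleName)
                        .collect(Collectors.joining(", "))
                + ");")
            .collect(Collectors.joining("\n"));
        return "class " + type.getSimpleName() + " {\n" + methods + "\n}";
    }

    public static void main(String[] args) {
        // Prints the two signatures of Piece without the field or any method body.
        System.out.println(signaturesOf(Piece.class));
    }
}
```

Reflection exposes parameter types but not parameter names or Javadoc; a source-level parser would be needed to keep those in the context.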

Finding 6: Providing too much or too little context has a negative impact on project-level code generation. Including the method signature only shows a promising performance.

4.3. RQ3: Impact of Incremental Synthesis

In Table 5, we adopted only the sequential order to synthesize functions incrementally. We therefore further explore whether the order of synthesizing functions in a class matters. Due to space limitations, we only show the results for DeepSeek-Coder-6.7b on Completion/Compilation/Pass@1 and @5. Similar observations hold for the other LLMs and can be found in our released artifact.

Figure 5. RQ3: Impact of Different Incremental Synthesis on DeepSeek-Coder-6.7b. Completion/Compilation/Pass@1 (Upper) and Completion/Compilation/Pass@5 (Lower).

The results on the four projects in JavaBench are illustrated in Figure 5. The dashed horizontal lines in the corresponding colors show the averages across the four projects. Generally, the three colored bars in each project are similar, with slight variances among projects, meaning that the order of incremental synthesis only slightly affects the final results. Interestingly, as the sampling size k increases from 1 to 5, the advantage of the random order becomes more significant. In the lower part of Figure 5, the red dashed line outperforms the other two lines, with an advantage of about 6 percentage points (86.07% - 80%) over the reversed order. On the other hand, the reversed order does not appear to contribute to better results in our experiment.
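The three orders compared in this RQ can be sketched as below: given the TODO methods of a class in declaration order, the sequence in which they are handed to the LLM is kept, reversed, or shuffled. The class SynthesisOrder, the method names, and the fixed seed are hypothetical and only illustrate the setup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: three orders in which the TODO methods of a class could be synthesized. */
final class SynthesisOrder {
    enum Order { SEQUENTIAL, REVERSE, RANDOM }

    /** Returns a new list; the input list is left untouched. */
    static List<String> arrange(List<String> methodsInSourceOrder, Order order, long seed) {
        List<String> result = new ArrayList<>(methodsInSourceOrder);
        switch (order) {
            case SEQUENTIAL:
                break; // keep declaration order
            case REVERSE:
                Collections.reverse(result);
                break;
            case RANDOM:
                Collections.shuffle(result, new Random(seed)); // seeded for reproducibility
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> todo = List.of("push", "pop", "getUndoCount"); // hypothetical method names
        for (Order o : Order.values()) {
            System.out.println(o + " -> " + arrange(todo, o, 42L));
        }
    }
}
```

In incremental synthesis, each generated method becomes part of the context for the next one, which is why the order can influence the final result at all.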

Finding 7: Synthesizing programs incrementally in a random or sequential order could yield at most a 6% improvement over the reversed order on DeepSeek-Coder-6.7b. A similar conclusion was also observed for the other studied LLMs.

4.4. RQ4: Bad Case Analysis

This section discusses failures during completion, compilation, and testing, shows the distribution of runtime errors, and analyzes five bad cases that failed to compile or pass the test suites.

4.4.1. Completion Errors

A completion error happens when LLMs ignore the code to be completed and leave the method body blank. Similar observations (i.e., LLMs ignore the information in the middle of long contexts) were also made in other communities (Liu et al., 2024). From Table 5 and Table 6 under ‘Completion@1’, we can see that among the studied LLMs, completion errors were more commonly observed in WizardCoder than in others.

4.4.2. Compilation Errors

Compilation errors indicate that LLMs have an insufficient understanding of the syntactic and semantic constraints provided by the context. Since Java’s compilation errors are not explicitly categorized, we can only roughly determine the cause of each compilation error in the code through manual judgment. Below, we present three typical errors that occur frequently and are related to the object-oriented programming paradigm.

Compilation Error 1: Inheritance-related Error. An abstract class is meant to be inherited by other classes and cannot be instantiated. However, in Listing 3, the Move class is declared abstract in line 1 but instantiated in line 12, resulting in a compilation error.

1    public abstract class Move extends Action {}
2    public static final class Down extends Move {}
3
4    public class TerminalInputEngine implements InputEngine {
5        @Override
6        public Action fetchAction() {
7            String actionName = ...;
8            switch (actionName) {
9                ...
10               case "move":
11                   final Move.Direction direction = Move.Direction.valueOf(args.toUpperCase());
12 -                 return new Move(initiator, direction);
13 +                 return new Move.Up(initiator);
14           }
15       }
16   }
Listing 3: Compilation Error 1. Inheritance-related Error

Compilation Error 2: Encapsulation-related Error. Listing 4 shows an error caused by improperly handled encapsulation. The player field defined in line 2 is private in class Piece and cannot be accessed directly by its subclasses. The LLM ignores the principle of encapsulation and accesses the player field directly in line 12, causing the compilation error.

1    public abstract class Piece {
2        private final Player player;
3
4        public final Player getPlayer() {
5            return this.player;
6        }
7    }
8
9    public class Knight extends Piece {
10       public Move[] getAvailableMoves(Game game, Place source) {
11           ...
12 -         if (this.player.validateMove(game, move)) {
13 +         if (this.getPlayer().validateMove(game, move)) {
14               ...
15           }
16       }
17   }
Listing 4: Compilation Error 2. Encapsulation-related Error

Compilation Error 3: Illegal Inheritance. Listing 5 shows a case where LLMs fail to resolve the inheritance relationships. The class Player defined in line 3 has no inheritance relationship with the Cell class defined in line 1, yet the LLM mistakenly assumes that the variable cell could be an instance of Player and applies the instanceof operator in line 10, causing the compilation to fail.

1    public abstract class Cell implements BoardElement {}
2    public abstract class Entity implements BoardElement {}
3    public final class Player extends Entity {}
4    public class EntityCell extends Cell {}
5
6    public class GameBoard {
7        public GameBoard(...) {
8            ...
9            Cell cell = ...;
10 -         if (cell instanceof Player) {}
11 +         if (cell instanceof EntityCell entityCell) {
12 +             if (entityCell.getEntity() instanceof Player) {}
13 +         }
14           ...
15       }
16   }
Listing 5: Compilation Error 3. Illegal Inheritance

4.4.3. Test-Failing Errors

The failure of test cases is often accompanied by exceptions thrown during execution. We automatically parsed the error logs and present the exception distribution among LLMs in Figure 6. Each stacked bar shows the exceptions thrown while running the projects synthesized by the corresponding LLM. Different colors represent different types of exceptions. In total, there are 20 types of test-failing errors. Among them, AssertionFailedError occurs most frequently (50.75%), followed by IllegalArgumentException (25.88%).
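A minimal sketch of this kind of log analysis is given below: it scans log lines, extracts the simple name of any exception or error class that starts a line, and counts occurrences. The log format, the class ExceptionTally, and the sample lines are assumptions for illustration; the parser used in the study may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: tally exception types that appear at the start of log lines. */
final class ExceptionTally {
    // A (possibly package-qualified) class name ending in Error or Exception
    // at the beginning of a line; stack frames ("at ...") do not match.
    private static final Pattern THROWABLE =
        Pattern.compile("^\\s*(?:[\\w$]+\\.)*([A-Z]\\w*(?:Error|Exception))\\b");

    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = THROWABLE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical excerpt of a test log.
        List<String> log = List.of(
            "org.opentest4j.AssertionFailedError: expected: <1> but was: <2>",
            "    at CellStackTest.testUndoCount(CellStackTest.java:21)",
            "java.lang.IllegalArgumentException: bound must be positive",
            "    at java.base/java.util.Random.nextInt(Random.java)");
        tally(log).forEach((type, n) -> System.out.println(type + ": " + n));
    }
}
```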

Finding 8: AssertionFailedError and IllegalArgumentException are the top-2 contributors to test failures, accounting for 76.63% of test-failing errors.

In particular, AssertionFailedError may indicate that the LLMs did not fully understand the functionality described in the docstring, so the assertions in the test case failed, while IllegalArgumentException mainly suggests a lack of understanding of the code constraints, leading to illegal arguments. In the following, we analyze two representative cases.

Figure 6. Exception Distribution in LLM-generated Code

Test Error 1: Documentation Non-Following. In Listing 6, the documentation of method getUndoCount in lines 12-14 clearly states that count is the number of pop() calls. However, the method push generated by the LLM also increments count in line 9, mistaking count for the size of cellStack and violating the documentation.

1    public class CellStack {
2        private final Stack<FillableCell> cellStack = new Stack<>();
3        private int count = 0;
4
5        void push(final FillableCell cell) {
6            cellStack.push(cell);
7            // count is the number of pop() invoked
8            // rather than the size of cellStack.
9  -         count++;
10       }
11
12       /**
13        * @return Number of undos (i.e. {@link CellStack#pop()}) invoked.
14        */
15       int getUndoCount() {
16           return count;
17       }
18   }
Listing 6: Test Error 1. Documentation Non-Following

Test Error 2: Trivial Implementation. In this case, the LLM produced a trivial implementation in line 5, constructing and returning a Move array of size zero. This implementation is evidently a placeholder meant solely to pass compilation. When the Random::nextInt method is called in line 12 and receives the parameter (i.e., the length of the Move array), its internal implementation first checks the parameter. Upon finding it to be less than or equal to zero, it throws an IllegalArgumentException.

1    public class JesonMor extends Game {
2        @Override
3        public Move[] getAvailableMoves(Player player) {
4            // This is a trivial implementation.
5            return new Move[0];
6        }
7    }
8
9    Move[] availableMoves = game.getAvailableMoves(player);
10   // Random::nextInt will throw IllegalArgumentException
11   // if the input n <= 0.
12   int next = new Random().nextInt(availableMoves.length);
Listing 7: Test Error 2. Trivial Implementation
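The failure mode in Listing 7 can be reproduced in isolation. The sketch below confirms that java.util.Random.nextInt(int) rejects a non-positive bound and shows one possible caller-side guard; the class EmptyBoundDemo and the -1 return convention are hypothetical. The guard is only illustrative, since the underlying defect remains the placeholder implementation of getAvailableMoves.

```java
import java.util.Random;

/** Sketch: why an empty move array triggers IllegalArgumentException. */
final class EmptyBoundDemo {
    /** Returns a random index into an array of the given length, or -1 if it is empty. */
    static int pickIndex(int length, Random random) {
        if (length <= 0) {
            return -1; // hypothetical convention: signal "no move available"
        }
        return random.nextInt(length);
    }

    /** Returns true if nextInt(0) throws, mirroring line 12 of Listing 7. */
    static boolean throwsOnZeroBound() {
        try {
            new Random().nextInt(0); // the bound must be positive
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("nextInt(0) throws: " + throwsOnZeroBound());
        System.out.println("guarded pick on empty array: " + pickIndex(0, new Random()));
    }
}
```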

5. Threats to validity

This paper has four main threats to validity. First, the threats in benchmark construction: the quality and level of detail of the natural language descriptions for the projects, classes, and methods could affect LLMs’ code generation. To alleviate this threat, we carefully checked the subjects in JavaBench, scanned students’ feedback, and adjusted the descriptions to mitigate confusion and ambiguity. Second, the generalizability to other LLMs. In this study, we only studied five LLMs due to time and hardware limits, so the conclusions may not generalize to other LLMs. Nonetheless, we selected SOTA LLMs from different families as representatives (Section 3.4). Third, the performance variance brought by prompt engineering. Since choosing a single best-performing prompt is challenging (Liu et al., 2023b) and a well-designed prompt could yield better performance, we follow the common practice of prompting LLMs (wiz, 2023) when designing the prompt template to alleviate this threat. Finally, the possible data contamination of JavaBench (Golchin and Surdeanu, 2023): LLMs that have seen the canonical code during training could produce exaggerated scores. However, the projects in JavaBench were kept confidential (Section 2.2.1), so this concern is minor.

6. Related Work

6.1. Benchmarks for Code Generation

6.1.1. Programming Language

Most benchmarks target Python (Table 1), which is dynamic and rich in handy libraries, making it a good fit for building applications. In contrast, Java is static, object-oriented, and rich in design patterns, making it ideal for constructing large projects. Despite being the most popular static language and differing so much in style from Python, Java receives far less attention in code generation benchmarks, especially at the project level. JavaBench is thus proposed to bridge this gap.

6.1.2. Metric

Similarity measurements such as exact match and BLEU (Papineni et al., 2002) used to be mainstream metrics for code generation, as in other NLP tasks. Recently, it was found that such similarity measurements have a weak correlation with semantic correctness (Chen et al., 2021). More and more benchmarks (Chen et al., 2021; Du et al., 2023; Liu et al., 2023a) adopt execution-based correctness such as Pass@k as the major metric. Previous project-level benchmarks (Zhang et al., 2023; Li et al., 2024) adopt a pass rate against test cases to assess LLMs’ capability to generate code. JavaBench assesses LLMs with progressive metrics (completion to compilation to testing) and at finer granularities (class-wise and test-wise).

6.1.3. Context

The context of code generation benchmarks varies from a single line of code to an entire project. Function-level benchmarks such as HumanEval (Chen et al., 2021) remain the most commonly adopted because they are easy to evaluate and offer a reasonable degree of discrimination among models. As LLMs get more powerful, recent benchmarks target more complex contexts, such as class-level (Du et al., 2023) and project-level (Zhang et al., 2023; Li et al., 2024; Wang et al., 2024). Different from these benchmarks, whose subjects were sourced from GitHub, JavaBench is built from entry-level projects that are carefully designed to assess students’ coding ability. JavaBench thus provides a straightforward comparison of LLMs’ programming capability against that of humans.

6.2. LLM-based Code Generation

The code generation technique had a leap enabled by LLMs. In particular, LLMs can handle a much more complex context (e.g., class or project) than symbolic-based approaches (Gulwani, 2011; Feng et al., 2017; Shi et al., 2019) or small-sized language models (Feng et al., 2020; Wang et al., 2021).

6.2.1. Retrieval-Augmented Generation

The code generation technique most relevant to our work is Retrieval-Augmented Generation (RAG). A code project has a long context (documentation and code), which usually exceeds LLMs’ context limit or leads to performance degradation (Liu et al., 2024). Therefore, a long context is usually decomposed into smaller chunks, and only the most related chunks are retrieved based on the problem and included in the prompt as context. For example, RepoCoder (Zhang et al., 2023) uses a sliding window to decompose a project and performs retrieval iteratively with LLM-generated code. Shrivastava et al. (2023) decompose a project more finely into different kinds of context (e.g., imported classes, child classes) and compose different information in the prompt. JavaBench adopts a simple but effective selected context setting that only includes the signatures of dependent types and can be adopted to complement RAG.

7. Conclusion

In this paper, we introduce JavaBench, a project-level benchmark that addresses the scarcity of high-quality Java benchmarks for LLM evaluation. Extensive experiments are conducted, covering the context settings, synthesis strategies, evaluation granularities, and evaluation metrics.

8. Data Availability

We released the implementation and all associated publicly available data at https://github.com/java-bench/JavaBench. We also release a leaderboard and invite model developers to participate and test their models against JavaBench at https://java-bench.github.io/leaderboard.html.

References

  • sta (2023) 2023. bigcode/starcoderbase. https://huggingface.co/bigcode/starcoderbase.
  • cod (2023) 2023. CoderEval/CoderEval. https://github.com/CoderEval/CoderEval/commits/main/CoderEval4Python.json.
  • Dee (2023a) 2023a. Deepseek-33b. https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct.
  • Dee (2023b) 2023b. Deepseek-6.7b. https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct.
  • gpt (2023a) 2023a. GPT-3.5-turbo Model Availability. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-35-turbo-model-availability.
  • gpt (2023b) 2023b. GPT-4 and GPT-4-turbo Preview. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-preview.
  • phi (2023a) 2023a. Phind/Phind-CodeLlama-34B-v2. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/commit/29c3be6006297754f344ba05678c038b0b77f6c0.
  • phi (2023b) 2023b. Phind/Phind-CodeLlama-34B-v2. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
  • wiz (2023) 2023. WizardLM/WizardCoder. https://huggingface.co/WizardLM/WizardCoder-15B-V1.0.
  • git (2024) 2024. GitHub 2.0 Pull Requests Ranking over Programming Languages. https://madnight.github.io/githut/#/pull_requests/2024/1.
  • Alshahwan et al. (2010) Nadia Alshahwan, Yue Jia, Kiran Lakhotia, Gordon Fraser, David Shuler, and Paolo Tonella. 2010. AUTOMOCK: automated synthesis of a mock environment for test case generation. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 2357–2367. https://doi.org/10.18653/v1/N19-1245
  • Arcuri et al. (2014) Andrea Arcuri, Gordon Fraser, and Juan Pablo Galeotti. 2014. Automated unit test generation for classes with environment dependencies. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. 79–90.
  • Arcuri et al. (2017) Andrea Arcuri, Gordon Fraser, and René Just. 2017. Private api access and functional mocking in automated unit test generation. In 2017 IEEE international conference on software testing, verification and validation (ICST). IEEE, 126–137.
  • Athiwaratkun et al. (2022) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2022. Multi-lingual Evaluation of Code Generation Models. (2022). https://doi.org/10.48550/ARXIV.2210.14868
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
  • Belzner et al. (2023) Lenz Belzner, Thomas Gabor, and Martin Wirsing. 2023. Large language model assisted software engineering: prospects, challenges, and a case study. In International Conference on Bridging the Gap between AI and Reality. Springer, 355–374.
  • BIG-bench authors (2023) BIG-bench authors. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=uyTL5Bvosj
  • Bieri (1955) James Bieri. 1955. Cognitive complexity-simplicity and predictive behavior. The Journal of Abnormal and Social Psychology 51, 2 (1955), 263.
  • Cao et al. (2024) Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermeasures in Code Language Model. arXiv preprint arXiv:2403.16898 (2024).
  • Cassano et al. (2022) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2022. A scalable and extensible approach to benchmarking NL2Code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022).
  • Chandel et al. (2022) Shubham Chandel, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2022. Training and evaluating a Jupyter notebook data science assistant. arXiv preprint arXiv:2201.12901 (2022).
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:cs.LG/2107.03374
  • Cheng et al. (2024) Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion. http://arxiv.org/abs/2405.19782 arXiv:2405.19782 [cs].
  • Coleman et al. (1994) D. Coleman, D. Ash, B. Lowther, and P. Oman. 1994. Using metrics to evaluate software system maintainability. Computer 27, 8 (1994), 44–49. https://doi.org/10.1109/2.303623
  • DeepSeek-AI (2024) DeepSeek-AI. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954 (2024). https://github.com/deepseek-ai/DeepSeek-LLM
  • Ding et al. (2023) Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2023. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://arxiv.org/pdf/2310.11248.pdf
  • Du et al. (2023) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. arXiv:cs.CL/2308.01861
  • Feng et al. (2017) Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017. Component-based synthesis for complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 599–612. https://doi.org/10.1145/3009837.3009851
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
  • Galler et al. (2010) Stefan J Galler, Andreas Maller, and Franz Wotawa. 2010. Automatically extracting mock object behavior from design by contract™ specification for test data generation. In Proceedings of the 5th Workshop on Automation of Software Test. 43–50.
  • Gao et al. (2023) Shuzheng Gao, Cuiyun Gao, Yulan He, Jichuan Zeng, Lunyiu Nie, Xin Xia, and Michael Lyu. 2023. Code structure–guided transformer for source code summarization. ACM Transactions on Software Engineering and Methodology 32, 1 (2023), 1–32.
  • Golchin and Surdeanu (2023) Shahriar Golchin and Mihai Surdeanu. 2023. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. CoRR abs/2308.08493 (2023). https://doi.org/10.48550/ARXIV.2308.08493 arXiv:2308.08493
  • Gu et al. (2024) Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065 (2024).
  • Gulwani (2011) Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, Thomas Ball and Mooly Sagiv (Eds.). ACM, 317–330. https://doi.org/10.1145/1926385.1926423
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. NeurIPS (2021).
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
  • Iyer et al. (2018) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 1643–1652. https://doi.org/10.18653/v1/D18-1192
  • Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.), Vol. 202. PMLR, 18319–18345.
  • Li et al. (2024) Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. http://arxiv.org/abs/2405.19856 arXiv:2405.19856 [cs].
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! (2023). arXiv:cs.CL/2305.06161
  • Liu et al. (2023a) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023a. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
  • Liu et al. (2024) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.
  • Liu et al. (2023b) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023b. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9, Article 195 (Jan. 2023), 35 pages. https://doi.org/10.1145/3560815
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
  • Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
  • Oman and Hagemeister (1992) P. Oman and J. Hagemeister. 1992. Metrics for assessing a software system’s maintainability. In Proceedings Conference on Software Maintenance 1992. 337–344. https://doi.org/10.1109/ICSM.1992.242525
  • Ouyang et al. (2023) Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023).
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Pierre Isabelle, Eugene Charniak, and Dekang Lin (Eds.). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Schäfer et al. (2024) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50, 1 (2024), 85–105. https://doi.org/10.1109/TSE.2023.3334955
  • Shi et al. (2019) Kensen Shi, Jacob Steinhardt, and Percy Liang. 2019. FrAngel: component-based synthesis with control structures. Proc. ACM Program. Lang. 3, POPL (2019), 73:1–73:29. https://doi.org/10.1145/3290386
  • Shrivastava et al. (2023) Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-Level Prompt Generation for Large Language Models of Code. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.), Vol. 202. PMLR, 31693–31715. https://proceedings.mlr.press/v202/shrivastava23a.html
  • Wang et al. (2024) Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, and Dacheng Tao. 2024. OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models. arXiv preprint arXiv:2401.06628 (2024).
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 8696–8708. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.685
  • Wang et al. (2022) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481 (2022).
  • Yin et al. (2018) Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In International Conference on Mining Software Repositories (MSR). ACM, 476–486. https://doi.org/10.1145/3196398.3196408
  • Yu et al. (2023) Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Tao Xie, and Qianxiang Wang. 2023. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. arXiv preprint arXiv:2302.00288 (2023).
  • Zan et al. (2022a) Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022a. When language model meets private library. arXiv preprint arXiv:2210.17236 (2022).
  • Zan et al. (2022b) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022b. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888 (2022).
  • Zan et al. (2022c) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022c. Large language models meet NL2Code: A survey. arXiv preprint arXiv:2212.09420 (2022).
  • Zhang et al. (2023) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2471–2484. https://doi.org/10.18653/v1/2023.emnlp-main.151
  • Zheng et al. (2023b) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023b. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 5673–5684. https://doi.org/10.1145/3580305.3599790
  • Zheng et al. (2023a) Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023a. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
  • Zhu et al. (2021) Hengcheng Zhu, Lili Wei, Ming Wen, Yepang Liu, Shing-Chi Cheung, Qin Sheng, and Cui Zhou. 2021. MockSniffer: characterizing and recommending mocking decisions for unit tests (ASE ’20). Association for Computing Machinery, New York, NY, USA, 436–447. https://doi.org/10.1145/3324884.3416539