
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Jialun Cao, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Zhiyong Chen, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Jiarong Wu, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Shing-Chi Cheung, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China
Chang Xu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Abstract.

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs’ capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java, resulting in an insufficient understanding of LLMs’ capability to generate Java code. Second, imbalanced code granularity. Function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a mere handful extends to class-/project-levels, and all are limited to Python. Third, lacking advanced features. Existing benchmarks primarily assess basic coding skills (e.g., variables, operators, and control structures), while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). Considering the prevalence of these advanced features in real-world Java project development, constructing benchmarks to test LLMs on handling OOP features is necessary.

To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLMs’ capabilities against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiments yield several interesting findings. First, regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLM, and at most 41.17% Pass@5 is reached in a more relaxed evaluation). Second, using method signatures as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench. We also release a leaderboard and invite model developers to participate and test their models against JavaBench at https://java-bench.github.io/leaderboard.html.

Large Language Model, Program Synthesis, Object-Oriented Programming

* Co-first authors.

1. Introduction

Large language models (LLMs) such as ChatGPT (gpt, 2023a, b) have shown advanced proficiency (Belzner et al., 2023) in various tasks such as code generation (Chen et al., 2021; Du et al., 2023; Yu et al., 2023), code reasoning (Gu et al., 2024) and code summarization (Gao et al., 2023). Emerging code generation/completion benchmarks (Chen et al., 2021; Yu et al., 2023; Austin et al., 2021; Iyer et al., 2018; Zheng et al., 2023b; Ding et al., 2023; Li et al., 2024; Zhang et al., 2023) like HumanEval (Chen et al., 2021) have been introduced to evaluate LLMs’ capabilities, providing insights into their strengths and weaknesses in various real-world scenarios, thereby guiding LLM researchers to address related issues more effectively.

Research gap – To gain a comprehensive overview of the current state of these benchmarks, we consolidated the data from recent studies (Wang et al., 2024; Zan et al., 2022c; Zheng et al., 2023a) and incorporated the latest benchmarks, resulting in Table 1. By analyzing the statistics, we identified three significant imbalances.

1. Imbalanced Programming Languages. There is a disproportionate focus on Python, which appears in 95.8% (23/24) of benchmarks. Java, despite being the second most popular language on GitHub (git, 2024) (Java holds 11.708%, while Python holds 16.925% and is ranked first), is covered by only five function-level benchmarks. The lack of Java benchmarks limits the understanding of LLMs’ capabilities in generating Java code compared to Python.

2. Imbalanced Code Granularity. These benchmarks predominantly feature function-level granularity or lower (i.e., statement-level), accounting for 83.3% (20/24) of the total. Although these benchmarks can exercise LLMs’ ability to generate code for individual functions, a broader context (e.g., cross-function/class) is often required in real-world development scenarios, e.g., inheriting a class and overriding the interface (Li et al., 2024). Such scenarios cannot be adequately assessed by statement-/function-level benchmarks. Only a handful of benchmarks extend to the class or project level, and all are limited to Python.

3. Lacking Advanced Features. Current benchmarks comprehensively assess basic coding skills (e.g., variables, data types, operators, and control structures) while overlooking advanced Object-Oriented Programming (OOP) features (encapsulation, inheritance, and polymorphism). OOP promotes modularity and reusability of code and is thus commonly adopted in real-world development. However, only one recent benchmark (Wang et al., 2024) claims to exercise OOP features, and it does not provide actual code context but merely mentions the OOP concept in the prompt.

In summary, there is a clear gap to fill to adequately test LLMs in handling OOP features, motivating the need for more comprehensive benchmarks.

Benchmark JavaBench – To bridge the research gap, we propose JavaBench, a project-level Java benchmark that exercises OOP features (i.e., encapsulation, inheritance, and polymorphism). It comprises four Java projects that were programming assignments in an entry-level Java course. These four projects contain 389 methods in 106 Java classes, covered by 396 tests, reaching up to 92% code coverage. In addition, JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests in JavaBench. Furthermore, we extensively evaluate five LLMs (e.g., gpt-3.5, DeepSeek, Phind) against JavaBench under a set of comprehensive settings. In particular, we design three context settings (i.e., maximum/minimum/selected context) in prompting, adopt five synthesis strategies (i.e., holistic, independent, incremental, and its two variants), and evaluate the synthesized projects at two evaluation granularities (i.e., class-wise and test-wise) using three metrics (i.e., Completion@$k$, Compilation@$k$, and Pass@$k$).

Our extensive experiments yield several interesting findings. First, in terms of project-level Java programming ability, LLMs are still far behind undergraduate students. The best LLM under the best setting only reaches a 41.7% Pass@5 at test-wise granularity (Section 4.1), compared with 90.93% achieved by undergraduate students under a stricter evaluation. Second, providing only method signatures in the prompt leads to optimal results, while too much or too little context degrades project-level code generation.

Contributions – Our contribution is summarized as follows.

• Significance. We propose the first project-level Java benchmark that exercises OOP features (i.e., encapsulation, inheritance, and polymorphism). It enables observations of LLMs’ strengths and weaknesses in handling Java OOP features.
• Novelty. Besides introducing JavaBench, we also introduce a systematic evaluation design to assess LLMs’ capabilities under three context settings at two evaluation granularities using three progressive metrics. This evaluation design provides a reference for future project-level code generation assessments.
• Evaluation. We conduct extensive experiments that yield several instructive findings. We point out that LLMs’ capability to handle OOP features is far behind that of undergraduates. We also identify an optimal context setting with only method signatures provided. Our analysis of bad cases also provides directions for future improvement.

2. Benchmark Construction

2.1. Benchmark Format

An example of a Java project in JavaBench is illustrated in Figure 1. A project comprises a natural-language description of the whole project and a code skeleton with multiple classes (Figure 1 shows only one class due to space limits). Each class includes import statements, a class description, and a class skeleton with multiple methods. Each method has a docstring and can be complete or incomplete, i.e., the method body is a TODO to be filled in by LLMs.

Table 1. Summarization of 24 Existing Benchmarks plus JavaBench. The ones involving Java are highlighted in gray.

Benchmark	Time	Language	Granularity	# Funcs	# Class	# AvgT	# Tests	# AvgLOC
Concode (Iyer et al., 2018)	2018	Java	Function	2,000	0	-	-
CoNaLA(Yin et al., 2018)	2018	Python	Statement	500	0	-	-	1
APPS(Hendrycks et al., 2021)	2021	Python	Function	5,000	0	13.2	66,000	21.4
HumanEval (Chen et al., 2021)	2021	Python	Function	164	0	7.7	1,263	11.5
MBPP (Austin et al., 2021)	2021	Python	Function	974	0	3	2,922	6.8
math-qa (Amini et al., 2019)	2021	Python	Statement	2,985	0	-	-	7.6
HumanEval-X (Zheng et al., 2023b)	2022	Python, Java, etc.	Function	164	0	7.8	1,279	12.1
MBXP (Athiwaratkun et al., 2022)	2022	Python, Java, etc.	Function	974	0	3	2,922	6.8
CodeContests	2022	Python, C++	Function	165	0	203.7	33,610	59.8
PandasEval(Zan et al., 2022b)	2022	Python	Function	101	0	6.5	656	1.3
NumpyEval(Zan et al., 2022b)	2022	Python	Function	101	0	3.5	354	1.1
TorchDataEval(Zan et al., 2022a)	2022	Python	Function	50	0	1.1	55	1.3
DS-1000 (Lai et al., 2023)	2022	Python	Statement	1,000	0	1.6	1,600	3.8
DSP(Chandel et al., 2022)	2022	Python	Function	1,119	0	2.1	2,350	7.6
MultiPL-MBPP(Cassano et al., 2022)	2022	Python, Java, etc.	Function	974	0	3.1	3,019	-
MTBP(Nijkamp et al., 2022)	2022	Python	Function	115	0	-	-	-
ODEX(Wang et al., 2022)	2022	Python	Function	945	0	1.8	1,701	1.9
BIG-Bench(bench authors, 2023)	2023	Python	Function	32	0	4.7	150	-
CoderEval (Yu et al., 2023)	2023	Python, Java	Function	230+230	0	-	-	$\leq 32$
CrossCodeEval (Ding et al., 2023)	2023	Python, Java, etc.	Statement	-	3,534	0	0	96.2
RepoEval (Zhang et al., 2023)	2023	Python	Project	1,973	0	-	-	$\leq 30$
ClassEval (Du et al., 2023)	2023	Python, Java, etc.	Class	412	100	33.1	3,310	45.7
DevEval (Li et al., 2024)	2024	Python	Project	1,874	0	-	-	392.7
OOPEval(Wang et al., 2024)	2024	Python	Project	0	431	2.5	1,070	0
JavaBench	2024	Java	Project	389	106	99	396	1,740

Table 2. Summary of JavaBench Projects

		Human Performance
Description	Exercised Concepts	# Stu	Min $\sim$ Max	Mean $\pm$ Std
The project is a text-based version of Pipe Mania using Java. The game involves placing pipes on a grid to connect a source to a sink, utilizing ASCII and Unicode for visualization. Features include interactive controls, water flow simulation, and strategic game-play with conditions for winning and losing.	Basic Java, Interface, Encapsulation, Inheritance, Overriding, Polymorphism, File IO, Exception Handling	-	52.88 $\sim$ 100	95.41 $\pm$ 7.26
The project is a text-based console version of Jeson Mor, a Mongolian strategy board game. Using Java, students will implement game mechanics where two players use knights, similar to those in chess, to compete by capturing a central square on the board.	Basic Java, Streams, Encapsulation, Inheritance, Overriding, Polymorphism, Exception Handling	-	20.67 $\sim$ 100	91.73 $\pm$ 15.05
The project is an ASCII version of the Inertia puzzle game in Java. The game challenges players to navigate a board to collect gems while avoiding mines, with movement continuing in one direction until an obstruction is encountered.	Basic Java, Encapsulation, Inheritance, Overriding, Polymorphism, Exception Handling	-	26.79 $\sim$ 100	90.39 $\pm$ 16.66
The project is a modified Sokoban game featuring a text-based user interface. This enhanced version introduces multiplayer functionality, allowing several players to simultaneously navigate and manipulate designated boxes toward specific locations on the game map.	Basic Java, Encapsulation, Inheritance, Overriding, Polymorphism, Mocking, Exception Handling, Streams, Regex	-	34.27 $\sim$ 100	90.96 $\pm$ 14.03
Total		282	20.67 $\sim$ 100	90.93 $\pm$ 14.05

2.2. Benchmark Specification

We describe JavaBench from the following three perspectives: (1) Project Description (Section 2.2.1) describes the projects in JavaBench and the corresponding Java features they exercised. (2) Test Construction (Section 2.2.2) describes the process of constructing test cases and reports the code coverage. (3) Human Performance (Section 2.2.3) shows how first- and second-year undergraduates perform in these projects of JavaBench.

2.2.1. Project Description

A summary of the four Java projects in JavaBench is given in Table 2. The primary goal in designing these student projects is to craft exciting and engaging Java projects (e.g., chess games) that encompass a broad array of Java features, including basic Java functionalities, advanced object-oriented programming concepts (e.g., inheritance, polymorphism), and other skills such as file reading and exception handling, for undergraduates to practice Java programming. Each project covers similar Java concepts, with slight variations highlighted by underlining. Such a design goal also fits the benchmarking of LLMs’ capability to understand and exercise various Java features. In particular, each project in JavaBench is designed to exercise OOP-related features (i.e., inheritance, encapsulation, and polymorphism), highlighted in bold in Table 2.

Besides, each project in JavaBench has a canonical solution prepared by an experienced Java programmer with more than 5 years of experience and cross-validated by other experienced programmers. Moreover, these canonical solutions are released to more than 200 undergraduates (see Section 2.2.3) for review, ensuring the solutions’ correctness. Students are required to keep the course assignments and canonical solutions confidential for academic integrity, which reduces the data contamination (Golchin and Surdeanu, 2023) threat to our benchmark.

The number of functions and classes in each project is listed in Table 3. The four projects have similar scales, with 73 to 125 functions spread across 23 to 30 Java classes. In total, there are 389 functions and 106 Java classes in JavaBench. The lines of code (LoC) of an entire project range from 2,560 to 6,926, with an average of 3,873 lines. Excluding the test suites, the remaining lines of code range from 1,352 to 2,373, with an average of 1,740. Compared with the existing benchmarks in Table 1, JavaBench involves a much larger context size (1,740 vs. at most 392.7 average LoC) and poses new challenges to Java code generation.

Furthermore, to get a better understanding of JavaBench, we measure the code complexity using two metrics (i.e., cyclomatic (Oman and Hagemeister, 1992; Coleman et al., 1994) and cognitive (Bieri, 1955) complexity) as evaluated in existing works (Yu et al., 2023; Cao et al., 2024). We omit the formulas due to space limits. Conceptually, these two metrics consider the number of decision points or branches, the nesting levels, or the number of logical operators. As shown in Table 3 at the “Complexity” entry, the four projects share similar code complexity values, with P3 being relatively easier than others and P1 being relatively more complex.

Figure 1. An Example of Project Skeleton in JavaBench

2.2.2. Test Construction

The test suites for each project in JavaBench are manually constructed. As with the canonical solutions, the test suites are written by experienced Java programmers, who ensure that each exercised concept in a project is covered by at least one test case. The statistics of the tests for each project are tabulated in Table 3. There are 396 tests in total, with 49 to 222 tests per project and an average of 99. The test suites comprise 8,532 lines of code in total, or 2,133 per project on average. Test sufficiency is shown by three coverage metrics (i.e., class coverage, function coverage, and line coverage). As shown in the last three columns of Table 3, 92% of classes, 87% of functions, and 86.75% of lines are covered by the test suites on average.

Table 3. Code and Test Statistics of JavaBench

			LoC		Complexity		Test Info		Test Coverage (%)
ID	Func	Class	Total	w/o Ts	Cyc	Cog	# Tests	LoC	Class	Func	Line
P1	89	24	2,560	1,709	18.70	19.90	55	851	91	89	81
P2	102	23	3,223	1,524	8.93	9.71	49	1,699	95	81	87
P3	125	29	6,926	2,373	12.50	9.21	222	4,553	100	85	87
P4	73	30	2,781	1,352	16.57	10.86	70	1,429	80	93	92
Total	389	106	15,490	6,958	14.18	12.42	396	8,532	92	87	86.75

Notably, we incorporate mocking (Alshahwan et al., 2010; Galler et al., 2010; Arcuri et al., 2014, 2017; Zhu et al., 2021) into the test suite. It does not affect code generation, but it helps isolate the component under test from its dependencies and increases test stability. An example of using mocking to test the termination status of a Sokoban game (i.e., P4) is shown in Listing 1. We implement it with Mockito (line 2) and mock three objects (i.e., GameState, TerminalInputEngine, and TerminalRenderingEngine) in lines 7-9. Then, the assertion in line 12 checks whether the expected exception is thrown. Indeed, incorporating mocking into test suite design is worth the endeavor, as emphasized by a recent study (Schäfer et al., 2024).

1  import static org.junit.jupiter.api.Assertions.assertThrowsExactly;
2  import static org.mockito.Mockito.*;
3  // Remaining imports (JUnit's @Test, java.util.Arrays and HashSet, project classes) are omitted.
4  class TerminalSokobanGameTest {
5      @Test
6      void testMoreThanTwoPlayers() {
7          final var gameState = mock(GameState.class);
8          final var inputEngine = mock(TerminalInputEngine.class);
9          final var renderingEngine = mock(TerminalRenderingEngine.class);
10         when(gameState.getAllPlayerPositions()).thenReturn(new HashSet<>(Arrays.asList(Position.of(1, 1), Position.of(1, 2), Position.of(1, 3))));
11
12         assertThrowsExactly(IllegalArgumentException.class, () -> new TerminalSokobanGame(gameState, inputEngine, renderingEngine));
13     }
14 }

Listing 1: An example of mocking test in JavaBench

2.2.3. Human Performance

The four Java projects in JavaBench were designed for undergraduate students over the four academic years from 2019 to 2022. We use students’ overall scores as indicators of difficulty levels. As shown in the last entry (Human Performance) of Table 2, 282 undergraduates are involved¹, and each project is finished by at least 62 students.

¹ When counting the number of participants, we omit course withdrawals, non-submissions, and blank project submissions because these cases do not attempt to complete the project.

To rate the students’ submissions, we mainly use the pass rates (full score is 100) of the test suite as evidence. The last two columns of Table 2 report each project’s minimum/maximum scores and the mean and standard deviation. The difficulty of all projects is similar, with average scores of 90.96 to 95.41.

Undergraduates can finish the projects in JavaBench with a 90.96% to 95.41% test pass rate (average 90.93%), indicating the difficulty of JavaBench is acceptable for humans.

3. Experiment Design

Given a Java project skeleton, we first synthesize the entire project using three context settings (Section 3.1.1) and various synthesis strategies (Section 3.1.2), as shown in Figure 2. Then, we evaluate the synthesized project in terms of three evaluation metrics (Section 3.2.2) in two granularities (Section 3.2.1), as shown in Figure 3.

3.1. Code Synthesis

The synthesis pipeline is shown in Figure 2. Given a skeleton project, we complete it class-by-class because a class is usually designed to be cohesive and loosely coupled with others. To complete a method in a class, providing only a standalone method/class as context is insufficient because the dependencies between classes are missing. Thus, we try three context settings (Section 3.1.1). Since the order in which multiple TODO methods in a class are generated also matters, we try three synthesis strategies (Section 3.1.2). When all the methods in all the classes are synthesized, the skeleton project is completed.

3.1.1. Context Settings

We design three context settings in the synthesis pipeline, as shown in the middle part of Figure 2.

A straightforward choice is to feed the entire project skeleton to LLMs, providing as much information as possible, i.e., the Maximum Context setting. Note that due to the limitation on long input contexts, an LLM may fail to digest the entire project skeleton. Each model has a different context window size, and there is no fixed ratio between code length and token length after tokenization. We therefore truncate the context to its first 8,192 characters ($\approx$ 3,000 tokens) for all studied LLMs. The smallest maximum window length among the studied LLMs is set to 8,192 tokens, which is larger than the 2,048 tokens set in existing works (Du et al., 2023; cod, 2023), ensuring ample space is reserved for output. With this context size, 53.3% of the contexts are truncated, and the truncated characters account for 42.9% of the total characters.

Opposite to the maximum context, it is natural to use only the class to be completed, i.e., the Minimum Context setting. Its advantage is that it fits easily within the input token limits, while its disadvantage is the lack of dependencies necessary for synthesis. Take ClassA in Figure 2 as an example. In ClassA, a private member cb is declared as an instance of the class ClassB. The minimum context does not include the code of ClassB, which may make it challenging for LLMs to complete the methods in ClassA.

Finally, inspired by recent works (Cheng et al., 2024; Shrivastava et al., 2023) that use related contexts (e.g., the invoked methods) to strike a balance between rich information and limited input tokens, we consider a third context setting, i.e., Selected Context. Specifically, we use jdeps (https://docs.oracle.com/en/java/javase/11/tools/jdeps.html), a command-line tool in the JDK that analyzes Java class files to report package-level and class-level dependencies, to extract the selected context automatically. In addition, to minimize the input tokens while keeping the context as informative as possible, we include only the method signatures of the related classes, excluding the method bodies. For example, as shown in Figure 2 (❸ Selected Context), to generate func1 in ClassA, the methods in the related ClassB are simplified into signatures.
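The signature-only idea and the character-based truncation described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `ClassB` is a hypothetical stand-in for the related class in Figure 2, `signaturesOf` and `truncate` are hypothetical helper names, and reflection on a compiled class is used here in place of the jdeps-based dependency extraction, which is not reproduced.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.stream.Collectors;

class SelectedContextSketch {
    // Hypothetical dependency of the class under completion (stands in for ClassB in Figure 2).
    static class ClassB {
        public int size() { return 0; }
        public String name(int index) { return "b" + index; }
    }

    // Keep only the declared method signatures of a related class, dropping method bodies.
    static String signaturesOf(Class<?> related) {
        return Arrays.stream(related.getDeclaredMethods())
                .filter(m -> !m.isSynthetic())      // skip compiler- or tool-generated methods
                .map(Method::toGenericString)
                .sorted()                           // reflection order is unspecified, so sort
                .map(sig -> sig + ";")
                .collect(Collectors.joining("\n"));
    }

    // Keep at most maxChars leading characters of the context.
    static String truncate(String context, int maxChars) {
        return context.length() <= maxChars ? context : context.substring(0, maxChars);
    }

    public static void main(String[] args) {
        // 8192 is the character budget stated in the text.
        String context = truncate(signaturesOf(ClassB.class), 8192);
        System.out.println(context);
    }
}
```

Truncating by characters rather than tokens keeps the budget model-independent, which matches the observation that there is no fixed ratio between code length and token length.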

3.1.2. Synthesis Strategy Design

The order of synthesizing methods in each class may matter, so we consider three synthesis strategies following the practice of prior work (Du et al., 2023), plus two variants of the incremental strategy. As shown in the right part of Figure 2, the strategies are as follows:

• Independent Synthesis: each method is synthesized independently without being affected by other generated methods.
• Holistic Synthesis: all the methods in a class are synthesized in one pass by LLMs.
• Incremental Synthesis: methods in a class are generated one by one in a specific order. Different from prior work (Du et al., 2023), in addition to the sequential order of synthesizing methods, we also consider the reverse and random orders.

According to existing work (Du et al., 2023), these synthesis strategies affect open- and closed-source LLMs differently. However, JavaBench differs from their benchmark, i.e., ClassEval, in scale (class-level vs. project-level) and programming language (Python vs. Java), so similar experiments are still worth exploring on JavaBench. Moreover, we further consider different orders in incremental synthesis, which may yield deeper insights.
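The three orders used for incremental synthesis can be sketched as below. This is an illustrative sketch only: the names `Order`, `plan`, and `synthesizeIncrementally` are hypothetical, and the model call is replaced by a placeholder that merely records the order in which methods would be completed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SynthesisOrderSketch {
    enum Order { SEQUENTIAL, REVERSE, RANDOM }

    // Return the order in which the TODO methods of one class are handed to the model.
    static List<String> plan(List<String> todoMethods, Order order, long seed) {
        List<String> planned = new ArrayList<>(todoMethods);
        switch (order) {
            case REVERSE:
                Collections.reverse(planned);
                break;
            case RANDOM:
                Collections.shuffle(planned, new Random(seed));
                break;
            default:
                break; // SEQUENTIAL keeps the source order
        }
        return planned;
    }

    // Incremental synthesis: each completed method becomes part of the context of the next one.
    static String synthesizeIncrementally(List<String> plannedMethods) {
        StringBuilder classSoFar = new StringBuilder();
        for (String method : plannedMethods) {
            // Placeholder for an LLM call that would receive classSoFar as context.
            classSoFar.append("/* completed */ ").append(method).append('\n');
        }
        return classSoFar.toString();
    }

    public static void main(String[] args) {
        List<String> todo = List.of("func1", "func2", "func3");
        System.out.println(plan(todo, Order.REVERSE, 0L));
        System.out.print(synthesizeIncrementally(plan(todo, Order.SEQUENTIAL, 0L)));
    }
}
```

By contrast, independent synthesis would call the model once per method with the untouched skeleton, and holistic synthesis would call it once per class.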

3.2. Evaluation Design

Once the entire project is completed by LLMs, we consider two granularities (see Section 3.2.1) to evaluate synthesized projects using three progressive metrics (see Section 3.2.2).

3.2.1. Evaluation Granularity

Figure 3 illustrates an example project P with three classes (A, B, and C; canonical ground-truth solutions are included) and three tests (M, N, and P). The central part shows the following two granularities:

• Class-wise: To isolate a generated class from the others, class-wise granularity replaces one class of the canonical solution with its generated counterpart at a time. For example, consider the first column under the I. Class-wise setting in Figure 3: Class A is generated by LLMs, while Class B and Class C are canonical. In other words, Class A (Generated), Class B (Canonical), and Class C (Canonical) form a complete project $P'_A$. Similarly, $P'_B$ and $P'_C$ are constructed by replacing Class B and Class C, respectively. Then, these projects, each with only one generated class, are evaluated using different metrics (Section 3.2.2), and the average is taken as the result at this granularity.
• Test-wise: Similarly, this granularity iterates over all the test cases in the test suite and takes the average result. For each test case, we replace the classes related to the test case with their generated versions while keeping the other classes of the canonical solution unchanged. For example, as shown in the II. Test-wise setting of Figure 3, test N relates to Class A and Class C, so we replace these two classes with their generated versions while keeping the ground-truth Class B. After enumerating all test cases in the test suite, we evaluate all generated projects using different metrics (Section 3.2.2) and take the average as the result at this granularity.

We adopt these finer granularities to capture nuanced differences in performance. Otherwise, the successful generation of some classes could be overshadowed by failures in other classes when evaluating at a coarser granularity. For example, a project-wise evaluation requires the entire generated project to be completed, compiled, and pass the entire test suite. In contrast, class-wise granularity examines one class at a time, allowing for the isolated assessment of each class. Section 4.1 confirms the effectiveness of such a design.
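The way project variants are assembled at the two granularities can be sketched as follows. This is a simplified illustration under assumptions: classes are represented as name-to-source maps, `assemble` is a hypothetical helper, and the mapping from test N to Class A and Class C is taken from the Figure 3 example rather than computed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GranularitySketch {
    // Build a project variant in which only the classes in `replaced` come from the LLM;
    // every other class keeps its canonical source.
    static Map<String, String> assemble(Map<String, String> canonical,
                                        Map<String, String> generated,
                                        Set<String> replaced) {
        Map<String, String> project = new LinkedHashMap<>(canonical);
        for (String name : replaced) {
            project.put(name, generated.get(name));
        }
        return project;
    }

    public static void main(String[] args) {
        Map<String, String> canonical =
                Map.of("A", "A-canonical", "B", "B-canonical", "C", "C-canonical");
        Map<String, String> generated =
                Map.of("A", "A-generated", "B", "B-generated", "C", "C-generated");

        // Class-wise: one generated class at a time (P'_A, P'_B, P'_C).
        for (String cls : List.of("A", "B", "C")) {
            System.out.println(cls + " -> " + assemble(canonical, generated, Set.of(cls)));
        }
        // Test-wise: test N relates to Class A and Class C, so both are replaced.
        System.out.println("N -> " + assemble(canonical, generated, Set.of("A", "C")));
    }
}
```

Each assembled variant would then be compiled and run against the relevant tests, and the per-variant results averaged.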

3.2.2. Evaluation Metrics

We evaluate the generated code using three progressive metrics: completion, compilation, and test case pass rate. All are based on the widely used Pass@$k$ metric (Chen et al., 2021):

$$\text{Completion/Compilation/Pass@}k=\mathbb{E}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \tag{1}$$

where Pass@$k$, as defined in prior work (Chen et al., 2021), is the expected probability that at least one of $k$ attempts passes all the tests of a problem. For each problem, $n$ solutions are sampled from an LLM, and $c$ of the $n$ solutions are correct. The larger $n$ is, the more accurate the Pass@$k$ estimate is. Considering the cost and time, we set $n$ to 5 and $k$ to 1 and 5, following the previous study (Du et al., 2023).
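Equation (1) can be computed with a short routine such as the one below. This is a minimal sketch, not the authors' evaluation script: the class and method names are hypothetical, the correct-sample counts in `main` are made-up inputs, and the ratio of binomial coefficients is evaluated as a running product, using the identity $\binom{n-c}{k}/\binom{n}{k}=\prod_{i=0}^{k-1}\frac{n-c-i}{n-i}$.

```java
class PassAtKSketch {
    // Per-problem estimate 1 - C(n-c, k) / C(n, k),
    // computed as a running product to avoid large binomial coefficients.
    static double estimate(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every size-k subset contains at least one correct sample
        }
        double allWrong = 1.0;
        for (int i = 0; i < k; i++) {
            allWrong *= (double) (n - c - i) / (n - i);
        }
        return 1.0 - allWrong;
    }

    // Average the per-problem estimates; correctCounts[i] is c for problem i.
    static double average(int n, int[] correctCounts, int k) {
        double sum = 0.0;
        for (int c : correctCounts) {
            sum += estimate(n, c, k);
        }
        return sum / correctCounts.length;
    }

    public static void main(String[] args) {
        int[] correctCounts = {0, 1, 5}; // hypothetical counts of correct samples per problem
        System.out.println(average(5, correctCounts, 1));
        System.out.println(average(5, correctCounts, 5));
    }
}
```

With $n=5$, the estimate for $k=1$ reduces to $c/5$, and for $k=5$ it is 1 whenever at least one of the five samples is correct and 0 otherwise.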

Similarly, we introduce Completion@$k$ and Compilation@$k$. Completion@$k$ represents the rate of the designated TODO comments being completed in the generated code. Compilation@$k$ represents the rate of the modified projects (replacing parts of the canonical solution at different granularities) being completed and successfully compiled. Pass@$k$ represents the rate of the modified projects being completed, compiled, and passing the test cases related to a specific class (class-wise granularity) or a specific test case with its corresponding modified classes (test-wise granularity).

3.3. Prompt Design

The prompt used for LLM generation is shown in Listing 2. Following the common practice of prompting LLMs (wiz, 2023; Du et al., 2023), the prompt template consists of two parts: a system message to initialize the model, and an instruction to state the purpose of the task. In Listing 2, ${$\cdot$} is a placeholder: ${context} denotes the context (i.e., one of the three context settings introduced in Section 3.1.1), and ${class} denotes the class to be completed.

3.4. Studied Large Language Models

The studied LLMs are listed in Table 4. We selected five state-of-the-art LLMs that have been widely explored in code generation tasks. We focused on recent LLMs (i.e., released after 2022, following the settings in (Du et al., 2023)) with more than 6B parameters to achieve sufficient efficacy. We considered the instruction-tuned versions of LLMs because we need their instruction-following ability. In particular, we selected WizardCoder (wiz, 2023) because it performs better than its base model, StarCoder (sta, 2023), in multiple coding tasks. We chose Phind (phi, 2023a) over CodeLlama (Roziere et al., 2023) for the same reason. We also chose two versions of DeepSeek because they are ranked at the top of the leaderboard. Finally, we included ChatGPT-3.5 because of its popularity and efficacy. The model size of the studied LLMs was at most 34B, limited by our computational resources.

Table 4. Studied Large Language Models

Base Model	Model	Size	Time
StarCoder (Li et al., 2023)	WizardCoder-15B-V1.0 (Luo et al., 2023)	15B	June, 2023
DeepSeek (DeepSeek-AI, 2024)	deepseek-coder-6.7b-instruct (Dee, 2023b)	6.7B	Sep, 2023
DeepSeek (DeepSeek-AI, 2024)	deepseek-coder-33b-instruct (Dee, 2023a)	33B	Nov, 2023
CodeLlama (Roziere et al., 2023)	Phind-CodeLlama-34B-v2 (phi, 2023b)	34B	Aug, 2023
–	gpt-3.5-turbo-1106	–	Nov, 2022

## System Message
You are a helpful Java programmer who writes the project based
on the following context. Java is a high-level, class-based,
object-oriented programming language designed to have as few
implementation dependencies as possible.

## Instruction
‘‘‘java
${context}
‘‘‘

Complete the code and give the completed class
‘‘‘java
${class}
‘‘‘

Listing 2: Prompt Template used in the Experiment
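Filling the two placeholders of the template can be sketched as below. This is an illustrative sketch only: `PromptSketch`, `INSTRUCTION`, and `fill` are hypothetical names, the example context and class skeleton in `main` are made up, the system message is omitted, and the quote marks around `java` in the listing are assumed to be Markdown code fences (three backticks).

```java
class PromptSketch {
    // Three backticks, built at runtime so this example does not embed a literal fence.
    static final String FENCE = "`".repeat(3);

    // Instruction part of the template, mirroring Listing 2.
    static final String INSTRUCTION = String.join("\n",
            FENCE + "java",
            "${context}",
            FENCE,
            "",
            "Complete the code and give the completed class",
            FENCE + "java",
            "${class}",
            FENCE);

    // Substitute the two placeholders with the chosen context and the class skeleton.
    static String fill(String context, String classSkeleton) {
        return INSTRUCTION
                .replace("${context}", context)
                .replace("${class}", classSkeleton);
    }

    public static void main(String[] args) {
        // Hypothetical selected context (signatures only) and class skeleton.
        String context = "interface Shape { double area(); }";
        String skeleton = "class Circle implements Shape { /* TODO */ }";
        System.out.println(fill(context, skeleton));
    }
}
```

Only the `${context}` argument changes across the maximum, minimum, and selected context settings; the rest of the instruction stays fixed.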

4. Evaluation

We used nucleus sampling (Holtzman et al., 2020) in line with recent works (Du et al., 2023; Yu et al., 2023; Ouyang et al., 2023; Cao et al., 2024), where five solution samples were randomly generated with a temperature of 0.2 (Chen et al., 2021). The experiments were conducted on a server with two NVIDIA RTX 6000 Ada GPUs, each with 48GB of graphics memory.

The research questions (RQs) were designed as follows:

• RQ1. Overall Performance. We first showed the overall performance of the studied LLMs on JavaBench. We used the selected context setting and exercised three synthesis strategies (Section 3.1.2) to generate the entire project. The comprehensive results were reported with three metrics at two granularities.
• RQ2. Context Selection. The context is an important factor in LLMs’ performance, so we iterated over the three context settings (Section 3.1.1) and observed the corresponding impacts.
• RQ3. Incremental Strategies. To synthesize methods in one class, we explored whether the order of synthesizing methods in the class matters.
• RQ4. Bad Case Analysis. We analyzed five bad cases that failed to compile or pass the test cases due to various issues and identified the weaknesses of LLMs in Java code generation.

4.1. RQ1: Overall Performance

Table 5. RQ1 – Overall Results on JavaBench

						Compilation@1 (%)								Pass@1 (%)
		Completion@1 (%)				Class-wise				Test-wise				Class-wise				Test-wise
Strategy	Model	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4
Holistic	WizardCoder-15B-V1.0	98.00	70.00	94.29	80.00	64.00	58.57	71.43	37.14	17.50	0.00	31.76	5.71	63.99	56.87	70.19	36.14	16.51	0.00	31.40	5.03
Holistic	deepseek-coder-6.7b-instruct	82.00	90.00	100.00	100.00	64.00	71.43	82.86	71.43	15.00	0.00	70.59	54.29	62.67	70.83	82.55	62.21	14.29	0.00	70.59	10.55
Holistic	deepseek-coder-33b-instruct	100.00	95.71	100.00	100.00	68.00	81.43	77.14	85.71	60.00	18.18	29.41	71.43	68.00	80.49	76.78	82.04	60.00	8.74	29.41	34.34
Holistic	Phind-CodeLlama-34B-v2	96.00	88.57	91.43	80.00	86.00	68.57	80.00	57.14	80.00	0.00	62.35	20.00	85.81	67.54	79.66	54.58	58.06	0.00	62.32	8.42
Holistic	gpt-3.5-turbo-1106	90.00	84.29	100.00	94.29	86.00	77.14	84.29	74.29	75.00	0.00	70.59	17.14	85.87	75.98	83.90	72.24	62.29	0.00	70.59	5.40
Average (Holistic)		91.73				72.33				34.95				70.92				27.40
Independent	WizardCoder-15B-V1.0	74.00	72.86	71.43	31.43	44.00	62.86	47.14	17.14	0.00	0.00	0.00	0.00	44.00	61.17	46.92	16.62	0.00	0.00	0.00	0.00
Independent	deepseek-coder-6.7b-instruct	80.00	90.00	97.14	85.71	70.00	67.14	78.57	62.86	35.00	0.00	70.59	42.86	69.29	66.40	78.57	55.33	31.55	0.00	70.59	12.75
Independent	deepseek-coder-33b-instruct	76.00	78.57	92.86	100.00	66.00	75.71	74.29	74.29	62.50	0.00	37.65	60.00	66.00	73.50	74.26	69.45	56.82	0.00	37.54	21.07
Independent	Phind-CodeLlama-34B-v2	74.00	87.14	68.57	71.43	64.00	68.57	61.43	48.57	12.50	0.00	25.88	11.43	63.89	67.38	61.30	46.96	12.50	0.00	25.88	2.71
Independent	gpt-3.5-turbo-1106	90.00	88.57	92.86	71.43	78.00	81.43	72.86	42.86	12.50	0.00	64.71	0.00	77.80	80.80	72.86	42.64	12.50	0.00	64.71	0.00
Average (Independent)		79.70				62.89				21.78				61.76				17.43
Incremental	WizardCoder-15B-V1.0	70.00	74.29	48.57	28.57	46.00	51.43	35.71	17.14	0.00	0.00	0.00	0.00	45.31	50.84	35.68	16.62	0.00	0.00	0.00	0.00
Incremental	deepseek-coder-6.7b-instruct	72.00	90.00	98.57	91.43	68.00	72.86	82.86	68.57	0.00	0.00	70.59	34.29	67.51	71.89	82.15	64.30	0.00	0.00	70.59	8.12
Incremental	deepseek-coder-33b-instruct	68.00	77.14	92.86	85.71	54.00	74.29	71.43	54.29	55.00	0.00	14.12	0.00	54.00	73.99	71.38	49.81	49.08	0.00	14.02	0.00
Incremental	Phind-CodeLlama-34B-v2	72.00	87.14	67.14	54.29	62.00	72.86	60.00	42.86	12.50	0.00	22.35	5.71	61.99	72.05	59.14	41.01	12.50	0.00	21.87	2.53
Incremental	gpt-3.5-turbo-1106	72.00	85.71	90.00	91.43	62.00	80.00	71.43	68.57	10.00	0.00	0.00	25.71	61.50	79.80	71.43	64.84	10.00	0.00	0.00	8.10
Average (Incremental)		75.84				60.81				12.51				59.76				9.84
Average		82.42				65.34				23.08				64.15				18.22

The overall performance of the studied LLMs on JavaBench is shown in Table 5. We first fixed the context setting (i.e., the selected context) and exercised three synthesis strategies (i.e., Holistic, Independent, and Incremental) as shown in Figure 2. The three evaluation metrics (Completion@1, Compilation@1, and Pass@1) form the three main dimensions of Table 5. To better visualize the results, we use darker background colors to indicate larger values. From left to right, the color becomes lighter, meaning that the values get smaller as the three metrics get stricter: only completed code has a chance to be compiled (incomplete code is treated as failed when computing Compilation@ $k$ ), and only compiled code has a chance to be evaluated against test cases (uncompilable code is treated as failed when computing Pass@ $k$ ).
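The gating between the three metrics can be illustrated with a short sketch. It assumes a hypothetical `Sample` record holding three boolean outcomes per generated sample; the names are illustrative and are not taken from the JavaBench artifact. A sample counts toward compilation only if it was completed, and toward passing only if it was also compiled, so the three rates can only decrease from left to right.

```java
import java.util.List;

// Illustrative sketch of the progressive metrics; names are assumptions.
final class ProgressiveMetrics {
    /** Outcome of one generated sample. */
    record Sample(boolean completed, boolean compiled, boolean passed) {}

    /** Percentage of samples in which every method body was filled in. */
    static double completionRate(List<Sample> samples) {
        return 100.0 * samples.stream().filter(Sample::completed).count() / samples.size();
    }

    /** Incomplete samples count as compilation failures. */
    static double compilationRate(List<Sample> samples) {
        return 100.0 * samples.stream()
                .filter(s -> s.completed() && s.compiled()).count() / samples.size();
    }

    /** Uncompilable (or incomplete) samples count as test failures. */
    static double passRate(List<Sample> samples) {
        return 100.0 * samples.stream()
                .filter(s -> s.completed() && s.compiled() && s.passed()).count() / samples.size();
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample(true, true, true),
                new Sample(true, true, false),
                new Sample(true, false, false),
                new Sample(false, false, false));
        System.out.printf("Completion:  %.2f%%%n", completionRate(samples));
        System.out.printf("Compilation: %.2f%%%n", compilationRate(samples));
        System.out.printf("Pass:        %.2f%%%n", passRate(samples));
    }
}
```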

Generally, the average Completion@1 of all LLMs on JavaBench was 82.42%, with variances when different synthesis strategies were applied. Considering a finer granularity (class-wise), the average Compilation@1 and Pass@1 were around 65% (65.34% and 64.15%, respectively). The best scores were achieved using the holistic strategy with 91.73% Completion@1, 72.33% Compilation@1, and 70.92% Pass@1. Among all LLMs, DeepSeek-Coder-33b performed the best, followed by gpt-3.5-turbo and DeepSeek-Coder-6.7b.

Finding 1: The best average of 91.73%, 72.33%, and 70.92% Completion@1, Compilation@1, and Pass@1, respectively, could be achieved on JavaBench over the studied LLMs. The Top-3 performing LLMs were DeepSeek-Coder-33b, gpt-3.5-turbo, and DeepSeek-Coder-6.7b among the studied LLMs.

Comparing the synthesis strategies, holistic synthesis was generally better than independent and incremental synthesis among all LLMs, and the declines could be significant in some cases. For example, when switching from holistic to incremental synthesis on P4, WizardCoder dropped 51.43% (80.00% - 28.57%) in Completion@1, 20.00% (37.14% - 17.14%) in Compilation@1, and 19.52% (36.14% - 16.62%) in Pass@1. Similar drops were observed for other LLMs when switching from holistic to independent or incremental synthesis. Although there were occasional cases where incremental or independent synthesis brought improvements, those were sporadic, and the improvement was subtle, e.g., Completion@1 of WizardCoder improved by 4.29% (74.29% - 70.00%) on P2. This observation differs from that of previous work (Du et al., 2023), which found that open-sourced LLMs performed better using independent synthesis than holistic synthesis. This could be because the two studies used different programming languages (Python vs. Java) and code granularities (class-level vs. project-level).

Finding 2: Among the three synthesis strategies, holistic synthesis yielded a better performance across all LLMs against three metrics at two granularities on JavaBench.

Necessity of Finer-grained Evaluation Granularity. LLMs performed similarly on the four projects (P1-P4), with P1 and P3 slightly better than P2 and P4. In particular, the class-wise scores were always better than the test-wise scores, with a gap of up to 49.92% (= 59.76% - 9.84%). This was in line with our claim in Section 3.2.1: a finer granularity can capture more nuanced successful cases. We also calculated Pass@1 at the project-wise granularity and found that none of the projects could be correctly completed, yielding all-zero results under all settings.

Finding 3: The finer granularities (class-/test-wise) can capture subtle success in performance, enabling more distinguishable results. Otherwise, the subtle success can be shadowed by other failures, leading to all-0 results under all settings.
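A small sketch can show why a coarse granularity hides partial success. It makes a simplifying assumption that is not taken from Section 3.2.1: a test-wise score is the percentage of test cases that pass, and a project-wise score is all-or-nothing. The class and method names are hypothetical.

```java
// Illustrative sketch; the scoring rules below are simplifying assumptions.
final class Granularity {
    /** Percentage of test cases that pass (assumed test-wise score). */
    static double testWise(boolean[] testResults) {
        int passed = 0;
        for (boolean ok : testResults) {
            if (ok) passed++;
        }
        return 100.0 * passed / testResults.length;
    }

    /** 100 only if every test case passes, otherwise 0 (assumed project-wise score). */
    static double projectWise(boolean[] testResults) {
        for (boolean ok : testResults) {
            if (!ok) return 0.0;
        }
        return 100.0;
    }

    public static void main(String[] args) {
        boolean[] results = {true, true, false, true}; // one failing test
        System.out.println("test-wise:    " + testWise(results));
        System.out.println("project-wise: " + projectWise(results));
    }
}
```

Under these assumptions, a single failing test is enough to zero out the project-wise score, while the test-wise score still credits the tests that pass.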

In addition, we increased $k$ from 1 to 5 to investigate the improvements brought by more trials. The detailed experiment results are omitted due to space limits and can be found on our website (Section 8). Overall, the average Completion/Compilation/Pass@5 under the holistic strategy across all LLMs are 97.21% (+5.48%), 84.43% (+12.10%), and 84.43% (+13.51%) at the class-wise granularity, and 94.21% (+5.48%), 51.23% (+16.28%), and 48.24% (+20.84%) at the test-wise granularity. The Pass@5 at the project-wise granularity is still all zeros under all settings.

Finding 4: Increasing $k$ yields further improvement. The best average test-wise Pass@5 in JavaBench is 48.24%. Compared with 90.93% achieved by undergraduate students in project-wise evaluation (Section 2.2.3), LLMs’ capability in Java project-level programming still has much room to improve.
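For reference, the sketch below implements the unbiased pass@$k$ estimator introduced by Chen et al. (2021), $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of samples and $c$ the number of correct ones. Whether JavaBench computes its @$k$ metrics exactly this way is defined in the metric section rather than here, so treat this as an assumption; the class name `PassAtK` is hypothetical.

```java
// Illustrative sketch of the pass@k estimator of Chen et al. (2021).
final class PassAtK {
    /**
     * Estimates pass@k from n samples of which c are correct:
     * 1 - C(n - c, k) / C(n, k), computed as a running product for stability.
     */
    static double estimate(int n, int c, int k) {
        if (k < 1 || k > n || c < 0 || c > n) {
            throw new IllegalArgumentException("require 1 <= k <= n and 0 <= c <= n");
        }
        if (n - c < k) {
            return 1.0; // every size-k subset contains at least one correct sample
        }
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    public static void main(String[] args) {
        // With five samples and one correct sample:
        System.out.println(estimate(5, 1, 1));
        System.out.println(estimate(5, 1, 5));
    }
}
```

With $n = 5$, pass@1 reduces to $c/n$, and pass@5 is 1 as soon as any of the five samples is correct, which is why moving from $k = 1$ to $k = 5$ can only raise the scores.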

4.2. RQ2: Impact of Context Selection

Table 6. RQ2 – Performance Variance Over Different Context Settings.

						Compilation@1 (%)								Pass@1 (%)
		Completion@1 (%)				Class-wise				Test-wise				Class-wise				Test-wise
Context	Model	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4	P1	P2	P3	P4
Selected	WizardCoder-15B-V1.0	98.00	70.00	94.29	80.00	64.00	58.57	71.43	37.14	17.50	0.00	31.76	5.71	63.99	56.87	70.19	36.14	16.51	0.00	31.40	5.03
Selected	deepseek-coder-6.7b-instruct	82.00	90.00	100.00	100.00	64.00	71.43	82.86	71.43	15.00	0.00	70.59	54.29	62.67	70.83	82.55	62.21	14.29	0.00	70.59	10.55
Selected	deepseek-coder-33b-instruct	100.00	95.71	100.00	100.00	68.00	81.43	77.14	85.71	60.00	18.18	29.41	71.43	68.00	80.49	76.78	82.04	60.00	8.74	29.41	34.34
Selected	Phind-CodeLlama-34B-v2	96.00	88.57	91.43	80.00	86.00	68.57	80.00	57.14	80.00	0.00	62.35	20.00	85.81	67.54	79.66	54.58	58.06	0.00	62.32	8.42
Selected	gpt-3.5-turbo-1106	90.00	84.29	100.00	94.29	86.00	77.14	84.29	74.29	75.00	0.00	70.59	17.14	85.87	75.98	83.90	72.24	62.29	0.00	70.59	5.40
Average (Selected)		91.73				72.33				34.95				70.92				27.40
Maximum	WizardCoder-15B-V1.0	98.00	70.00	88.57	54.29	64.00	58.57	58.57	14.29	37.50	0.00	21.18	0.00	63.97	57.66	57.85	13.29	34.12	0.00	21.18	0.00
Maximum	deepseek-coder-6.7b-instruct	94.00	91.43	100.00	88.57	80.00	70.00	71.43	80.00	75.00	0.00	29.41	68.57	80.00	68.38	71.24	68.06	69.12	0.00	29.41	13.39
Maximum	deepseek-coder-33b-instruct	100.00	95.71	100.00	100.00	74.00	80.00	75.71	68.57	67.50	18.18	45.88	25.71	74.00	79.36	75.63	62.23	67.50	13.64	45.67	10.72
Maximum	Phind-CodeLlama-34B-v2	100.00	92.86	97.14	88.57	76.00	62.86	74.29	54.29	70.00	0.00	58.82	20.00	76.00	61.62	74.09	50.62	64.02	0.00	58.82	7.69
Maximum	gpt-3.5-turbo-1106	94.00	85.71	100.00	94.29	68.00	68.57	64.29	62.86	55.00	0.00	36.47	22.86	67.56	66.07	63.88	59.77	51.47	0.00	36.47	7.78
Average (Maximum)		91.66				66.31				32.60				64.56				26.55
Minimum	WizardCoder-15B-V1.0	100.00	78.57	90.00	77.14	18.00	30.00	48.57	0.00	0.00	0.00	5.88	0.00	18.00	29.22	48.52	0.00	0.00	0.00	5.88	0.00
Minimum	deepseek-coder-6.7b-instruct	80.00	92.86	100.00	85.71	56.00	15.71	57.14	22.86	0.00	0.00	0.00	0.00	55.89	15.69	57.14	19.14	0.00	0.00	0.00	0.00
Minimum	deepseek-coder-33b-instruct	100.00	91.43	98.57	82.86	48.00	22.86	65.71	11.43	12.50	0.00	37.65	0.00	48.00	22.58	64.98	5.21	12.50	0.00	37.43	0.00
Minimum	Phind-CodeLlama-34B-v2	96.00	90.00	97.14	88.57	66.00	47.14	57.14	8.57	7.50	0.00	0.00	0.00	65.69	46.33	57.14	8.02	7.50	0.00	0.00	0.00
Minimum	gpt-3.5-turbo-1106	98.00	85.71	95.71	91.43	66.00	40.00	62.86	20.00	12.50	0.00	0.00	0.00	65.94	39.85	62.86	19.27	12.50	0.00	0.00	0.00
Average (Minimum)		90.99				38.20				3.80				37.47				3.79
Overall Average		91.46				58.95				23.78				57.65				19.25

In RQ1, the context setting is fixed to the selected context, where only the context that is related to the class/function to be generated is fed into LLMs. In RQ2, we consider the other two context settings (i.e., maximum and minimum context). We fix the synthesis strategy as holistic because it performs the best in RQ1. The experiment result of RQ2 is visualized in Table 6.

From Table 6, it is clear that among the three context settings, the selected context yields the best overall results, with 70.92% (class-wise) and 27.40% (test-wise) Pass@1, which echoes the results in Table 5. Though the maximum and minimum contexts achieve similar Completion@1 (i.e., 91.66% and 90.99% compared with 91.73%), their performance on the subsequent metrics (i.e., Compilation@1 and Pass@1) is not as good as that of the selected context.

Finding 5: The selected context is the best setting on JavaBench, resulting in 70.92% (class-wise) Pass@1.

To better understand the size of the total context used by each setting, we visualize the number of characters under each setting (i.e., maximum, minimum and selected context) in Figure 4. Four bars in each group are P1-P4 in JavaBench. Note that we calculate the number of characters instead of the tokens because LLMs utilize different tokenizers, which will affect the counts of tokens. For example, WizardCoder uses GPT2Tokenizer, Phind uses LlamaTokenizer, while DeepSeek uses LlamaTokenizerFast. From Figure 4, it is clear that the number of characters used in Maximum Context is nearly five times that of Minimum Context and more than twice that of Selected Context. Additionally, we can observe that the number of characters for each project is relatively similar, with P2 and P3 having relatively more characters and P4 having fewer.

By combining Table 6 and Figure 4, we can make two observations. First, more context is not always a benefit. For example, a dramatic 54.83% (= 69.12%-14.29%) increase in test-wise Pass@1 is achieved by DeepSeek-6.7b when switching the context from selected to maximum context in P1, while a 23.62% (= 34.34%-10.72%) drop of test-wise Pass@1 in P4 can be observed in DeepSeek-33b. Second, only providing the class to be completed is insufficient without dependencies. We can see from Table 6 that the Pass@1 in the test-wise granularity is almost all zeros under the minimum context setting, meaning that it is almost impossible to generate project-level code that can pass test cases. In addition, the selected context includes only the method signatures (as explained in Section 3.1.1), which turned out to be more effective than the maximum context setting.

Finding 6: Providing too much or too little context has a negative impact on project-level code generation. Including only the method signatures shows promising performance.
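To illustrate what a signature-only context looks like, the sketch below renders the declared methods of a dependency type without their bodies. It uses Java reflection purely for brevity; this is a swapped-in technique, not the way the selected context is built in JavaBench (Section 3.1.1), and the `Piece` class and all helper names are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.stream.Collectors;

// Illustrative sketch; reflection stands in for source-level extraction.
final class SignatureContext {
    /** Renders the declared methods of a type as body-less signatures. */
    static String signaturesOf(Class<?> type) {
        String methods = Arrays.stream(type.getDeclaredMethods())
                .filter(m -> !m.isSynthetic())
                .map(SignatureContext::render)
                .sorted() // reflection does not guarantee an order
                .collect(Collectors.joining("\n"));
        return Modifier.toString(type.getModifiers()) + " class "
                + type.getSimpleName() + " {\n" + methods + "\n}";
    }

    private static String render(Method m) {
        String params = Arrays.stream(m.getParameterTypes())
                .map(Class::getSimpleName)
                .collect(Collectors.joining(", "));
        String mods = Modifier.toString(m.getModifiers());
        return "    " + (mods.isEmpty() ? "" : mods + " ")
                + m.getReturnType().getSimpleName() + " " + m.getName() + "(" + params + ");";
    }

    /** Hypothetical dependency type used only for this demonstration. */
    abstract static class Piece {
        private final String owner;

        Piece(String owner) {
            this.owner = owner;
        }

        public final String getOwner() {
            return owner;
        }

        public abstract int[] getAvailableMoves(int row, int col);
    }

    public static void main(String[] args) {
        System.out.println(signaturesOf(Piece.class));
    }
}
```

The rendered text keeps the information a model needs to call a dependency correctly (names, parameter types, return types, modifiers) while dropping method bodies, which is what keeps this kind of context compact.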

4.3. RQ3: Impact of Incremental Synthesis

In Table 5, we adopted only the sequential order to synthesize functions incrementally. We therefore further explore whether the order of synthesizing functions in the class matters. Due to space limitations, we only show the impact of the order on the Completion/Compilation/Pass@1 and @5 of DeepSeek-Coder-6.7b. Similar observations are made for the other LLMs and can be found in our released artifact.

The results on the four projects in JavaBench are illustrated in Figure 5. The dashed horizontal lines in particular colors show the averages across the four projects. Generally, the three colored bars in each project are similar, with slight variances among projects, meaning that the order of incremental synthesis only slightly affects the final results. Interestingly, as the sampling size $k$ increases from 1 to 5, the advantage of the random order becomes more pronounced. In the lower part of Figure 5, the red dashed line outperforms the other two lines, with an advantage of about 6% (86.07% - 80%) over the reversed order. The reversed order, on the other hand, does not appear to contribute to better results in this experiment.

Finding 7: Synthesizing programs incrementally in a random or sequential order could yield at most a 6% improvement over a reversed order on DeepSeek-Coder-6.7b. A similar conclusion was also observed for the other studied LLMs.
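The three orders compared in this experiment can be sketched as follows. The method names, the `Order` enum, and the fixed random seed are hypothetical; the sketch only shows how the sequence in which methods are handed to the model might be arranged, not how the model is invoked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the three synthesis orders; names are assumptions.
final class SynthesisOrder {
    enum Order { SEQUENTIAL, REVERSED, RANDOM }

    /** Returns the methods of a class in the order they would be synthesized. */
    static List<String> arrange(List<String> methods, Order order, long seed) {
        List<String> result = new ArrayList<>(methods); // keep the input untouched
        switch (order) {
            case REVERSED -> Collections.reverse(result);
            case RANDOM -> Collections.shuffle(result, new Random(seed));
            case SEQUENTIAL -> { } // declaration order, nothing to do
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> methods = List.of("push", "pop", "getUndoCount");
        for (Order order : Order.values()) {
            System.out.println(order + ": " + arrange(methods, order, 42L));
        }
    }
}
```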

4.4. RQ4: Bad Case Analysis

This section discusses failures at the completion, compilation, and test stages, shows the distribution of runtime errors, and analyzes five bad cases that failed to compile or to pass the test suites.

4.4.1. Completion Errors

A completion error happens when LLMs ignore the code to be completed and leave the method body blank. Similar observations (i.e., LLMs ignore the information in the middle of long contexts) were also made in other communities (Liu et al., 2024). From Table 5 and Table 6 under ‘Completion@1’, we can see that among the studied LLMs, completion errors were more commonly observed in WizardCoder than in others.

4.4.2. Compilation Errors

Compilation errors indicate that LLMs have an insufficient understanding of the syntactic and semantic constraints provided by the context. Since Java’s compilation errors are not explicitly categorized, we can only roughly determine the cause of each compilation error in the code through manual judgment. Below, we present three typical errors that occur frequently and are related to the object-oriented programming paradigm.

Compilation Error 1: Inheritance-related Error. An abstract class is meant to be inherited by other classes and cannot be instantiated. However, in Listing 3, the Move class is defined as an abstract class in line 1, but it is instantiated in line 12, resulting in a compilation error.

 1   public abstract class Move extends Action {}
 2   public static final class Down extends Move {}

 4   public class TerminalInputEngine implements InputEngine {
 5     @Override
 6     public Action fetchAction() {
 7       String actionName = ...;
 8       switch (actionName) {
 9         ...
10         case "move":
11           final Move.Direction direction = Move.Direction.valueOf(args.toUpperCase());
12 -         return new Move(initiator, direction);
13 +         return new Move.Up(initiator);
14       }
15     }
16   }

Listing 3: Compilation Error 1. Inheritance-related Error

Compilation Error 2: Encapsulation-related Error. Listing 4 shows an error caused by improperly handled encapsulation. The player field defined in line 2 is private in class Piece and cannot be accessed directly by its subclasses. LLMs ignore the principle of encapsulation and access the player field directly in line 12, causing this compilation error; the fix in line 13 goes through the public getter instead.

 1   public abstract class Piece {
 2     private final Player player;

 4     public final Player getPlayer() {
 5       return this.player;
 6     }

 9   public class Knight extends Piece {
10     public Move[] getAvailableMoves(Game game, Place source) {
11       ...
12 -     if (this.player.validateMove(game, move)) {
13 +     if (this.getPlayer().validateMove(game, move)) {
14         ...
15       }
16     }
17   }

Listing 4: Compilation Error 2. Encapsulation-related Error

Compilation Error 3: Illegal Inheritance. Listing 5 shows a case where LLMs fail to resolve the inheritance relationships. The class Player defined in line 3 has no inheritance relationship with the Cell class defined in line 1, yet the LLM mistakenly assumes that the variable cell could be an instance of Player and tests it with the instanceof keyword in line 10, causing the compilation to fail.

 1   public abstract class Cell implements BoardElement {}
 2   public abstract class Entity implements BoardElement {}
 3   public final class Player extends Entity {}
 4   public class EntityCell extends Cell {}

 6   public class GameBoard {
 7     public GameBoard(...) {
 8       ...
 9       Cell cell = ...;
10 -     if (cell instanceof Player) {}
11 +     if (cell instanceof EntityCell entityCell) {
12 +       if (entityCell.getEntity() instanceof Player) {}
13 +     }
14       ...
15     }
16   }

Listing 5: Compilation Error 3. Illegal Inheritance

4.4.3. Test-Failing Errors

The failure of test cases is often accompanied by exceptions thrown during execution. We automatically parsed the error logs and present the exception distribution among LLMs in Figure 6. Each stacked bar shows the exceptions thrown while running the projects synthesized by the corresponding LLM. Different colors represent different types of exceptions. In total, there are 20 types of test-failing errors. Among them, AssertionFailedError happens the most frequently (50.75%), followed by IllegalArgumentException (25.88%).

Finding 8: AssertionFailedError and IllegalArgumentException are the Top-2 dominating contributors to test failures, accounting for 76.63% of test-failing errors.
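A tally of this kind can be sketched with a regular expression over the log lines. The log format, the sample lines, and the class name `ExceptionTally` below are assumptions made for illustration; the actual log parser is part of the released artifact and may work differently.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; the log format is an assumption.
final class ExceptionTally {
    // A possibly package-qualified class name ending in "Exception" or "Error".
    private static final Pattern THROWABLE = Pattern.compile(
            "\\b(?:[a-z_]\\w*\\.)*([A-Z]\\w*(?:Exception|Error))\\b");

    /** Counts the first throwable type mentioned on each log line. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = THROWABLE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines in the style of a JVM test run.
        List<String> log = List.of(
                "org.opentest4j.AssertionFailedError: expected: <3> but was: <4>",
                "    at java.base/java.util.Random.nextInt(Random.java)",
                "java.lang.IllegalArgumentException: bound must be positive",
                "org.opentest4j.AssertionFailedError: expected: <true> but was: <false>");
        tally(log).forEach((type, n) -> System.out.println(type + ": " + n));
    }
}
```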

In particular, AssertionFailedError may indicate that the LLMs did not fully understand the functionality described in the docstring, so the assertions in the test case failed, while IllegalArgumentException mainly suggests a lack of understanding of the code constraints, leading to illegal arguments. In the following, we analyze two representative cases.

Test-failing Error 1: Documentation Non-Following. In Listing 6, the documentation of method getUndoCount in lines 12-14 clearly stated that count is the number of pop() calls. However, the method push generated by LLM also increases count in line 9, mistakenly understanding count as the size of cellStack and violating the documentation.

 1   public class CellStack {
 2     private final Stack<FillableCell> cellStack = new Stack<>();
 3     private int count = 0;

 5     void push(final FillableCell cell) {
 6       cellStack.push(cell);
 7       // count is the number of pop() invoked
 8       // rather than the size of cellStack.
 9 -     count++;
10     }

12     /**
13      * @return Number of undos (i.e. {@link CellStack#pop()}) invoked.
14      */
15     int getUndoCount() {
16       return count;
17     }
18   }

Listing 6: Test Error 1. Documentation Non-Following
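For contrast, a self-contained sketch of a stack that follows the documented contract is given below. It is not the canonical JavaBench solution: the generic class `UndoStack`, the use of `ArrayDeque` instead of `Stack`, and the choice to return `null` (without counting) when popping an empty stack are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch; not the canonical solution of the benchmark project.
final class UndoStack<T> {
    private final Deque<T> cells = new ArrayDeque<>();
    private int count = 0; // number of successful pop() calls, not the stack size

    void push(T cell) {
        cells.push(cell); // pushing must not change the undo count
    }

    /** Removes the most recent cell; assumed to return null when empty. */
    T pop() {
        if (cells.isEmpty()) {
            return null;
        }
        count++;
        return cells.pop();
    }

    /** @return number of undos (i.e. pop() calls that removed a cell) so far. */
    int getUndoCount() {
        return count;
    }

    public static void main(String[] args) {
        UndoStack<String> stack = new UndoStack<>();
        stack.push("a");
        stack.push("b");
        stack.pop();
        System.out.println(stack.getUndoCount());
    }
}
```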

Test-failing Error 2: Trivial Implementation. In this case, the LLM produced a trivial implementation in line 5, constructing and returning a Move array of size zero. This implementation is evidently a placeholder meant solely to pass compilation. When the Random::nextInt method is called in line 12 and receives the parameter (i.e., the length of the Move array), its internal implementation first checks the parameter. Upon finding it to be less than or equal to zero, it throws an IllegalArgumentException.

 1   public class JesonMor extends Game {
 2     @Override
 3     public Move[] getAvailableMoves(Player player) {
 4       // This is a trivial implementation.
 5       return new Move[0];
 6     }

 9   Move[] availableMoves = game.getAvailableMoves(player);
10   // Random::nextInt will throw IllegalArgumentException
11   // if the input n <= 0.
12   int next = new Random().nextInt(availableMoves.length);

Listing 7: Test Error 2. Trivial Implementation
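The failure mode can be reproduced in isolation. The sketch below first triggers the exception with a bound of zero and then shows one defensive way a caller could guard against an empty array; the helper `pick` and the class name are hypothetical and do not appear in the benchmark project.

```java
import java.util.Optional;
import java.util.Random;

// Illustrative sketch; the helper is hypothetical.
final class RandomMovePicker {
    /** Picks a random element, or returns an empty Optional for an empty array. */
    static <T> Optional<T> pick(T[] options, Random rng) {
        if (options.length == 0) {
            return Optional.empty(); // avoids calling nextInt with a bound of 0
        }
        return Optional.of(options[rng.nextInt(options.length)]);
    }

    public static void main(String[] args) {
        String[] noMoves = new String[0];
        try {
            new Random().nextInt(noMoves.length); // the bound must be positive
        } catch (IllegalArgumentException e) {
            System.out.println("nextInt rejected the bound: " + e.getMessage());
        }
        System.out.println(pick(noMoves, new Random()).isPresent());
    }
}
```

Such a guard only moves the failure to a more explicit place; the underlying defect in the listing is still the placeholder `getAvailableMoves`.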

5. Threats to validity

This paper has four main threats to validity. First, the threats in benchmark construction: the quality and level of detail of the natural language descriptions for the projects, classes, and methods could affect LLMs’ code generation. To alleviate this threat, we carefully checked the subjects in JavaBench, scanned students’ feedback, and adjusted the descriptions to mitigate confusion and ambiguity. Second, the generalizability to other LLMs. In this study, we studied only five LLMs due to time and hardware limits, so the conclusions may not generalize to other LLMs. Nonetheless, we selected the SOTA LLMs in different families as representatives (Section 3.4). Third, the performance variance brought by prompt engineering. Since choosing one best-performing prompt is challenging (Liu et al., 2023b) and a well-designed prompt could yield better performance, we followed the common practice of prompting LLMs (wiz, 2023) to design the prompt template. Finally, the possible data contamination (Golchin and Surdeanu, 2023) of JavaBench. LLMs that have seen the canonical code during training could obtain exaggerated scores. However, the projects in JavaBench were kept confidential (Section 2.2.1), so this concern is minor.

6. Related Work

6.1. Benchmarks for Code Generation

6.1.1. Programming Language

Most benchmarks target Python (Table 1), which is dynamic and rich in handy libraries, making it a good fit for building applications. In contrast, Java is static, object-oriented, and rich in design patterns, making it ideal for constructing large projects. Despite this very different style from Python, Java, the most popular static language, receives far less attention in code generation benchmarks, especially at the project level. JavaBench is thus proposed to bridge the gap.

6.1.2. Metric

Similarity measurements such as exact match and BLEU (Papineni et al., 2002) used to be mainstream metrics for code generation, as in other NLP tasks. Recently, it was found that such similarity measurement has a weak correlation with semantic correctness (Chen et al., 2021). More and more benchmarks (Chen et al., 2021; Du et al., 2023; Liu et al., 2023a) adopt execution-based correctness such as Pass@ $k$ as the major metric. Previous project-level benchmarks (Zhang et al., 2023; Li et al., 2024) adopt a pass rate against test cases to assess LLM’s capability to generate codes. JavaBench assesses LLMs with progressive metrics (completion to compilation to testing) and at finer granularities (class-wise and test-wise).

6.1.3. Context

The context of code generation benchmarks varies from a single line of code to an entire project. Function-level benchmarks such as HumanEval (Chen et al., 2021) remain the most commonly adopted because they are easy to evaluate and offer a certain degree of discrimination among models. As LLMs get more powerful, recent benchmarks target more complex contexts, such as class-level (Du et al., 2023) and project-level (Zhang et al., 2023; Li et al., 2024; Wang et al., 2024). Different from these benchmarks, whose subjects were sourced from GitHub, JavaBench is built from entry-level projects that are carefully designed to assess students’ coding ability. JavaBench thus provides a straightforward comparison of LLMs’ programming capability against humans.

6.2. LLM-based Code Generation

The code generation technique had a leap enabled by LLMs. In particular, LLMs can handle a much more complex context (e.g., class or project) than symbolic-based approaches (Gulwani, 2011; Feng et al., 2017; Shi et al., 2019) or small-sized language models (Feng et al., 2020; Wang et al., 2021).

6.2.1. Retrieval-Augmented Generation

The code generation technique most relevant to our work is Retrieval-Augmented Generation (RAG). A code project has a long context (documentation and code), which usually exceeds LLMs’ context limit or leads to performance degradation (Liu et al., 2024). Therefore, a long context is usually decomposed into smaller chunks, and only the most related chunks are retrieved based on the problem and included in the prompt as context. For example, RepoCoder (Zhang et al., 2023) uses a sliding window to decompose a project and performs retrieval iteratively with LLM-generated code. Shrivastava et al. (Shrivastava et al., 2023) decompose a project more finely into different kinds of context (e.g., imported classes, child classes) and compose the different pieces of information in the prompt. JavaBench adopts a simple but effective selected context setting that only includes the signatures of dependent types and can be adopted to complement RAG.
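The chunk-and-retrieve idea can be sketched in a few lines. This is a generic illustration under stated assumptions, not the implementation of RepoCoder or of any other cited system: the window size, the stride, the token-set Jaccard similarity used for ranking, and all names are chosen here for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of sliding-window chunking plus similarity-based retrieval.
final class SlidingWindowRetriever {
    /** Splits lines into overlapping or adjacent windows of at most `size` lines. */
    static List<String> windows(List<String> lines, int size, int stride) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += stride) {
            int end = Math.min(start + size, lines.size());
            chunks.add(String.join("\n", lines.subList(start, end)));
            if (end == lines.size()) break;
        }
        return chunks;
    }

    /** Lower-cased word tokens of a text. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) result.add(t);
        }
        return result;
    }

    /** Jaccard similarity of two token sets (assumed ranking function). */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    /** Returns the k chunks most similar to the query, best first. */
    static List<String> topK(String query, List<String> chunks, int k) {
        Set<String> q = tokens(query);
        return chunks.stream()
                .sorted(Comparator.comparingDouble((String c) -> -jaccard(q, tokens(c))))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "class Piece {", "Player getPlayer()", "}",
                "class Board {", "int size()", "}");
        List<String> chunks = windows(lines, 3, 3);
        System.out.println(topK("getPlayer of Piece", chunks, 1));
    }
}
```

A signature-only context, as used in JavaBench, could serve as the content of such chunks, which is one way the two approaches might complement each other.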

7. Conclusion

In this paper, we introduce JavaBench, a project-level benchmark that addresses the scarcity of, and the need for, high-quality Java benchmarks for LLM evaluation. Extensive experiments are conducted, covering the context settings, synthesis strategies, evaluation granularities, and evaluation metrics.

8. Data Availability

We released the implementation and all associated publicly available data at https://github.com/java-bench/JavaBench. We also release a leaderboard and invite model developers to participate and test their models against JavaBench at https://java-bench.github.io/leaderboard.html.

References

sta (2023) 2023. bigcode/starcoderbase. https://huggingface.co/bigcode/starcoderbase.
cod (2023) 2023. CoderEval/CoderEval. https://github.com/CoderEval/CoderEval/commits/main/CoderEval4Python.json.
Dee (2023a) 2023a. Deepseek-33b. https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct.
Dee (2023b) 2023b. Deepseek-6.7b. https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct.
gpt (2023a) 2023a. GPT-3.5-turbo Model Availability. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-35-turbo-model-availability.
gpt (2023b) 2023b. GPT-4 and GPT-4-turbo Preview. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-preview.
phi (2023a) 2023a. Phind/Phind-CodeLlama-34B-v2. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/commit/29c3be6006297754f344ba05678c038b0b77f6c0.
phi (2023b) 2023b. Phind/Phind-CodeLlama-34B-v2. https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
wiz (2023) 2023. WizardLM/WizardCoder. https://huggingface.co/WizardLM/WizardCoder-15B-V1.0.
git (2024) 2024. GitHub 2.0 Pull Requests Ranking over Programming Languages. https://madnight.github.io/githut/#/pull_requests/2024/1.
Alshahwan et al. (2010) Nadia Alshahwan, Yue Jia, Kiran Lakhotia, Gordon Fraser, David Shuler, and Paolo Tonella. 2010. AUTOMOCK: automated synthesis of a mock environment for test case generation. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 2357–2367. https://doi.org/10.18653/v1/N19-1245
Arcuri et al. (2014) Andrea Arcuri, Gordon Fraser, and Juan Pablo Galeotti. 2014. Automated unit test generation for classes with environment dependencies. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. 79–90.
Arcuri et al. (2017) Andrea Arcuri, Gordon Fraser, and René Just. 2017. Private api access and functional mocking in automated unit test generation. In 2017 IEEE international conference on software testing, verification and validation (ICST). IEEE, 126–137.
Athiwaratkun et al. (2022) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, and Bing Xiang. 2022. Multi-lingual Evaluation of Code Generation Models. (2022). https://doi.org/10.48550/ARXIV.2210.14868
Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
Belzner et al. (2023) Lenz Belzner, Thomas Gabor, and Martin Wirsing. 2023. Large language model assisted software engineering: prospects, challenges, and a case study. In International Conference on Bridging the Gap between AI and Reality. Springer, 355–374.
bench authors (2023) BIG bench authors. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=uyTL5Bvosj
Bieri (1955) James Bieri. 1955. Cognitive complexity-simplicity and predictive behavior. The Journal of Abnormal and Social Psychology 51, 2 (1955), 263.
Cao et al. (2024) Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermeasures in Code Language Model. arXiv preprint arXiv:2403.16898 (2024).
Cassano et al. (2022) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2022. A scalable and extensible approach to benchmarking nl2code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022).
Chandel et al. (2022) Shubham Chandel, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2022. Training and evaluating a jupyter notebook data science assistant. arXiv preprint arXiv:2201.12901 (2022).
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:cs.LG/2107.03374
Cheng et al. (2024) Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion. http://arxiv.org/abs/2405.19782 arXiv:2405.19782 [cs].
Coleman et al. (1994) D. Coleman, D. Ash, B. Lowther, and P. Oman. 1994. Using metrics to evaluate software system maintainability. Computer 27, 8 (1994), 44–49. https://doi.org/10.1109/2.303623
DeepSeek-AI (2024) DeepSeek-AI. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954 (2024). https://github.com/deepseek-ai/DeepSeek-LLM
Ding et al. (2023) Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2023. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://arxiv.org/pdf/2310.11248.pdf
Du et al. (2023) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. arXiv:cs.CL/2308.01861
Feng et al. (2017) Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017. Component-based synthesis for complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 599–612. https://doi.org/10.1145/3009837.3009851
Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
Galler et al. (2010) Stefan J Galler, Andreas Maller, and Franz Wotawa. 2010. Automatically extracting mock object behavior from design by contract™ specification for test data generation. In Proceedings of the 5th Workshop on Automation of Software Test. 43–50.
Gao et al. (2023) Shuzheng Gao, Cuiyun Gao, Yulan He, Jichuan Zeng, Lunyiu Nie, Xin Xia, and Michael Lyu. 2023. Code structure–guided transformer for source code summarization. ACM Transactions on Software Engineering and Methodology 32, 1 (2023), 1–32.
Golchin and Surdeanu (2023) Shahriar Golchin and Mihai Surdeanu. 2023. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. CoRR abs/2308.08493 (2023). https://doi.org/10.48550/arXiv.2308.08493 arXiv:2308.08493
Gu et al. (2024) Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065 (2024).
Gulwani (2011) Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, Thomas Ball and Mooly Sagiv (Eds.). ACM, 317–330. https://doi.org/10.1145/1926385.1926423
Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. NeurIPS (2021).
Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
Iyer et al. (2018) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 1643–1652. https://doi.org/10.18653/v1/D18-1192
Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.), Vol. 202. PMLR, 18319–18345.
Li et al. (2024) Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. 2024. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. http://arxiv.org/abs/2405.19856 arXiv:2405.19856 [cs].
Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! (2023). arXiv:cs.CL/2305.06161
Liu et al. (2023a) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023a. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
Liu et al. (2024) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.
Liu et al. (2023b) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023b. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9, Article 195 (Jan. 2023), 35 pages. https://doi.org/10.1145/3560815
Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
Oman and Hagemeister (1992) P. Oman and J. Hagemeister. 1992. Metrics for assessing a software system’s maintainability. In Proceedings Conference on Software Maintenance 1992. 337–344. https://doi.org/10.1109/ICSM.1992.242525
Ouyang et al. (2023) Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023).
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Pierre Isabelle, Eugene Charniak, and Dekang Lin (Eds.). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
Schäfer et al. (2024) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50, 1 (2024), 85–105. https://doi.org/10.1109/TSE.2023.3334955
Shi et al. (2019) Kensen Shi, Jacob Steinhardt, and Percy Liang. 2019. FrAngel: component-based synthesis with control structures. Proc. ACM Program. Lang. 3, POPL (2019), 73:1–73:29. https://doi.org/10.1145/3290386
Shrivastava et al. (2023) Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-Level Prompt Generation for Large Language Models of Code. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.), Vol. 202. PMLR, 31693–31715. https://proceedings.mlr.press/v202/shrivastava23a.html
Wang et al. (2024) Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, and Dacheng Tao. 2024. OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models. arXiv preprint arXiv:2401.06628 (2024).
Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
Wang et al. (2022) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481 (2022).
Yin et al. (2018) Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In International Conference on Mining Software Repositories (MSR). ACM, 476–486. https://doi.org/10.1145/3196398.3196408
Yu et al. (2023) Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Tao Xie, and Qianxiang Wang. 2023. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. arXiv preprint arXiv:2302.00288 (2023).
Zan et al. (2022a) Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022a. When language model meets private library. arXiv preprint arXiv:2210.17236 (2022).
Zan et al. (2022b) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022b. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888 (2022).
Zan et al. (2022c) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022c. Large language models meet NL2Code: A survey. arXiv preprint arXiv:2212.09420 (2022).
Zhang et al. (2023) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2471–2484. https://doi.org/10.18653/v1/2023.emnlp-main.151
Zheng et al. (2023b) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023b. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 5673–5684. https://doi.org/10.1145/3580305.3599790
Zheng et al. (2023a) Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023a. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
Zhu et al. (2021) Hengcheng Zhu, Lili Wei, Ming Wen, Yepang Liu, Shing-Chi Cheung, Qin Sheng, and Cui Zhou. 2021. MockSniffer: characterizing and recommending mocking decisions for unit tests. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). Association for Computing Machinery, New York, NY, USA, 436–447. https://doi.org/10.1145/3324884.3416539