A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

Yixi Wu (University of Manitoba, Winnipeg, Canada; wuy8@myumanitoba.ca), Pengfei He (University of Manitoba, Winnipeg, Canada; hep2@myumanitoba.ca), Zehao Wang (Concordia University, Montreal, Canada; w_zeha@encs.concordia.ca), Shaowei Wang (University of Manitoba, Winnipeg, Canada; shaowei.wang@umanitoba.ca), Yuan Tian (Queen’s University, Kingston, Canada; y.tian@queensu.ca), and Tse-Hsun (Peter) Chen (Concordia University, Montreal, Canada; peterc@encs.concordia.ca)
Abstract.

Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation. It includes four metrics to evaluate the generated APIs and code examples: the proportion of incorrect API recommendations for Task 1 and, for Task 2, the proportions of code examples that do not invoke the specified API, that are uncompilable, or that are unexecutable. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework’s effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions while achieving effectiveness in code example generation similar to that of its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.

Code Quality, API recommendation, code generation, hallucination, LLM
ccs: Software and its engineering

1. Introduction

With the technological advancements in AI, various large language models (LLMs) have been developed for code-related tasks (Lu et al., [n. d.]; Chen et al., 2021; Wang et al., 2021; Guo et al., 2022; Luo et al., 2023; Wei et al., 2023; Li et al., 2023), such as GitHub Copilot (Git, [n. d.]) and ChatGPT (OpenAI, 2022). These LLMs have significantly propelled the field of code generation. When used correctly, they can enhance productivity and accelerate software development through various tasks (Peng et al., 2023; Paredes et al., 2023; Dakhel et al., 2023). To address the growing demand for LLM-based code generation, prior studies have devoted substantial effort to evaluating and understanding the effectiveness of LLMs in generating code (Dou et al., 2024; Liu et al., 2024; Spracklen et al., 2024; hum, [n. d.]; Du et al., 2023). For instance, to assess the code generation ability of LLMs, researchers proposed various code generation benchmarks (e.g., HumanEval (hum, [n. d.]) and ClassEval (Du et al., 2023)). These benchmarks rely on a set of pre-defined tests to evaluate the correctness of the generated code.

According to the latest survey on developers’ perspectives regarding code generation tools (Ciniselli et al., 2023), developers frequently expect these tools to incorporate more contextual knowledge, particularly by leveraging specific libraries. They emphasize the importance of API-oriented code generation, where tools should be capable of generating code that invokes APIs from specific libraries. However, the majority of the existing studies (Khoury et al., 2023; Dou et al., 2024; Liu et al., 2024; Spracklen et al., 2024; hum, [n. d.]; Du et al., 2023) focus on the general code generation based on specific functional requirements described in natural language, without specifying APIs. Little work has been done to assess the quality of API-oriented code generated by LLMs. While some research has explored API-oriented code generation for specific libraries (Liu et al., 2023; Zan et al., 2023), these studies typically derive their tasks from Stack Overflow or general code generation benchmarks, which involve APIs from the target libraries (e.g., Pandas). They then manually create test cases to verify if the generated code meets the required functionality. However, a key limitation of these approaches is that the coverage of tested APIs is limited and cannot be scaled to a broader range of libraries due to the manual effort required for test creation and the difficulty of collecting relevant tasks for the target libraries. Hence, a systematic and automated approach is needed to evaluate API-oriented code generation by LLMs for a broad range of libraries. Enabling the assessment of API-oriented code generation can help practitioners evaluate and analyze the code generation capabilities of LLMs for specific libraries, thereby offering new insights and solutions to enhance their performance.

To bridge this gap, we propose a lightweight framework AutoAPIEval, which enables automatic and systematic evaluation of LLMs on API-oriented code generation. AutoAPIEval works with any library that has API documentation. We consider four requirements when designing AutoAPIEval: First, it should be fully automated to enable scalability, unlike existing benchmarks that require manually crafted test cases (hum, [n. d.]; Du et al., 2023). Second, it should be applicable to a wide range of libraries, leveraging easily accessible datasets such as API documentation. Third, the framework’s tasks must be simple enough for LLMs to understand without complex prompt engineering, ensuring consistent benchmarking (Hajipour et al., 2023). Lastly, the framework should mimic how developers interact with LLMs (Hao et al., 2024), helping them identify available APIs and providing examples of working code.

To fulfill the requirements, we design two unit tasks in AutoAPIEval to benchmark LLMs’ ability in API-oriented code generation, using only API documentation as input. We refer to them as “unit tasks” because, similar to unit tests, they are performed independently on each class within a package, allowing for focused evaluation of the LLM’s performance at the class level. The tasks are: 1) API recommendation, and 2) code example generation, evaluated using four specific metrics. Specifically, in Task 1, given a library, we iteratively query the LLM to recommend a list of APIs for each class (i.e., unit) in the library. In Task 2, we query the LLM to generate code examples for a given API. We propose four evaluation metrics to evaluate the APIs and code examples generated by LLMs: the proportion of incorrect APIs for Task 1 and, for Task 2, the proportions of code examples that do not invoke the specified API, that are uncompilable, or that are unexecutable. Our framework enables further analysis to 1) understand errors that occur in the API recommendation and code generation tasks; 2) investigate factors that are associated with the quality of APIs recommended and code examples generated by LLMs; and 3) investigate potential solutions to mitigate errors and improve the quality of APIs and code examples generated by LLMs. For simplicity, we refer to the API recommendation in Task 1 and code example generation in Task 2 as API-oriented code generation below. Similarly, we refer to the APIs and code examples generated by LLMs as API-oriented code.

To illustrate how our framework can be used for further analysis, we conducted a case study on Java Runtime Environment 8 (JRE 8). We collected the API documentation of JRE 8, which comprises 217 packages and 2,397 classes. We selected three popular LLMs for evaluation, including one state-of-the-art closed-source LLM, ChatGPT (OpenAI, 2022), and two open-source LLMs that are trained for code-related tasks, MagiCoder (mag, [n. d.]) and DeepSeek Coder (dee, [n. d.]).

We made the following observations through the case study:

  • Different LLMs vary in performance across different tasks. Compared with MagiCoder and DeepSeek Coder, ChatGPT tends to follow the instructions better and have a lower rate of hallucinations, while producing a similar proportion of uncompilable/unexecutable code examples.

  • For API recommendation, 58.1% to 84.1% of recommended APIs do not exist in the specified library. In code example generation, 39.4% to 54.4% of examples contain errors, with 5.4% to 20.7% missing the specified API and the rest failing to compile or execute. To further understand the errors, we created a taxonomy of the errors that occurred in the API-oriented code generation by LLMs. For the API recommendation task, most errors stem from Factual Fabrication Hallucinations, followed by Instruction Inconsistencies, and Factual Inconsistencies. In code example generation, errors are primarily due to Runtime Errors (e.g., Initialization Error), followed by Hallucinations and Compilation Errors.

  • Factors such as API popularity and model confidence are strongly associated with API-oriented code quality. Using our proposed factors, we built highly accurate classifiers to detect incorrect API recommendations and erroneous code examples (e.g., F1 scores of 0.96 and 0.8 for Task 1 and Task 2 on MagiCoder).

  • Overall, RAG enhances the quality of code generated by LLMs, though its effectiveness varies across different models. For Task 1, even when provided with a list of correct APIs, LLMs still exhibit an error rate of at least 27.9% in their recommendations.

We summarize our contributions below:

  • We proposed an automated and systematic framework to enable the evaluation of LLMs on API-oriented code generation and various further analyses.

  • We conducted a case study on three LLMs and 217 packages from JRE 8 using our framework, uncovering key findings that suggest directions for future research.

  • We release our replication package (anonymous, 2024) to facilitate future research.

2. AutoAPIEval: Framework for Evaluating API-oriented Code Generation

We propose a lightweight framework named AutoAPIEval. This framework enables the automatic and systematic evaluation of LLMs for API-oriented code generation, which is applicable to any specific library and requires only API documentation as input.

We consider four key requirements when designing the framework. First, Fully Automated: The evaluation must be conducted automatically. Without full automation, scaling to new libraries is impractical. Existing benchmarks, such as HumanEval (hum, [n. d.]) and ClassEval (Du et al., 2023), rely on predefined test cases to assess the correctness of generated code. While these benchmarks are meticulously crafted, creating test cases requires substantial manual effort, making it challenging to extend them to new libraries. Second, Easily Accessible Dataset: To ensure our framework’s applicability across a wide range of libraries, the datasets required for evaluation must be readily accessible. Third, Easy to Implement: The tasks in our framework should be straightforward so that LLMs can easily understand the intent of the prompts. Complex tasks often require intricate prompt engineering, which does not support a standardized approach for benchmarking LLMs (Hajipour et al., 2023). Lastly, Mimic Practical Use: The tasks should emulate how developers interact with LLMs for API-oriented code generation. Typically, developers start by querying LLMs about the APIs available in a library and requesting usage examples before diving deeper into specific tasks (Hao et al., 2024).

Figure 1 provides an overview of the proposed framework for assessing the API-oriented code generated by LLMs. To fulfill the four requirements, we design two tasks in AutoAPIEval for evaluating LLMs’ ability in API-oriented code generation: API recommendation and code example generation. We also propose four evaluation metrics, which can be used to evaluate the quality of the recommended APIs for Task 1 and the generated code examples for Task 2, such as the proportion of correctly recommended APIs and the proportion of code examples that are uncompilable or unexecutable (see more details in Section 2.2).

In addition, our framework can facilitate further analysis, including 1) identifying errors in the generated code, 2) determining factors and indicators potentially related to code quality, and 3) evaluating solutions to mitigate errors in the generated code. We demonstrate how to apply our AutoAPIEval to conduct such analyses in Section 3 via a case study.

Figure 1. Framework of AutoAPIEval.

2.1. Tasks

Given a library, in Task 1, the LLM is iteratively prompted to recommend methods for each class in the library. In Task 2, the LLM generates code examples for a specified method in the given library. Note that Task 2 is only applied to methods that were successfully recommended by the LLM in Task 1, as we assume that the LLM has adequate knowledge of these APIs. We elaborate on those two tasks in the following sub-sections.

2.1.1. Task 1: API Recommendation

For Task 1, we test the LLM’s knowledge of the available APIs in a specific library. Therefore, we prompt the LLM to recommend APIs (methods) in a given class in the specific library. As revealed by prior studies, hallucinations, such as generating non-existent APIs, are common in code generation (Kabir et al., 2024; Spracklen et al., 2024). Therefore, we then examine whether the APIs recommended by the LLM actually exist in the class. We design our prompt template for Task 1 as follows.

Prompt Template for Task 1 Instruction:
I want to use {packageName.className} class from Java. Recommend a list of useful with at most {snippet_number} API method for this class, excluding method inherent from its parent class. For each API method specify its return type and parameters in the below format:
“API signature”: Description of the API

For example:
“boolean add(E e)”: This method appends the specified element to the end of this list.

Response:

We keep the template simple to avoid fluctuations in results caused by prompting techniques on different LLMs. We provide an example in the context (i.e., one-shot) to help the LLM understand our intent and the desired output format. We initially tried a zero-shot prompt, but the LLM did not consistently follow the instructions, leading us to adopt the one-shot strategy. When prompting with this template, we replace the placeholder {packageName.className} with the fully qualified name of the class for which we seek recommendations. To facilitate automated analysis and evaluation, we instruct the LLM to return API recommendations in a specific format, as shown in the template, enabling easy extraction using regular expressions. We also use the placeholder {snippet_number} to limit the number of APIs recommended by the LLM. In this study, we set {snippet_number} to five. Although we set this limit, LLMs sometimes generate more APIs than requested. After the LLM generates its response, we use regular expressions to extract the signatures of all recommended APIs for further analysis.
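The prompt construction and regex-based extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `build_task1_prompt` and `extract_signatures` and the exact regular expression are assumptions, and only the template text itself comes from the paper.

```python
import re

# Template text taken from the paper's Task 1 prompt (reproduced verbatim).
TASK1_TEMPLATE = (
    "I want to use {packageName.className} class from Java. Recommend a list of useful "
    "with at most {snippet_number} API method for this class, excluding method inherent "
    "from its parent class. For each API method specify its return type and parameters "
    "in the below format:\n"
    "“API signature”: Description of the API\n\n"
    "For example:\n"
    "“boolean add(E e)”: This method appends the specified element to the end of this list.\n"
)

# Hypothetical pattern: a quoted signature (straight or curly quotes), a colon,
# then a free-text description on the same line.
SIGNATURE_LINE = re.compile(r'["“]([^"“”\n]+)["”]\s*:\s*(.*)')


def build_task1_prompt(full_class_name: str, snippet_number: int = 5) -> str:
    """Fill the two placeholders; plain replace() is used because the
    placeholder names contain a dot, which str.format() would misread."""
    return (TASK1_TEMPLATE
            .replace("{packageName.className}", full_class_name)
            .replace("{snippet_number}", str(snippet_number)))


def extract_signatures(response: str) -> list:
    """Return every quoted API signature found in the model response."""
    return [m.group(1).strip() for m in SIGNATURE_LINE.finditer(response)]
```

Because the extraction does not cap the number of matches, it also captures the extra recommendations an LLM may produce beyond the requested limit.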

2.1.2. Task 2: API-oriented Code Example Generation

For Task 2, we instruct the LLM to generate a code example for a given API (i.e., method). We present the prompt template for Task 2 below:

Prompt Template for Task 2 Instruction:
I want to learn how to use {method} from {packageName.className}. Generate a complete code example of this method. The code example needs to be executable with import statement and put the method and code snippet in the format below:
Code snippet:
public class Main {
public static void main(String[] args) {
}
}
For example:
boolean add(E e): This method appends the specified element to the end of this list.
Code snippet:
import java.util.ArrayList;
public class Main {
public static void main(String[] args) {
ArrayList<String> list = new ArrayList<>();
list.add("Hello");
System.out.println(list);
}
}
Response:

In this template, similar to the one used in Task 1, we prompt the LLM to generate a code example for a given API (i.e., {method}) in a class (i.e., {packageName.className}). We ask the LLM to output an executable code example with the necessary import statements and to follow the specific format shown in the template. As in Task 1, this design ensures that we can extract the code examples from the response using regular expressions. However, our observations show that providing instructions alone is insufficient to consistently enforce the desired output format. Therefore, we provide an example of the format in the “For example” section to improve the likelihood that the LLM generates code in the expected format. We use regular expressions to extract code examples from the LLM’s response for further analysis.
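As a rough sketch of this extraction step, the snippet below pulls the Java source out of a response. The two patterns are assumptions: the paper states only that regular expressions are used, so the fallback to a fenced block (a format LLMs often emit even when not asked to) and the function name `extract_code_example` are illustrative.

```python
import re
from typing import Optional

# Built programmatically so this listing contains no literal code fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:java)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)
# Everything after the "Code snippet:" marker requested by the Task 2 template.
AFTER_MARKER = re.compile(r"Code snippet:\s*(.*)", re.DOTALL)


def extract_code_example(response: str) -> Optional[str]:
    """Return the code example in a response, or None if none is found.

    A fenced block is preferred when present; otherwise the text following
    the "Code snippet:" marker is used.
    """
    fenced = FENCED_BLOCK.search(response)
    if fenced:
        return fenced.group(1).strip()
    marker = AFTER_MARKER.search(response)
    if marker and marker.group(1).strip():
        return marker.group(1).strip()
    return None
```

A response for which this returns `None` contains no usable code and would have to be handled separately, for instance as a failure to follow the instructions.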

2.2. Evaluation Metrics

For Task 1, our goal is to assess whether LLMs can accurately recommend APIs within a specified package. To evaluate the quality of these recommendations, we compute the proportion of incorrectly recommended APIs relative to the total recommended APIs, denoted as IncorrectAPI. This metric is calculated as $\frac{|IncorrectAPIs|}{|All\ recommended\ APIs|}$. A lower value indicates that the LLM provides more accurate API recommendations, reflecting higher quality. A recommended API is considered correct only if it exists within the specified package and all elements of its signature (i.e., return type, method name, number of parameters, and parameter types) match exactly with the corresponding API in the package. To compute this metric, we cross-reference the actual APIs in the package to ensure that the recommended APIs not only exist but also have signatures that precisely match our records. It is important to note that if an LLM recommends incorrect APIs, this can be considered a form of hallucination, as the recommended APIs do not exist within the package.
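The matching rule above can be illustrated with a small sketch. The parsing approach is an assumption (the paper does not describe how signatures are normalized): each signature is reduced to a key of return type, method name, and parameter types, with parameter names ignored, and a recommendation counts as correct only if its key appears among the documented keys. All function names are hypothetical.

```python
import re

SIGNATURE = re.compile(r"^\s*(?P<ret>.+?)\s+(?P<name>\w+)\s*\((?P<params>.*)\)\s*$")


def _split_params(params: str) -> list:
    """Split on commas that are not nested inside generics such as Map<K, V>."""
    parts, depth, current = [], 0, []
    for ch in params:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def _param_type(param: str) -> str:
    """Drop the trailing parameter name (if any) and all whitespace."""
    depth, cut = 0, None
    for i, ch in enumerate(param):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch.isspace() and depth == 0:
            cut = i  # last top-level whitespace separates type from name
    type_part = param if cut is None else param[:cut]
    return re.sub(r"\s+", "", type_part)


def normalize_signature(signature: str):
    """Return (return_type, name, param_types) or None if unparseable."""
    m = SIGNATURE.match(signature)
    if m is None:
        return None
    params = tuple(_param_type(p) for p in _split_params(m.group("params")))
    return (re.sub(r"\s+", "", m.group("ret")), m.group("name"), params)


def incorrect_api_rate(recommended: list, documented: list) -> float:
    """IncorrectAPI = |incorrect recommendations| / |all recommendations|."""
    if not recommended:
        return 0.0
    known = {normalize_signature(s) for s in documented}
    known.discard(None)
    incorrect = sum(1 for s in recommended if normalize_signature(s) not in known)
    return incorrect / len(recommended)
```

In this sketch, a recommendation that cannot be parsed as a signature is counted as incorrect, which is one possible reading of the exact-match rule.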

For Task 2, we propose three metrics to evaluate the quality of the generated code examples, each capturing a different type of error:

  • NoAPIInvoked: This metric assesses whether the specified API is invoked in the generated code example. When the LLM fails to include the requested API, the case is categorized as an instruction inconsistency hallucination, where the model does not follow the given instruction (Huang et al., 2023). For instance, consider a scenario where the model is asked to generate a code example for the ‘getDecoder()’ method from ‘java.util.Base64’, but the method is absent in the generated code. This situation would be classified as a NoAPIInvoked case. The metric is calculated as the proportion of examples where the specified API is not invoked, defined as $\frac{|NoAPIInvoked|}{|All\ recommended\ Code|}$, where $|All\ recommended\ Code|$ represents the total number of generated examples, and $|NoAPIInvoked|$ is the number of examples where the API is missing.

  • Uncompilable: After excluding the NoAPIInvoked cases, we examine whether the generated code examples can be compiled successfully. We calculate the proportion of code examples that fail to compile, denoted as Uncompilable, which is defined as $\frac{|Uncompilable|}{|All\ recommended\ Code|}$, where $|Uncompilable|$ is the number of uncompilable code examples.

  • Unexecutable: We calculate the proportion of code examples that compile successfully but fail at runtime, denoted as Unexecutable. It is calculated as $\frac{|Unexecutable|}{|All\ recommended\ Code|}$, where $|Unexecutable|$ is the number of unexecutable code examples.

In Task 2, we do not evaluate the correctness of the generated code examples with test cases, as we prompt LLMs only to generate code examples. Instead, we rely on our three metrics, i.e., NoAPIInvoked, Uncompilable, and Unexecutable. Their sum, denoted as TotalError, serves as a lower bound on the error rate of the generated code, because functionally correct API-oriented code must, at the very least, be executable and correctly invoke the specified API.
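The three Task 2 metrics and their sum can be computed as in the sketch below. Each example receives exactly one outcome, checked in the order given above (API invoked, then compilable, then executable), and every rate shares the same denominator, the number of generated examples. The textual invocation check in `invokes_method` is an assumption; the paper does not say how invocation is detected.

```python
import re
from collections import Counter

ERROR_LABELS = ("NoAPIInvoked", "Uncompilable", "Unexecutable")


def invokes_method(code: str, method_name: str) -> bool:
    """Naive check (assumption): the method name is followed by '('."""
    return re.search(r"\b" + re.escape(method_name) + r"\s*\(", code) is not None


def classify(api_invoked: bool, compiled: bool, executed: bool) -> str:
    """Assign one outcome per example, using the order in which checks apply."""
    if not api_invoked:
        return "NoAPIInvoked"
    if not compiled:
        return "Uncompilable"
    if not executed:
        return "Unexecutable"
    return "OK"


def task2_metrics(outcomes: list) -> dict:
    """Return each error rate plus TotalError over all generated examples."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no code examples to evaluate")
    counts = Counter(outcomes)
    rates = {label: counts[label] / total for label in ERROR_LABELS}
    # Counts are summed before dividing to avoid floating-point drift.
    rates["TotalError"] = sum(counts[label] for label in ERROR_LABELS) / total
    return rates
```

Because the outcomes are mutually exclusive, TotalError equals the share of examples that fail at least one check.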

3. Case Study Design

In this section, we present a case study in which we use our framework AutoAPIEval to conduct an empirical study that evaluates and characterizes API-oriented code generation by LLMs. We present our research questions (RQs), dataset, LLMs, and approach for each RQ in the following subsections.

3.1. Research Questions

We conduct our case study around four research questions:

  • Quality Assessment - RQ1: What is the quality of API-oriented code generation by LLMs?

  • Error Analysis - RQ2: What types of errors occur in API-oriented code generation?

  • Factor Analysis - RQ3: What factors are associated with the quality of API-oriented code generation by LLMs?

  • Error Mitigation - RQ4: Does RAG help mitigate the errors in API-oriented code generation?

In RQ1, we investigate the quality of API-oriented code generated by LLMs, including both the recommended APIs and the generated code examples. In RQ2, we focus on understanding the types of errors that occur in the APIs and code examples produced by LLMs. This analysis provides insights into the strengths and limitations of different LLMs in generating API-oriented code. In RQ3, we explore the factors that may be associated with the quality of API-oriented code generated by LLMs. Understanding these factors can offer insights into improving LLM performance and help develop models to predict low-quality code. Lastly, Retrieval-Augmented-Generation (RAG) has been shown to enhance code generation (Chen et al., 2024; Daneshvar et al., 2024; Parvez et al., 2021). Accordingly, in RQ4, we investigate whether RAG can help reduce errors and improve the quality of API-oriented code generated by LLMs.

3.2. Datasets

In this case study, we focus on all packages from the Java Runtime Environment 8 (JRE 8), released on July 19, 2016 (Oracle, [n. d.]). We chose JRE packages due to the widespread usage of their APIs in code repositories, such as those hosted on GitHub. Since these open-source repositories are frequently used to train LLMs, it is likely that the models have substantial knowledge of these APIs. To collect all packages and APIs in each package, we crawled the API documents of JRE 8 from its official website (Oracle, [n. d.]) and stored them in our database. Our dataset consists of 217 packages, comprising 2,397 classes. For Task 1, we generate prompts based on a predefined template for each class, and for Task 2, we construct prompts using the APIs correctly recommended in Task 1.

To determine whether a code example is compilable and executable, we compile and run the Java file as a subprocess from our main script and collect any errors. Furthermore, we have implemented a timeout mechanism to terminate subprocesses that exceed a threshold. Specifically, if the code snippet runs for over 15 seconds, it is automatically terminated and marked as a failure.
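A minimal sketch of such a compile-and-run check is shown below. It assumes that `javac` and `java` are available on the PATH and that the example declares a public class `Main`, as the Task 2 template requests. The function names are hypothetical, and applying the same 15-second limit to compilation is an assumption; the paper states the limit only for running the snippet.

```python
import subprocess
from pathlib import Path

TIMEOUT_SECONDS = 15  # threshold stated in the paper


def run_command(cmd, cwd=None, timeout=TIMEOUT_SECONDS):
    """Run a command as a subprocess and return (succeeded, message).

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def check_java_example(source_code: str, workdir: Path) -> str:
    """Label one code example as Uncompilable, Unexecutable, or OK."""
    (workdir / "Main.java").write_text(source_code, encoding="utf-8")
    compiled, _ = run_command(["javac", "Main.java"], cwd=workdir)
    if not compiled:
        return "Uncompilable"
    executed, _ = run_command(["java", "-cp", ".", "Main"], cwd=workdir)
    return "OK" if executed else "Unexecutable"
```

Running each example in its own temporary directory (for instance one created with `tempfile.TemporaryDirectory`) keeps the class files of different examples from interfering with one another.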

3.3. Base LLMs

In our study, we employed three different LLMs as the base models: ChatGPT (OpenAI, 2022), MagiCoder (Wei et al., 2023), and DeepSeek Coder (DeepSeek-AI et al., 2024). These models were chosen because they represent both commercial general-purpose LLMs and open-source LLMs specialized for code tasks, and because they rank among the top performers in code generation tasks (https://www.vellum.ai/llm-leaderboard).

For ChatGPT, we use gpt-3.5-turbo (OpenAI, 2022) for its state-of-the-art capabilities in both general-purpose natural language processing tasks and code generation (Dong et al., 2023; Yetiştiren et al., 2023). For MagiCoder, we use Magicoder-S-DS-6.7B (mag, [n. d.]), which is designed specifically for code generation and coding-related tasks. DeepSeek Coder (dee, [n. d.]), built upon Deepseek-LLM 7B, was chosen for its superior ability to solve code-related tasks. For all LLMs, we set the temperature to 0.6 and kept all other hyper-parameters at their default values.

We conducted our analysis on all three LLMs for RQ1. Since the performance of MagiCoder and DeepSeek Coder is similar, as shown in the results of RQ1 (see Table 2), we conducted our experiments and analysis on MagiCoder and ChatGPT for the remaining RQs (RQ2-RQ4).

3.4. Approaches for RQs

3.4.1. RQ1 - Quality Assessment

In RQ1, we perform the two unit tasks on each of the 2,397 target classes extracted from JRE 8 leveraging three selected LLMs. We record the inference output from each prompt and extract the generated code. We then evaluate the quality of generated code using the metrics outlined in Section 2.2 for both tasks. We repeated each task 10 times to reduce the bias from randomness.

3.4.2. RQ2 - Error Analysis

For Task 1 - API recommendation, as discussed in Section 2.2, we classify any recommended APIs that do not exist within the given package as incorrect, which can be considered hallucinations generated by the LLM. Consequently, we adopt an established categorization scheme from prior hallucination studies (Huang et al., 2023) to classify the errors in API recommendations as follows:

  • Factual Fabrication: The recommended API is entirely fabricated, meaning the API name does not exist in the specified package.

  • Factual Inconsistency: Unlike factual fabrication, the recommended API name does exist; however, other elements in the signature (e.g., return type or parameters) are incorrect.

  • Instruction Inconsistency: The LLM fails to follow the instruction to generate a list of APIs with their full signatures. This may include missing return types, missing parameters, or generating irrelevant text instead of API signatures.

  • Context Inconsistency: The LLM incorrectly claims that the context provided in the prompt is wrong and fails to follow the instructions. For example, it may claim that a specified class is not part of the package and thus does not provide any API recommendations.
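To make the distinctions between these categories concrete, the sketch below shows how a first-pass triage could mirror the definitions. It is only an illustration: the categorization in this study was performed manually by two authors, and the function name and the tuple representation of a signature are assumptions. Context Inconsistency is not covered, since it concerns the response as a whole rather than an individual signature.

```python
from typing import Optional, Set, Tuple

# Hypothetical parsed form: (return_type, method_name, parameter_types).
Signature = Tuple[str, str, Tuple[str, ...]]


def categorize_recommendation(parsed: Optional[Signature],
                              documented: Set[Signature]) -> str:
    """Map one recommended API to a category from the taxonomy above.

    `parsed` is None when the recommendation could not be read as a full
    signature (e.g., missing return type or parameters).
    """
    if parsed is None:
        return "Instruction Inconsistency"
    if parsed in documented:
        return "Correct"
    documented_names = {name for _, name, _ in documented}
    if parsed[1] not in documented_names:
        # The method name itself does not exist in the class.
        return "Factual Fabrication"
    # The name exists, but the return type or parameters differ.
    return "Factual Inconsistency"
```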

We randomly selected a statistically representative sample of 384 incorrect APIs (method signatures) recommended by ChatGPT, using a 95% confidence level and a 5% margin of error. Given the similarity in API recommendation quality between MagiCoder and DeepSeek Coder (as shown in the RQ1 results), we focused on MagiCoder’s incorrect recommendations and similarly sampled 384 incorrect APIs. Two authors independently categorized the errors in these APIs, identifying deficiencies or ambiguities by cross-referencing API documentation. They then discussed their categorizations to resolve any disagreements and reached a consensus.
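The sample size of 384 is consistent with the standard formula for estimating a proportion, $n_0 = z^2 p(1-p)/e^2$, evaluated at $z = 1.96$ (95% confidence), $e = 0.05$, and the most conservative proportion $p = 0.5$. The paper does not name the formula it used, so the sketch below is an assumption that happens to reproduce the reported figure; the optional finite-population correction is included only for completeness.

```python
from typing import Optional


def sample_size(z: float = 1.96, margin: float = 0.05, proportion: float = 0.5,
                population: Optional[int] = None) -> float:
    """Sample size needed to estimate a proportion.

    z          -- z-score for the confidence level (1.96 for 95%)
    margin     -- margin of error
    proportion -- assumed proportion; 0.5 gives the largest sample
    population -- if given, apply the finite-population correction
    """
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    if population is None:
        return n0
    return n0 / (1 + (n0 - 1) / population)


if __name__ == "__main__":
    # About 384.16 for a large population, i.e. the 384 used in the study.
    print(sample_size())
```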

In Task 2, the four types of hallucinations defined in Task 1 also occur. For example, if a fabricated API is used in a generated code example, we classify it as a Factual Fabrication. Similarly, if the LLM fails to invoke the specified API, resulting in a NoAPIInvoked case, we label it as Instruction Inconsistency. However, in Task 2, a code example could still have compilation or runtime errors without hallucinating on APIs. In other words, the four types of hallucination errors cannot cover all cases. To address this, we conducted an open coding process to derive additional error categories (subcategories under compilation errors and runtime errors), following the methodology of prior studies (Zhang et al., 2019; Wu et al., 2019; Seaman, 1999). We began by randomly sampling 384 unexecutable code examples generated by MagiCoder. The first two authors manually reviewed 100 randomly selected examples from this set to develop an initial list of error categories. During this process, the categories were iteratively refined and revised. Once the error types were finalized, both authors independently applied these categories to the remaining 284 samples. After labeling, they discussed their findings to resolve disagreements and reached a consensus. The final list of derived error types is shown in Table 1. We followed the same process for ChatGPT, i.e., randomly sampling 384 unexecutable code examples and categorizing them accordingly. It is important to note that a code example may be unexecutable or uncompilable due to hallucinations, such as invoking fabricated APIs. In such cases, we categorize it as a hallucination rather than a Runtime Error or Compilation Error.

Table 1. Types of errors that occurred in code examples generated by LLMs.

| Type | Sub-Type | Definition |
|---|---|---|
| Hallucination | Factual Fabrication | A fabricated API was invoked in the code example. |
| | Factual Inconsistency | Unlike Factual Fabrication, the name of the API does exist, but other elements of the signature (i.e., return type and parameters) are wrong. |
| | Instruction Inconsistency | The LLM does not follow the instructions to generate code examples, such as in NoAPIInvoked cases. |
| | Context Inconsistency | The LLM indicates that the context provided in the prompt is wrong and does not follow the instructions. For example, the LLM indicates that the specified API is not in the given package and does not generate any code. |
| Compilation Errors | Type Mismatch | Errors that occur when an operation or function receives a variable or argument of a different data type than expected (e.g., assigning an int to a string variable). |
| | Missing Import Statement | Errors caused by missing import statements in the code example. |
| | Polymorphism Error | Errors related to polymorphism (e.g., not fully implementing an abstract method from a superclass). |
| | Undeclared variable/class/structure | Errors caused by a variable/class/structure being used without being previously declared or defined. |
| | API misuse | Errors that occur when APIs are misused (e.g., using a non-static method in a static context). |
| Runtime Errors | Initialization Error | Errors that occur when a variable/class/structure is not initialized correctly (e.g., an object is used before it has been fully or correctly instantiated, or a system environment variable is not set up properly). |
| | Exception Handling Error | Errors that occur when exceptions are not caught or thrown properly. |
| | Timeout Error | Errors that occur when execution exceeds a pre-defined time limit (15 seconds in this study). |
| | Connection Error | Errors that arise when the code example is unable to establish a successful connection to a remote server, service, network, or another device. |
| | Missing External Resource | Errors that occur when the code example fails to access or retrieve data from an external resource, such as a file, database, API, or network service. |
| | API misuse | Errors that occur when APIs are misused (e.g., using protected/private APIs, or passing parameters in the wrong format). |
| | Deprecated Error | Warnings that occur when deprecated APIs are used. |

3.4.3. RQ3 - Factors Analysis

In this RQ, we aim to investigate the API-level factors that may be associated with the quality of code generation. Specifically, for Task 1, we seek to investigate factors that are associated with whether a recommended API is correct. For Task 2, we focus on understanding the factors that are associated with erroneous code examples (i.e., examples that are NoAPIInvoked, uncompilable, or unexecutable).

We examine these factors from two perspectives: 1) the API itself and 2) the model used. From the API perspective, we consider two factors: the API’s popularity (i.e., API_popularity) and its length (i.e., API_length). From the model perspective, we analyze three factors: Self-Probing (i.e., Probing), Perplexity (i.e., PPL) and self-consistency (i.e., Consistency). Each of these factors is discussed in more detail below.

API_popularity: LLMs are trained on public datasets, so APIs that are widely used are more likely to have substantial representation in the training data. Hence, an LLM is more likely to generate high-quality code for these popular APIs. To measure the popularity of APIs across GitHub repositories, we employ Google BigQuery (Google, [n. d.]) to analyze the frequency of API package imports. Developers commonly import specific packages to access corresponding APIs. By querying BigQuery’s public GitHub dataset (Hoffa, [n. d.]), we quantify API popularity by counting the number of repositories referencing each package, offering a reliable metric for API usage across GitHub.
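As a rough local analogue of this measurement, the sketch below counts the number of distinct repositories that import each package, given in-memory (repository, Java source) pairs. It is not the BigQuery query used in the study: the function name `count_package_popularity`, the input shape, and the import-matching regular expression are all illustrative assumptions, and static imports and nested-class imports are deliberately ignored or approximated.

```python
import re
from collections import defaultdict

# Matches non-static Java imports such as "import java.util.List;" or "import java.util.*;".
# Group 1 is the imported name, group 2 is ".*" for wildcard imports.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+?)(\.\*)?\s*;", re.MULTILINE)

def count_package_popularity(files, packages):
    """Count the distinct repositories that import each package.

    files: iterable of (repo_name, java_source) pairs (hypothetical input shape).
    packages: package names to measure, e.g. ["java.util", "java.awt.event"].
    """
    wanted = set(packages)
    repos_by_package = defaultdict(set)
    for repo, source in files:
        for name, wildcard in IMPORT_RE.findall(source):
            # For "java.util.*" the captured name is already the package;
            # for "java.util.List" the package is everything before the last dot.
            # Imports of nested classes (e.g. java.util.Map.Entry) are only approximated.
            package = name if wildcard else name.rpartition(".")[0]
            if package in wanted:
                repos_by_package[package].add(repo)
    return {pkg: len(repos_by_package[pkg]) for pkg in packages}
```

Counting repositories rather than import statements keeps a single large project from dominating the popularity score.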

API_length: Longer APIs are intuitively more challenging for LLMs to predict accurately, as there are more components that need to match exactly with the existing API. Therefore, we take the length of the API into account, measured as the number of characters in its full signature, which includes the return type, method name, and parameter types.

Probing: As noted in a previous study (Hao et al., 2024), developers often start by asking LLMs if they are familiar with specific APIs before proceeding with detailed tasks, allowing them to probe the LLM’s capacity to support their software engineering inquiries. Following this strategy, we probe the LLM to determine its knowledge of a given library. For Task 1, we adjust the Task 1 prompt template to ask whether the LLM recognizes a specific class. Similarly, for Task 2, we ask whether the LLM knows a particular API. In both cases, we request a “Yes” or “No” response, so the factor is binary.

PPL: Perplexity quantifies a model’s uncertainty when generating text, with a lower value suggesting more accurate predictions that align closely with the actual text distribution (Chen et al., 1998). To compute perplexity, we enabled the LLMs to return output tokens and log probabilities, where the log probability reflects the likelihood of a token appearing in a sequence given its preceding context. Perplexity is calculated by exponentiating the negative average of these log probabilities. For Task 1, we computed the perplexity for each returned API, and for Task 2, we calculated it for the returned code snippets, providing a measure of the model’s confidence.
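The computation described above can be sketched as follows, assuming the model returns one natural-log probability per output token. The function name `perplexity` and the input format are illustrative assumptions rather than part of the framework.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generated sequence.

    token_logprobs: natural-log probabilities of the output tokens, one per token.
    Returns exp(-mean(logprobs)); 1.0 means the model was fully certain.
    """
    if not token_logprobs:
        raise ValueError("at least one token log probability is required")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, a sequence in which every token has probability 0.5 yields a perplexity of 2, while values close to 1 (as in Table 5) indicate high model confidence.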

Consistency: Self-consistency measures the certainty or reliability an LLM demonstrates through its internal coherence across multiple responses to the same or similar inputs (Wang et al., 2022). Higher self-consistency usually indicates greater confidence in generating outputs aligned with factual knowledge. For Task 1, we ran each prompt 10 times and determined self-consistency by calculating the frequency of each API’s occurrence; a higher frequency indicates greater consistency (e.g., an API appearing 8 out of 10 times has a self-consistency of 0.8). For Task 2, we repeated the prompt for each API 10 times and measured consistency using the distance to center, which quantifies intra-similarity among the generated code examples (Rajaraman and Ullman, 2011). To measure the distance between two code examples, we embedded the examples into vectors using Sentence Transformer (sen, [n. d.]). A smaller distance indicates that a code example is more consistent with the others.
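Both consistency measures can be sketched in a few lines. The version below assumes that the repeated responses have already been parsed into lists of API signatures (Task 1) and that the code examples have already been embedded into vectors (Task 2). The function names are hypothetical, and the use of Euclidean distance to the centroid is an assumption, since the distance metric is not specified above.

```python
import math
from collections import Counter

def api_consistency(runs):
    """Task 1: fraction of repetitions in which each API was recommended.

    runs: one list of recommended API signatures per repetition.
    """
    n = len(runs)
    # Use set(run) so that an API repeated within one response counts once.
    counts = Counter(api for run in runs for api in set(run))
    return {api: count / n for api, count in counts.items()}

def distance_to_center(vectors):
    """Task 2: distance of each embedded code example to the centroid.

    vectors: equal-length embedding vectors, one per generated code example.
    Euclidean distance is assumed here.
    """
    dim = len(vectors[0])
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [math.dist(v, center) for v in vectors]
```

With 10 repetitions, an API that appears in 8 of them receives a consistency of 0.8, matching the example in the text.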

After collecting the factors, we divided the APIs and code examples into two groups. For Task 1, we categorized the APIs based on whether they were recommended correctly, as described in Section 2.2: correctly recommended APIs (APIcorrect) and incorrectly recommended APIs (APIincorrect). For Task 2, we grouped the code examples into two classes: erroneous code examples (Codeerroneous) and non-erroneous code examples (Codenon-erroneous). To determine whether the studied factors differ significantly between the two groups in both tasks, we employed the Mann-Whitney U test, a non-parametric test that does not require any assumptions about the underlying data distribution. Additionally, we computed Cliff’s d to measure the effect size of the differences between the two groups, indicating the magnitude of the difference. The interpretation of effect size follows the thresholds provided by Cliff (Cliff, 1993): |d| < 0.147 indicates a negligible effect, |d| < 0.33 indicates a small effect, |d| < 0.474 indicates a medium effect, and values larger than 0.474 indicate a large effect. A factor with a significant difference and non-negligible effect size between the two groups indicates this factor is a good indicator of the difference between the two groups.
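To make the effect-size step concrete, the following minimal sketch computes Cliff’s d directly from its definition (the share of cross-group pairs where the first group is larger minus the share where it is smaller) and maps it to the labels above. The function names are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed. The p-value of the Mann-Whitney U test is not computed here and would typically come from a statistics package.

```python
def cliffs_delta(xs, ys):
    """Cliff's d: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def interpret_effect_size(d):
    """Map |d| to the labels used in this paper."""
    magnitude = abs(d)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"
```

A positive d means that values in the first group tend to be larger than those in the second, so the sign shows the direction of the difference and the label shows its magnitude.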

We also built classification models to determine whether it is feasible to predict incorrect APIs and erroneous code examples for Tasks 1 and 2, respectively, using the proposed factors. We split the dataset into an 80:20 ratio for training and testing, following prior studies (Yang et al., 2023; Ahmed et al., 2024; Xu et al., 2017; Qiao et al., 2020). We selected Random Forest as the classifier for both tasks due to its generally high accuracy (Rajbahadur et al., 2017; Ghotra et al., 2015; Yang et al., 2024) and its robustness to outliers (Cutler et al., 2012). To evaluate the performance of the models, we measured precision, recall, and F1-score. Additionally, we assessed feature importance (Menze et al., 2009) to understand the contribution of each factor to the prediction, following the approach of prior studies (Rajbahadur et al., 2019; Wang et al., 2023; Santos et al., 2020). Note that to mitigate the bias from highly correlated factors, we computed the correlation among all studied factors and observed that the correlations vary across factor pairs. For instance, the correlations among the model-related factors PPL, Probing, and Consistency range from 0.14 to 0.59, and the correlations between the API-related factors API_length and API_popularity range from 0.44 to 0.51. According to prior studies (Rajbahadur et al., 2019; Wang et al., 2023; Santos et al., 2020), factors with a correlation above 0.7 are considered highly correlated, and only one of them should be kept as the representative. In our case, no pair of factors exceeds this threshold, so we keep all of them.
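The correlation check can be illustrated with a small sketch that flags factor pairs whose absolute correlation reaches the 0.7 threshold. The correlation coefficient used in the study is not stated here, so the use of Pearson correlation is an assumption, and the function names and input layout are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def highly_correlated_pairs(factors, threshold=0.7):
    """Return (factor_a, factor_b, r) for every pair with |r| >= threshold.

    factors: dict mapping a factor name to its list of values (hypothetical layout).
    """
    flagged = []
    for a, b in combinations(sorted(factors), 2):
        r = pearson(factors[a], factors[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged
```

An empty result corresponds to the situation reported above, where no pair is flagged and all factors are kept.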

Similar to RQ2, we conduct our experiments on ChatGPT and MagiCoder for this RQ.

3.4.4. RQ4 - Error mitigation

Retrieval-augmented generation (RAG) has been shown to enhance code generation by integrating relevant external knowledge into the language model’s context (Chen et al., 2024; Parvez et al., 2021; Daneshvar et al., 2024; Lu et al., 2022; Tan et al., 2024; Nashid et al., 2023). We aim to investigate whether employing RAG can improve API recommendation and code example generation. For Task 1, we enrich the context provided to the LLM by including descriptions of both the package and the class in front of the “Instruction” section in the prompt template, as detailed in Section 2.1.1. We denote this RAG strategy as RAG$^{desc}_{T1}$. For example, when requesting API recommendations for the class “Hashtable” in the “java.util” package, we prepend the prompt with relevant descriptions: “Package description: package description of java.util; Class description: class description of Hashtable”. In addition, we explore a variant of RAG$^{desc}_{T1}$ in which we add a list of existing APIs within the class as additional context, to test whether actually providing correct APIs further improves the LLM’s ability to recommend correct APIs. We denote this RAG strategy as RAG$^{desc+API}_{T1}$. In this study, we provide a list of 10 APIs. For Task 2, we extend the context used in Task 1 by adding a detailed description of the specific API, including its summary, return type, and input parameters. Our goal is to determine if this additional information improves the quality of code example generation. We denote this RAG strategy as RAG$^{desc}_{T2}$.
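The three RAG variants differ only in which pieces of documentation are placed in front of the base prompt, which the sketch below illustrates. The “Package description” and “Class description” labels follow the example above, while the function name `build_rag_prompt`, the “Existing APIs” and “API description” labels, and the separators are hypothetical, because the exact prompt template of Section 2.1.1 is not reproduced here.

```python
def build_rag_prompt(base_prompt, package_desc, class_desc,
                     known_apis=None, api_desc=None, max_apis=10):
    """Prepend retrieved documentation to a base prompt.

    known_apis: optional list of existing API signatures (desc+API variant for Task 1).
    api_desc:   optional description of the target API (Task 2 variant).
    """
    parts = [
        f"Package description: {package_desc}",
        f"Class description: {class_desc}",
    ]
    if known_apis:
        # The study provides a list of 10 APIs; longer lists are truncated here.
        parts.append("Existing APIs: " + ", ".join(known_apis[:max_apis]))
    if api_desc:
        parts.append(f"API description: {api_desc}")
    return "; ".join(parts) + "\n" + base_prompt
```

Calling the function with only the two descriptions corresponds to the description-only strategy for Task 1, adding `known_apis` to the variant with an API list, and adding `api_desc` to the Task 2 strategy.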

4. Results of Case Study

4.1. RQ1 - Quality Assessment

Table 2. The quality of API-oriented code generated by the three studied LLMs (MagiCoder, DeepSeek Coder, and ChatGPT) for both tasks. The cells with the best result are marked in bold.
| Model | Task 1: IncorrectAPI | Task 2: NoAPIInvoked | Task 2: Uncompilable | Task 2: Unexecutable | Task 2: TotalError |
|---|---|---|---|---|---|
| MagiCoder | 84.1% | 20.7% | 22.4% | 11.4% | 54.5% |
| DeepSeek Coder | 82.9% | 9.9% | 25.5% | 15.7% | 51.1% |
| ChatGPT | 58.1% | 5.4% | 20.4% | 13.6% | 39.4% |

Hallucinations are prevalent in the API recommendation task, with 58.1% to 84.1% of the recommended APIs not existing in the specified package. Table 2 summarizes the quality of recommended APIs for Task 1. Specifically, 84.1%, 82.9%, and 58.1% of the APIs recommended by MagiCoder, DeepSeek Coder, and ChatGPT, respectively, do not exist in the specified package. In addition, we examine the incorrect API recommendations by locating where the error occurs in the method signature. Table 3 presents the types of errors produced by the three LLMs. 1.8% to 15.6% of the errors are due to not recommending method APIs (i.e., NotMethod); in these cases, the LLMs suggested other class elements, such as fields, instead of methods. The majority of the errors involve recommending method names that do not exist (MethodNameNotExist). The remaining errors are attributed to incorrect return types or parameters (Incorrect ReturnType/Parameter). Notably, we only assessed the correctness of return types and parameters when the method name was accurate. Therefore, a single recommendation may contain additional errors in other parts of its signature that are not counted separately. Interestingly, among the Incorrect ReturnType/Parameter cases, a large portion of errors (85.6% to 86.2%) resulted from combining return types and parameters from multiple overloaded methods. For example, MagiCoder recommended the method “boolean remove(Object o)” for the java.util.Hashtable class. However, only two overloaded methods, “V remove(Object key)” and “boolean remove(Object key, Object value)”, exist in this class. This is likely because method overloading is common in Java, which confuses LLMs when they recommend API methods.
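The categorization above can be sketched as a comparison between a recommended signature and the signatures documented for the class. In this minimal illustration, a signature is a (return type, method name, parameter types) tuple, parameter names are ignored, and the function names and the field-based NotMethod check are assumptions rather than the exact procedure of the framework.

```python
# Documented overloads of remove() in java.util.Hashtable (Java 8), types only.
HASHTABLE_METHODS = {
    ("V", "remove", ("Object",)),
    ("boolean", "remove", ("Object", "Object")),
}

def classify_recommendation(rec, methods, fields=frozenset()):
    """Classify one recommended signature against the documented class members."""
    _, name, _ = rec
    if name in fields:
        return "NotMethod"
    if not any(m[1] == name for m in methods):
        return "MethodNameNotExist"
    if rec in methods:
        return "Correct"
    return "Incorrect ReturnType/Parameter"

def mixes_overloads(rec, methods):
    """True if an incorrect signature combines the return type of one documented
    overload with the parameter list of another."""
    ret, name, params = rec
    if rec in methods:
        return False
    overloads = [m for m in methods if m[1] == name]
    has_return_type = any(m[0] == ret for m in overloads)
    has_parameters = any(m[2] == params for m in overloads)
    return has_return_type and has_parameters
```

For the example in the text, `("boolean", "remove", ("Object",))` is classified as Incorrect ReturnType/Parameter and is flagged as a mix of the two documented overloads.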

Table 3. The types of errors that occurred in the APIs recommended by MagiCoder, DeepSeek Coder, and ChatGPT.
| Model | NotMethod | MethodNameNotExist | Incorrect ReturnType/Parameter |
|---|---|---|---|
| MagiCoder | 15.6% | 77.0% | 7.4% |
| DeepSeek Coder | 10.0% | 81.5% | 8.5% |
| ChatGPT | 1.8% | 77.9% | 20.3% |

39.4% to 54.5% of LLM-generated code examples have errors. More specifically, 5.4% to 20.7% of the code examples fail to include the API they are intended to demonstrate, and the rest of the erroneous code examples fail to compile or execute. As shown in Table 2, 5.4% to 20.7% of the generated code examples omit the specified APIs entirely, which is a form of hallucination (see details in RQ2). In these cases, the LLM fails to follow the instructions to generate code examples for the given API. 20.4% to 25.5% of the generated code examples fail to compile, and 11.4% to 15.7% fail to execute successfully.

Different LLMs vary in performance across different tasks. Compared with MagiCoder and DeepSeek Coder, ChatGPT tends to follow the instructions better and has a lower rate of hallucinations, while producing a similar proportion of uncompilable/unexecutable code examples. For Task 1, ChatGPT generated a significantly lower proportion of incorrect APIs than MagiCoder and DeepSeek Coder. Furthermore, MagiCoder and DeepSeek Coder tended to produce more APIs than ChatGPT. Although the LLMs were instructed to recommend a maximum of 5 APIs per prompt (totaling at most 119,850 APIs for 2,397 classes with 10 repetitions), MagiCoder and DeepSeek Coder often exceeded this limit, whereas ChatGPT followed the instructions more accurately. For instance, ChatGPT generated 114,339 APIs while MagiCoder generated 153,470 for Task 1. A similar trend was observed in Task 2, where ChatGPT had fewer NoAPIInvoked cases than MagiCoder and DeepSeek Coder. However, MagiCoder, DeepSeek Coder, and ChatGPT share similar performance in generating compilable and executable code.

Different LLMs vary in performance across different tasks. Compared with MagiCoder and DeepSeek Coder, ChatGPT tends to follow the instructions better and has a lower rate of hallucinations, while producing a similar proportion of uncompilable/unexecutable code examples. Hallucinations are prevalent in the API recommendation task, with 58.1% to 84.1% of the recommended APIs not existing in the specified package. For code example generation, 39.4% to 54.5% of code examples have errors. More specifically, 5.4% to 20.7% of the code examples fail to include the given API, and the rest of the erroneous code examples fail to compile or execute.

4.2. RQ2 - Error Analysis

Table 4. The count and distribution of each type of error in Task 1 and Task 2.
| Type | Sub-Type | MagiCoder Task 1 | MagiCoder Task 2 | ChatGPT Task 1 | ChatGPT Task 2 |
|---|---|---|---|---|---|
| Hallucination | Factual Fabrication | 180 (46.8%) | 52 (13.5%) | 197 (51.3%) | 46 (12.0%) |
| | Factual Inconsistency | 89 (23.2%) | 55 (14.3%) | 66 (17.2%) | 42 (10.9%) |
| | Instruction Inconsistency | 115 (30.0%) | 27 (7.0%) | 121 (31.5%) | 6 (1.6%) |
| | Context Inconsistency | N/A | 6 (1.6%) | N/A | 0 |
| Compilation Errors | Missing Import Statement | N/A | 19 (4.9%) | N/A | 32 (8.3%) |
| | Type Mismatch | N/A | 21 (5.5%) | N/A | 9 (2.3%) |
| | Polymorphism Error | N/A | 22 (5.7%) | N/A | 16 (4.2%) |
| | Undeclared variable/class/structure | N/A | 3 (0.8%) | N/A | 0 |
| | API misuse | N/A | 6 (1.6%) | N/A | 52 (13.5%) |
| Runtime Errors | Initialization Error | N/A | 68 (19.3%) | N/A | 106 (27.6%) |
| | Exception Handling Error | N/A | 22 (5.7%) | N/A | 20 (5.2%) |
| | Timeout Error | N/A | 11 (2.9%) | N/A | 1 (0.3%) |
| | Connection Error | N/A | 16 (4.2%) | N/A | 3 (0.8%) |
| | Missing External Resource | N/A | 25 (6.5%) | N/A | 20 (5.2%) |
| | API misuse | N/A | 21 (5.5%) | N/A | 10 (2.6%) |
| | Deprecated Error | N/A | 9 (2.3%) | N/A | 23 (6.0%) |

For the API recommendation task, most errors are due to Factual Fabrication hallucinations (46.8%/51.3%), followed by Instruction Inconsistencies (30.0%/31.5%), and finally, Factual Inconsistencies for both MagiCoder and ChatGPT. Table 4 presents the distribution of error types for Task 1. A similar trend is observed for both MagiCoder and ChatGPT. The majority of errors stem from Factual Fabrication. For example, when asked to recommend APIs for “java.util.Arrays”, the LLM suggested “createCompatibleGraphics()”, which does not exist. Upon reviewing the API documentation, we found a similar method, “createCompatibleImage()”, indicating that the LLM likely confused the terms “Graphics” and “Image” after generating the prefix “createCompatible”. The second most common error is Instruction Inconsistency. For instance, the LLM often disregards the required format, omitting return types or parameters for certain static APIs. The least frequent error is Factual Inconsistency, where the LLM generates the correct API name but with incorrect return types or parameters.

In the code example generation task, the most common error category is runtime errors (46.4%/47.7%), followed by hallucinations (36.1%/24.5%) and compilation errors. As presented in Table 4, Runtime Errors occur the most frequently: 46.4% and 47.7% of the errors occur during runtime for MagiCoder and ChatGPT, respectively. In general, the two LLMs share similar patterns, and the most frequent error type is Initialization Error. For instance, Listing 1 shows a code example that MagiCoder generated for “getClickCount()” in the class “java.awt.event.MouseEvent”, in which null was passed as the Component source parameter of “MouseEvent()”. The source should not be null; it should be a valid component (such as a JFrame or a JButton). The object was therefore not initialized properly. Other examples of this type of error relate to system environment preparation/configuration (e.g., a system environment variable that is not set).

Listing 1: Code example of Context Initialization Error
import java.awt.event.MouseEvent;
public class MouseEvent_3 {
    public static void main(String[] args) {
        MouseEvent event = new MouseEvent(null, 0, 0, 0, 0, 0, 0, false, 0);
        int clickCount = event.getClickCount();
        System.out.println("Number of times the mouse button has been clicked: " + clickCount);
    }
}

Hallucinations accounted for 36.1% and 24.5% of the errors in Task 2 when using MagiCoder and ChatGPT, respectively. Most of these hallucinations were categorized as Factual Fabrication and Factual Inconsistency. Unlike in Task 1, Instruction Inconsistency was rare in Task 2. Interestingly, we observed six cases of Context Inconsistency when using MagiCoder. For instance, we requested the LLM to generate a code snippet for the method “void update(Graphics g)” from the “java.awt.Canvas” class. However, the LLM incorrectly asserted that the method “void update(Graphics g)” is not part of the Canvas class, which contradicted the context we provided. In fact, the “java.awt.Component” class also contains a method with the same name, “void update(Graphics g)”. This likely confused the LLM, resulting in the hallucination.

For the API recommendation task, most errors are due to Factual Fabrication hallucinations (46.8%/51.3%), followed by Instruction Inconsistencies (30.0%/31.5%), and finally, Factual Inconsistencies for both MagiCoder and ChatGPT. For code example generation, most errors are Runtime Errors (e.g., Initialization Error), followed by Hallucinations and Compilation Errors.

4.3. RQ3 - Factors Analysis

Table 5. Statistical test results for Task 1 and Task 2.
Task 1:
| Model | Factor | Mean of APIcorrect | Mean of APIincorrect | p-value | Cliff’s d |
|---|---|---|---|---|---|
| MagiCoder | API_popularity | 71,557.7 | 23,322.5 | 0.0 | 0.490 (large) |
| | API_length | 58.3 | 67.2 | 0.0 | -0.252 (small) |
| | PPL | 1.04 | 1.04 | 0.34 | -0.006 (negligible) |
| | Consistency | 0.29 | 0.14 | 0.0 | 0.432 (medium) |
| | Probing | 0.22 | 0.11 | 9.06e-171 | (negligible) |
| ChatGPT | API_popularity | 49,959.2 | 17,223.3 | 0.0 | 0.360 (medium) |
| | API_length | 62.4 | 75.4 | 0.0 | 0.281 (small) |
| | PPL | 1.10 | 1.12 | 2.28e-47 | -0.093 (negligible) |
| | Consistency | 0.4545 | 0.1971 | 0.0 | 0.487 (large) |
| | Probing | 0.68 | 0.47 | 0.0 | 0.213 (small) |

Task 2:
| Model | Factor | Mean of Codenon-erroneous | Mean of Codeerroneous | p-value | Cliff’s d |
|---|---|---|---|---|---|
| MagiCoder | API_popularity | 99,549.9 | 54,182.8 | 0.0 | 0.318 (small) |
| | API_length | 51.3 | 57.6 | 2.49e-178 | -0.221 (small) |
| | PPL | 1.11 | 1.23 | 8.11e-99 | -0.164 (small) |
| | Consistency | 0.28 | 0.38 | 0.0 | -0.398 (medium) |
| | Probing | 0.65 | 0.53 | 1.556e-77 | 0.124 (negligible) |
| ChatGPT | API_popularity | 93,920.6 | 67,180.0 | 1.59e-143 | 0.197 (small) |
| | API_length | 52.5 | 58.7 | 2.32e-74 | -0.141 (negligible) |
| | PPL | 1.08 | 1.15 | 9.10e-48 | -0.112 (negligible) |
| | Consistency | 0.21 | 0.23 | 5.63e-38 | -0.100 (negligible) |
| | Probing | 0.79 | 0.47 | 0.0 | 0.321 (small) |

For both tasks, almost all studied factors show a significant difference between the two groups of generated APIs/code examples. Table 5 presents the statistical test results on the studied factors for Task 1 and Task 2. For Task 1, in both LLMs, the factors exhibit significant differences between the two groups of APIs with non-negligible effect sizes, except PPL on both LLMs and Probing on MagiCoder. For example, the popularity of APIincorrect is substantially lower than that of APIcorrect, with a large effect size on MagiCoder and a medium effect size on ChatGPT. This aligns with the expectation that a more popular API, which is likely to have more related usage in LLMs’ training data, increases the model’s likelihood of making correct recommendations. From the model’s perspective, Consistency serves as a strong indicator for differentiating between the two API groups. For Task 2, all studied factors demonstrate significant differences between the groups of erroneous and non-erroneous code examples, although API_length, PPL, and Consistency exhibit negligible effect sizes on ChatGPT, as does Probing on MagiCoder.

In addition, we compute the correlation between the value of IncorrectAPI and each studied factor for each package. For numerical factors (i.e., API_length, API_popularity, Consistency, and PPL), we use the median value of all APIs as the representative for the package. For the binary factor Probing, we calculate the percentage of classes where the LLM responds “Yes” as the measure of self-probing for the package. For ChatGPT, the correlations between the proportion of incorrect APIs and the factors are as follows: API_length (0.35), API_popularity (-0.43), Consistency (-0.59), PPL (0.22), and Probing (-0.49). Additionally, for API-related factors, we observe that packages with longer APIs and less popular packages are more likely to produce incorrect APIs. For Task 2, we compute the correlation between the total proportion of errors (TotalError) and the studied factors for ChatGPT. The correlations for Task 2 are as follows: API_length (0.13), API_popularity (-0.26), Consistency (-0.10), PPL (0.22), and Probing (-0.50). Compared to Task 1, the correlations for other factors are weaker, except for Probing. Due to space constraints, we do not present the plots for Task 2. We also observe similar patterns for MagiCoder.
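The package-level aggregation described above can be sketched as follows for a single numerical factor. The record layout (one dictionary per recommended API with `package`, `api_length`, and `incorrect` keys) and the function name are hypothetical, and the subsequent correlation step is omitted because the coefficient used is not stated.

```python
from collections import defaultdict
from statistics import median

def package_level_summary(records):
    """Aggregate API-level records into one row per package.

    records: dicts with keys 'package', 'api_length' (number) and
             'incorrect' (bool); this layout is an illustrative assumption.
    """
    rows_by_package = defaultdict(list)
    for record in records:
        rows_by_package[record["package"]].append(record)
    summary = {}
    for package, rows in rows_by_package.items():
        summary[package] = {
            # Median of the numerical factor represents the package.
            "median_api_length": median(r["api_length"] for r in rows),
            # Proportion of incorrect APIs in the package (IncorrectAPI).
            "incorrect_rate": sum(r["incorrect"] for r in rows) / len(rows),
        }
    return summary
```

The resulting per-package pairs (median factor value, IncorrectAPI) are the inputs to the correlations reported in this paragraph.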

Figure 2. Partial dependence plots for Task 1 when using ChatGPT.

Our classifiers achieve F1-scores of 0.96/0.88 for Task 1 and 0.80/0.76 for Task 2 in predicting incorrectly recommended APIs or erroneous generated code. Table 6 presents the classification performance for both tasks. The results suggest that our proposed factors are effective indicators for distinguishing between the two groups of APIs and code examples in both tasks. Table 6 also shows the feature importance of each factor for the trained classifiers in both tasks. Factors related to model confidence, such as Consistency and PPL, and the API-related factors API_popularity and API_length are all important for distinguishing the two groups of APIs and code examples.
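For reference, the evaluation metrics can be computed from a classifier’s predictions as in the sketch below. The function name and the convention that label 1 marks the positive class (an incorrect API or erroneous code example) are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, a high score such as 0.96 requires both precision and recall to be high.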

Figure 2 shows the partial dependence plots for Task 1 when using ChatGPT; the results echo our correlation analysis above. For instance, we observe that the likelihood of generating an incorrect API is positively associated with PPL and API_length, while negatively associated with API_popularity, Consistency, and Probing. We observe similar patterns for Task 2 on ChatGPT and MagiCoder.

Table 6. The classification results and feature importance on Task 1 and Task 2.
| Task | Model | API_popularity (importance) | API_length (importance) | PPL (importance) | Consistency (importance) | Probing (importance) | F1-score |
|---|---|---|---|---|---|---|---|
| Task 1 | MagiCoder | 0.216 | 0.297 | 0.341 | 0.133 | 0.013 | 0.96 |
| | ChatGPT | 0.247 | 0.221 | 0.328 | 0.183 | 0.022 | 0.88 |
| Task 2 | MagiCoder | 0.264 | 0.183 | 0.228 | 0.309 | 0.016 | 0.80 |
| | ChatGPT | 0.256 | 0.200 | 0.226 | 0.225 | 0.093 | 0.76 |

Factors such as API popularity and model confidence are strongly associated with API-oriented code quality. Using our proposed factors, we built highly accurate classifiers to detect incorrect API recommendations and unexecutable/uncompilable code examples (e.g., F1-scores of 0.96 and 0.80 for Task 1 and Task 2 on MagiCoder).

4.4. RQ4 - Error Mitigation

In general, RAG improves the quality of code generated by LLMs, although the size of the improvement differs across LLMs. Table 7 compares the quality of code generated by LLMs with and without using RAG. Across both tasks, RAG improves code quality when MagiCoder and ChatGPT are used as the base LLMs. However, the magnitude of these improvements differs between the two models. Notably, RAG brings more substantial improvements for ChatGPT than for MagiCoder. For example, in Task 2, RAG reduces TotalError from 44.4% to 43.2% for MagiCoder, a modest relative improvement of 2.7%. In contrast, for ChatGPT, RAG decreases TotalError from 51.0% to 30.8%, representing a much larger relative improvement of 39.6%.
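The improvement percentages quoted here are relative reductions, i.e., the drop in error rate divided by the error rate without RAG. The short sketch below reproduces them from the Task 2 TotalError values in Table 7; the function name is illustrative.

```python
def relative_reduction(before, after):
    """Relative reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100

# Task 2 TotalError without RAG vs. with the description-based RAG (Table 7).
magicoder = relative_reduction(44.4, 43.2)
chatgpt = relative_reduction(51.0, 30.8)
print(f"MagiCoder: {magicoder:.1f}%, ChatGPT: {chatgpt:.1f}%")
```

Rounded to one decimal place, these evaluate to 2.7% for MagiCoder and 39.6% for ChatGPT.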

For Task 1, it is surprising that even when provided with a list of correct APIs in the context, the LLMs still fail to recommend APIs accurately. As shown in Table 7, despite having the correct APIs listed along with the context, LLMs still make a substantial number of errors in their recommendations. Specifically, 40.3% of the APIs recommended by MagiCoder and 27.9% of those recommended by ChatGPT do not exist in the specified package. This is unexpected, as the task should be straightforward: selecting from the provided list of correct APIs. One possible explanation is that LLMs sometimes disregard the given context and rely instead on the internal knowledge encapsulated in the model (Su et al., 2024; Marjanović et al., 2024).

Table 7. Comparison of the quality of API-oriented code generation by LLMs with/without RAG. The cells with better results are highlighted in bold.
| Model | RAG | Task 1: IncorrectAPI | Task 2: NoAPIInvoked | Task 2: Uncompilable | Task 2: Unexecutable | Task 2: TotalError |
|---|---|---|---|---|---|---|
| MagiCoder | w/o RAG | 85.3% | 9.1% | 21.0% | 14.3% | 44.4% |
| | RAG$^{desc}_{T1/T2}$ | 84.0% | 7.2% | 23.0% | 13.0% | 43.2% |
| | RAG$^{desc+API}_{T1}$ | 40.3% | N/A | N/A | N/A | N/A |
| ChatGPT | w/o RAG | 57.3% | 4.0% | 32.7% | 14.2% | 51.0% |
| | RAG$^{desc}_{T1/T2}$ | 54.3% | 1.6% | 18.6% | 10.5% | 30.8% |
| | RAG$^{desc+API}_{T1}$ | 27.9% | N/A | N/A | N/A | N/A |

In general, RAG improves the quality of code generated by LLMs, although the size of the improvement differs across LLMs. For Task 1, surprisingly, even when provided with a list of correct APIs in the context, the LLMs still fail to recommend APIs accurately.

5. Discussion

5.1. Implications of our findings

Our research identifies model-related indicators for predicting incorrect API-oriented code generation by LLMs. As shown in RQ3, our proposed factors can be used to build well-performing classifiers that identify low-quality API-oriented code generated by LLMs. More specifically, model-related factors (e.g., Consistency) are strongly correlated with the quality of APIs and code examples generated by LLMs in both tasks. The importance scores from the constructed models also highlight that PPL and Consistency are critical factors. Therefore, model-related factors could serve as indicators of an LLM’s capability for API-oriented code generation for a specific library. For example, developers could directly probe the LLM by asking whether it knows the library or its APIs and observe the PPL of the output. In fact, we tested models built only with model-related factors, and they achieve F1-scores of 0.96 and 0.63 for Task 1 and Task 2, respectively.

Hallucinations are prevalent in API-oriented code generation, and future research is encouraged to mitigate these issues. As observed in RQ1 and RQ2, various hallucinations occur across both tasks. For example, RQ1 shows that 5.4% to 20.7% of the code examples generated by LLMs do not include the specified APIs, consistent with findings from previous studies (Spracklen et al., 2024). Factual Fabrication and Factual Inconsistency are the most frequent types of hallucinations, where fabricated APIs are generated. Our study suggests that these hallucinations may stem from factors such as a lack of training data and confusion over overloaded methods. Future research should explore methods to mitigate hallucinations in API-oriented code generation. Approaches from the NLP field, like RAG, fine-tuning, and self-reflection (Huang et al., 2023), could be adapted for this context. For instance, RAG appears promising, as indicated by the results in RQ4, where it reduced errors. Another direction is enhancing self-reflection with fact-checking, as RQ3 shows that self-probing can be a good indicator of poor code generation. Additionally, API documentation and runtime results could provide valuable information for quick fact-checking of LLM-generated code (Kabir et al., 2024). Future research could develop approaches that combine self-reflection and fact-checking to reduce hallucinations (e.g., Chain-of-Verification (Dhuliawala et al., 2023)).

Future research is strongly encouraged to develop more effective approaches to leverage API-related information in RAG. As we observed in RQ4, even when a list of APIs is provided in the context, LLMs cannot recommend fully correct APIs, although the list does reduce the proportion of incorrect APIs. One possible reason is that, when context knowledge and parametric knowledge conflict, LLMs sometimes disregard the given context (context knowledge) and rely only on their parametric knowledge, which is encapsulated in the LLM’s parameters. Prior studies report this behavior, typically when the prompt is long (Su et al., 2024; Marjanović et al., 2024; Shi et al., 2023). Future research is strongly encouraged to develop more effective RAG approaches that leverage external knowledge (e.g., API documentation) to mitigate errors. For instance, approaches that align with external knowledge and emphasize context prioritization, such as faithful-to-context strategies (Xu et al., 2024; Zhou et al., 2023), could be used to guide LLMs in prioritizing contextual information.

5.2. Threats to validity

Internal Validity. Prompt engineering has a significant impact on an LLM’s performance (Grabb, 2023), and different prompts may lead to different results. However, as discussed in Section 2, our tasks are basic and straightforward, and the results in Section 4 show that LLMs can usually follow the instructions in our prompts to complete them. Our framework proposes two basic tasks, i.e., API recommendation and code example generation, to benchmark an LLM’s ability in API-oriented code generation. One threat is that an LLM’s ability on these two tasks may not closely align with its ability to generate code for a specific task. However, as discussed in Section 2, the goal of our framework is to automatically evaluate LLMs on any given library with API documentation. Therefore, we do not include tasks such as code generation with specific requirements, which typically need test cases. We believe our framework provides a lower bound for assessing LLMs’ capability for API-oriented code generation. Nevertheless, we encourage future research to include more tasks that reflect LLMs’ ability to generate code using specific libraries under specific requirements. Previous studies suggest that LLM settings, such as temperature and decoding strategies, can significantly affect the quality of generated content (Renze and Guven, 2024; Thakur et al., 2024). In this study, we use the default settings of the studied LLMs for all RQs. However, our framework enables such analysis, and we examined whether different temperatures and decoding strategies (i.e., beam search, top-K, and greedy search) have a measurable impact on the quality of generated code (due to space limits, we do not present the results here). In general, a lower temperature tends to produce code of similar or higher quality for both tasks and across both LLMs. Greedy search, beam search, and top-K perform similarly.
Another threat is that certain APIs are version-sensitive. We encourage future work to take this into consideration when using our framework for evaluation.
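For readers less familiar with the LLM settings mentioned in the internal-validity threats above, the sketch below shows how temperature reshapes a next-token distribution and how a greedy step differs from a top-K sampling step. It is a generic, self-contained illustration rather than the configuration used in this study: the logits are made-up values, the function names are hypothetical, and beam search is omitted for brevity.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Greedy search step: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(
    logits: List[float], k: int, temperature: float, rng: random.Random
) -> int:
    """Top-K step: sample only among the k highest-scoring tokens."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax_with_temperature([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.2)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
choice = top_k_sample(logits, k=2, temperature=1.0, rng=random.Random(0))
```

With a low temperature, almost all probability mass falls on the top-scoring token, so sampling behaves much like greedy search; a higher temperature spreads the mass and makes less likely tokens more probable.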

External Validity relates to the generalizability of our findings. Even though we conducted our empirical study on three state-of-the-art LLMs (i.e., ChatGPT, MagiCoder, and DeepSeek Coder) and JRE 8, our findings may not generalize well to other LLMs and libraries. Our framework enables automatic and systematic analysis of other LLMs and libraries, and we encourage future research on more of them.

6. Related work

Code recommendation and generation with LLM. In recent years, there has been increasing interest in using Large Language Models (LLMs) for generating code from natural language prompts (Lu et al., [n. d.]; Chen et al., 2021; Luo et al., 2023; Wei et al., 2023; Li et al., 2023). Lu et al. initiated this field with CodeGPT, based on GPT-2 and specifically trained on source code (Lu et al., [n. d.]). Chen et al. advanced this by fine-tuning GPT-3 models to create Codex, which excels at generating both natural language and code (Chen et al., 2021). More recent models, such as StarCoder (Li et al., 2023), WizardCoder (Luo et al., 2023), and MagiCoder (Wei et al., 2023), further enhance code generation capabilities. In addition to generating code from natural language, integrating Application Programming Interfaces (APIs) is crucial. Although a few studies have explored API-oriented code generation for specific libraries (Zan et al., 2023; Liu et al., 2023), most of these efforts primarily focused on developing LLM-based approaches to generate code that interacts with APIs. Several studies explored API integration during code generation and revealed issues such as license issues (Zan et al., 2022; Latendresse et al., 2024a) and hallucination (Spracklen et al., 2024). Different from prior studies, we focus on developing an automated framework to evaluate LLMs on API-oriented code generation and enable further analysis, rather than analyzing the errors.

Benchmarking for code generation. To evaluate the functional correctness of generated code, the most effective method is to test its execution against predefined test cases. Several benchmarks have been developed to assess LLMs’ code generation abilities (Chen et al., 2021; Zhuo et al., 2024; Zhang et al., 2023; Du et al., 2023; Yu et al., 2024). HumanEval (hum, [n. d.]) is widely used, testing code correctness through execution on Python problems. BigCodeBench (Zhuo et al., 2024) evaluates code generation with diverse function calls and complex instructions, while RepoEval (Zhang et al., 2023) focuses on repository-level code completion using unit tests. ClassEval (Du et al., 2023) challenges LLMs with class-level code generation. Unlike these benchmarks, which assess general code generation and require test cases for evaluation, our focus is specifically on API-oriented code generation. More importantly, our proposed framework is fully automated and relies only on API documentation as input.

Issues with Code Generation using LLMs. Despite advancements in LLM-based code generation, issues such as vulnerabilities (Nair et al., 2023; Fu et al., 2023; Majdinasab et al., 2024; Fan et al., 2023; Pearce et al., 2022), compile/runtime errors (Dou et al., 2024; Pan et al., 2023), copyright issues (Latendresse et al., 2024b) and hallucinations (Liu et al., 2024; Li et al., 2024; Ji et al., 2023; Spracklen et al., 2024; Tian et al., 2024) persist. For example, Pearce et al. found that around 40% of Copilot-generated programs are vulnerable, a finding echoed by Majdinasab et al., who reported 27.25% of code suggestions with vulnerabilities even in newer Copilot versions. Hallucinations, where LLMs generate factually incorrect content, pose challenges in producing reliable code snippets, leading to issues like intent conflicts and context deviations (Liu et al., 2024). Dou et al. observed that LLMs often produce shorter but convoluted code for complex tasks, based on error types and compiler feedback (Dou et al., 2024). Our study extends this analysis to API-oriented code generation, addressing not only hallucinations but also runtime and compilation errors.

7. Conclusion

We propose AutoAPIEval, a lightweight and automated framework for evaluating LLMs in API-oriented code generation. Compatible with any library that provides API documentation, our framework focuses on two unit tasks: API recommendation and code example generation, along with four evaluation metrics, including the proportion of incorrect API recommendations, the proportion of code examples in which no specific API is invoked, and the proportion of uncompilable or unexecutable code examples.

To demonstrate the framework’s effectiveness, we conducted a case study with three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) on JRE 8. Our findings show notable variability in LLM performance across tasks, with ChatGPT generally following instructions better but generating more unexecutable code compared to the other models. We identify crucial factors that are associated with code quality, such as API popularity and model confidence. We develop classifiers that achieve high accuracy in detecting low-quality API recommendations and code examples. Additionally, while retrieval-augmented generation improves code quality, its effectiveness varies between different LLMs. Our findings offer valuable insights for future research directions.

8. Data Availability

We have made our replication package, which contains all the code and datasets, publicly available (anonymous, 2024).

References

  • sen ([n. d.]) [n. d.]. all-MiniLM-L6-v2. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. Accessed: 2024-04-22.
  • dee ([n. d.]) [n. d.]. Deepseek Coder. https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct. Accessed: 2024-04-22.
  • Git ([n. d.]) [n. d.]. GithubCopilot. https://github.com/features/copilot. Accessed: 2024-04-22.
  • hum ([n. d.]) [n. d.]. Human Eval. https://github.com/openai/human-eval. Accessed: 2024-04-22.
  • mag ([n. d.]) [n. d.]. Magicoder. https://huggingface.co/ise-uiuc/Magicoder-S-DS-6.7B. Accessed: 2024-04-22.
  • Ahmed et al. (2024) Shahla Shaan Ahmed, Shaowei Wang, Yuan Tian, Tse-Hsun Peter Chen, and Haoxiang Zhang. 2024. Studying and recommending information highlighting in Stack Overflow answers. Information and Software Technology 172 (2024), 107478.
  • anonymous (2024) anonymous. 2024. AutoAPIEval. https://anonymous.4open.science/r/AutoAPIEval-F8A5/README.md.
  • Chen et al. (2024) Junkai Chen, Xing Hu, Zhenhao Li, Cuiyun Gao, Xin Xia, and David Lo. 2024. Code search is all you need? improving code suggestions with code search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Chen et al. (1998) Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998).
  • Ciniselli et al. (2023) Matteo Ciniselli, Luca Pascarella, Emad Aghajani, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. Source code recommender systems: The practitioners’ perspective. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2161–2172.
  • Cliff (1993) Norman Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological bulletin 114, 3 (1993), 494.
  • Cutler et al. (2012) Adele Cutler, D Richard Cutler, and John R Stevens. 2012. Random forests. Ensemble machine learning: Methods and applications (2012), 157–175.
  • Dakhel et al. (2023) Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Zhen Ming Jack Jiang. 2023. Github copilot ai pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734.
  • Daneshvar et al. (2024) Seyed Shayan Daneshvar, Yu Nong, Xu Yang, Shaowei Wang, and Haipeng Cai. 2024. Exploring RAG-based Vulnerability Augmentation with LLMs. arXiv preprint arXiv:2408.04125 (2024).
  • DeepSeek-AI et al. (2024) DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y. K. Li, Wenfeng Liang, Fangyun Lin, A. X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R. X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, and Yuheng Zou. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv:2401.02954 [cs.CL] https://arxiv.org/abs/2401.02954
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023).
  • Dong et al. (2023) Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration code generation via chatgpt. arXiv preprint arXiv:2304.07590 (2023).
  • Dou et al. (2024) Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al. 2024. What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. arXiv preprint arXiv:2407.06153 (2024).
  • Du et al. (2023) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861 (2023).
  • Fan et al. (2023) Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53.
  • Fu et al. (2023) Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, and Jiaxin Yu. 2023. Security weaknesses of copilot generated code in github. arXiv preprint arXiv:2310.02059 (2023).
  • Ghotra et al. (2015) Baljinder Ghotra, Shane McIntosh, and Ahmed E Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 789–800.
  • Google ([n. d.]) Google. [n. d.]. googlebigquery. https://cloud.google.com/bigquery/. Accessed: 2024-04-22.
  • Grabb (2023) Declan Grabb. 2023. The impact of prompt engineering in large language model performance: a psychiatric example. Journal of Medical Artificial Intelligence 6 (2023).
  • Guo et al. (2022) Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022).
  • Hajipour et al. (2023) Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, and Mario Fritz. 2023. CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models. arXiv:2302.04012 [cs.CR] https://arxiv.org/abs/2302.04012
  • Hao et al. (2024) Huizi Hao, Kazi Amit Hasan, Hong Qin, Marcos Macedo, Yuan Tian, Steven H. H. Ding, and Ahmed E. Hassan. 2024. An Empirical Study on Developers Shared Conversations with ChatGPT in GitHub Pull Requests and Issues. arXiv:2403.10468 [cs.SE] https://arxiv.org/abs/2403.10468
  • Hoffa ([n. d.]) Felipe Hoffa. [n. d.]. GitHub on BigQuery: Analyze all the open source code. https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code. Accessed: 2024-04-22.
  • Huang et al. (2023) Lei Huang, Weitao Ma, Weijiang Yu, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (2023), 1–38.
  • Kabir et al. (2024) Azmain Kabir, Shaowei Wang, Yuan Tian, Tse-Hsun Chen, Muhammad Asaduzzaman, and Wenbin Zhang. 2024. ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using ChatGPT. arXiv:2401.14279 [cs.SE] https://arxiv.org/abs/2401.14279
  • Khoury et al. (2023) Raphaël Khoury, Anderson R Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How secure is code generated by chatgpt?. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2445–2451.
  • Latendresse et al. (2024a) Jasmine Latendresse, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. 2024a. Is ChatGPT a Good Software Librarian? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations. arXiv:2408.05128 [cs.SE] https://arxiv.org/abs/2408.05128
  • Latendresse et al. (2024b) Jasmine Latendresse, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. 2024b. Is ChatGPT a Good Software Librarian? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations. arXiv preprint arXiv:2408.05128 (2024).
  • Li et al. (2024) Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024. The dawn after the dark: An empirical study on factuality hallucination in large language models. arXiv preprint arXiv:2401.03205 (2024).
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
  • Liu et al. (2024) Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, and Li Zhang. 2024. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971 (2024).
  • Liu et al. (2023) Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng. 2023. Codegen4libs: A two-stage approach for library-oriented code generation. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 434–445.
  • Lu et al. (2022) Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. Reacc: A retrieval-augmented code completion framework. arXiv preprint arXiv:2203.07722 (2022).
  • Lu et al. ([n. d.]) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. [n. d.]. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568 (2023).
  • Majdinasab et al. (2024) Vahid Majdinasab, Michael Joshua Bishop, Shawn Rasheed, Arghavan Moradidakhel, Amjed Tahir, and Foutse Khomh. 2024. Assessing the Security of GitHub Copilot’s Generated Code-A Targeted Replication Study. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 435–444.
  • Marjanović et al. (2024) Sara Vera Marjanović, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma, and Isabelle Augenstein. 2024. From Internal Conflict to Contextual Adaptation of Language Models. arXiv preprint arXiv:2407.17023 (2024).
  • Menze et al. (2009) Bjoern H Menze, B Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich, and Fred A Hamprecht. 2009. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC bioinformatics 10 (2009), 1–16.
  • Nair et al. (2023) Madhav Nair, Rajat Sadhukhan, and Debdeep Mukhopadhyay. 2023. Generating secure hardware using chatgpt resistant to cwes. Cryptology ePrint Archive (2023).
  • Nashid et al. (2023) Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).
  • OpenAI (2022) OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt. Accessed: 2023-12-28.
  • Oracle ([n. d.]) Oracle. [n. d.]. Java API Specification. https://docs.oracle.com/javase/8/docs/api/overview-summary.html. Accessed: 2024-04-22.
  • Pan et al. (2023) Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code. arXiv preprint arXiv:2308.03109 (2023).
  • Paredes et al. (2023) Cristian Mauricio Gallardo Paredes, Cristian Machuca, and Yadira Maricela Semblantes Claudio. 2023. ChatGPT API: Brief overview and integration in Software Development. International Journal of Engineering Insights 1, 1 (2023), 25–29.
  • Parvez et al. (2021) Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval augmented code generation and summarization. arXiv preprint arXiv:2108.11601 (2021).
  • Pearce et al. (2022) Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.
  • Peng et al. (2023) Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: Evidence from github copilot. arXiv preprint arXiv:2302.06590 (2023).
  • Qiao et al. (2020) Lei Qiao, Xuesong Li, Qasim Umer, and Ping Guo. 2020. Deep learning based software defect prediction. Neurocomputing 385 (2020), 100–110.
  • Rajaraman and Ullman (2011) Anand Rajaraman and Jeffrey D Ullman. 2011. Mining of massive datasets. Autoedicion.
  • Rajbahadur et al. (2017) Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, and Ahmed E Hassan. 2017. The impact of using regression models to build defect classifiers. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 135–145.
  • Rajbahadur et al. (2019) Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, and Ahmed E Hassan. 2019. Impact of discretization noise of the dependent variable on machine learning classifiers in software engineering. IEEE Transactions on Software Engineering 47, 7 (2019), 1414–1430.
  • Renze and Guven (2024) Matthew Renze and Erhan Guven. 2024. The effect of sampling temperature on problem solving in large language models. arXiv preprint arXiv:2402.05201 (2024).
  • Santos et al. (2020) Geanderson Santos, Eduardo Figueiredo, Adriano Veloso, Markos Viggiato, and Nivio Ziviani. 2020. Predicting software defects with explainable machine learning. In Proceedings of the XIX Brazilian Symposium on Software Quality. 1–10.
  • Seaman (1999) Carolyn B. Seaman. 1999. Qualitative methods in empirical studies of software engineering. IEEE Transactions on software engineering 25, 4 (1999), 557–572.
  • Shi et al. (2023) Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739 (2023).
  • Spracklen et al. (2024) Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, and Murtuza Jadliwala. 2024. We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs. arXiv preprint arXiv:2406.10279 (2024).
  • Su et al. (2024) Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. 2024. ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM. arXiv preprint arXiv:2408.12076 (2024).
  • Tan et al. (2024) Hanzhuo Tan, Qi Luo, Ling Jiang, Zizheng Zhan, Jing Li, Haotian Zhang, and Yuqun Zhang. 2024. Prompt-based Code Completion via Multi-Retrieval Augmented Generation. arXiv preprint arXiv:2405.07530 (2024).
  • Thakur et al. (2024) Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2024. Verigen: A large language model for verilog code generation. ACM Transactions on Design Automation of Electronic Systems 29, 3 (2024), 1–31.
  • Tian et al. (2024) Yuchen Tian, Weixiang Yan, Qian Yang, Qian Chen, Wen Wang, Ziyang Luo, and Lei Ma. 2024. CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification. arXiv preprint arXiv:2405.00253 (2024).
  • Wang et al. (2023) Tianlei Wang, Shaowei Wang, and Tse-Hsun Peter Chen. 2023. Study the correlation between the readme file of GitHub projects and their popularity. Journal of Systems and Software 205 (2023), 111806.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021).
  • Wei et al. (2023) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source Code Is All You Need. arXiv preprint arXiv:2312.02120 (2023).
  • Wu et al. (2019) Yuhao Wu, Shaowei Wang, Cor-Paul Bezemer, and Katsuro Inoue. 2019. How do developers utilize source code from stack overflow? Empirical Software Engineering 24 (2019), 637–673.
  • Xu et al. (2017) Bowen Xu, Zhenchang Xing, Xin Xia, and David Lo. 2017. AnswerBot: Automated generation of answer summary to developers’ technical questions. In 2017 32nd IEEE/ACM international conference on automated software engineering (ASE). IEEE, 706–716.
  • Xu et al. (2024) Rongwu Xu, Zehan Qi, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024. Knowledge Conflicts for LLMs: A Survey. arXiv preprint arXiv:2403.08319 (2024).
  • Yang et al. (2024) Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, and Zhen Ming Jiang. 2024. SimClone: Detecting Tabular Data Clones using Value Similarity. ACM Transactions on Software Engineering and Methodology (2024).
  • Yang et al. (2023) Xu Yang, Shaowei Wang, Yi Li, and Shaohua Wang. 2023. Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2287–2298.
  • Yetiştiren et al. (2023) Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778 (2023).
  • Yu et al. (2024) Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.
  • Zan et al. (2023) Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. 2023. Private-library-oriented code generation with large language models. arXiv preprint arXiv:2307.15370 (2023).
  • Zan et al. (2022) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888 (2022).
  • Zhang et al. (2023) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
  • Zhang et al. (2019) Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, Ying Zou, and Ahmed E Hassan. 2019. An empirical study of obsolete answers on stack overflow. IEEE Transactions on Software Engineering 47, 4 (2019), 850–862.
  • Zhou et al. (2023) Wenxuan Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. 2023. Context-faithful prompting for large language models. arXiv preprint arXiv:2303.11315 (2023).
  • Zhuo et al. (2024) Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. 2024. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877 (2024).