Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent

Jie JW Wu\orcidlink0000-0002-7895-2023 University of British Columbia3333 University WayKelownaB.C.V1V 1V7Canada jie.jw.wu@ubc.ca  and  Fatemeh H. Fard University of British Columbia3333 University WayKelownaB.C.V1V 1V7Canada fatemeh.fard@ubc.ca
Abstract.

Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is using LLM-based agents to iterate the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce Ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. For this purpose, we define the communication skills of LLMs as “being able to ask clarifying questions when the description of the code generation problem has issues”. In this study, we restrict these issues to three matters from the software requirement engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, the challenges of programming with LLMs, such as unclear intent specification may be alleviated, resulting to a correct code in the initial iterations.

In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to three issues mentioned above, Inconsistency, Ambiguity, Incompleteness. We then experimented on HumanEvalComm with different Code LLMs, and a new LLM agent approach, Code Clarification and Generation Agent (Okanagan), to identify and ask questions in ambiguous parts from code and descriptions for further refining the generated code. In the evaluation, we introduced an LLM-based evaluator and created Communication Rate and Good Question Rate as the evaluation metrics to represent the ratio of questions asked and questions with good quality in responses. We found that more than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. The Pass@1 and Test Pass Rate of most Code LLMs drop by 35% similar-to\sim 52% and by 17% similar-to\sim 35% respectively, with statistical significance in each category for over 75% numbers. Okanagan, as an LLM agent approach that uses LLM such as ChatGPT 3.5, effectively increases the Communication Rate and Good Question Rate by an absolute 58% and 38%, respectively. Thus, Okanagan boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%, respectively, when the problem descriptions are modified based on given clarification categories. This result indicates the potential for achieving more effective communication capability using LLM agent.

“Asking a good question can be valuable in and of itself, irrespective of the answer. It communicates your respect for the other person.”
—— - Adapted from the Iowa Peace Institute Message

1. Introduction

Large language models (LLMs) (Vaswani et al., 2017; Svyatkovskiy et al., 2020; Wang et al., 2021; Feng et al., 2020), such as OpenAI’s Codex (Chen et al., 2021), AlphaCode (Li et al., 2022), and CodeGen (Nijkamp et al., 2022), possess a significantly capable ability to generate code snippets from natural language requirements. However, there are several reported issues in LLMs, including problems with intent specification, problem decomposition (Sarkar et al., 2022), code quality, and overconfidence (Liu et al., 2023c, b), as well as usability (Liang et al., 2023). These issues indicate that there is still a substantial gap between using LLM as a seasoned coder (Rabinovich et al., 2017; Ye et al., 2020; Alon et al., 2019; Bui et al., 2021; Tufano et al., 2020) and using LLM as a software engineer. As the responsibility of software developers encompasses more than just writing code, current LLMs cannot fully replace professional software developers (Sarkar et al., 2022; Borji, 2023). At a high level, the gap lies in several critical aspects of software development beyond coding, such as effective communications, requirements, design, domain knowledge, and the broader context of relevant projects and components (Nguyen and Nadi, 2022; Sobania et al., 2022; Vaithilingam et al., 2022; Siddiq et al., 2022). Although some LLM-based agent systems have got a lot of attention, e.g., Devin (Wu, 2024), there is no study that investigates the reported issues and systematically integrates them with the LLM agent approach for code generation. In this paper, we are interested in applying the communication lens to inspect the gap, given that we envision effective communication as a critical capability that ensures the necessary information is obtained for completing the coding tasks.

Let us take a step back to compare the communications of LLMs and software developers. The current LLMs are typically evaluated by generating code in one or multiple attempts from one-off problem descriptions, without further conversational inputs (Chen et al., 2021; Austin et al., 2021; Li et al., 2022). This means when the input problem description is error-prone or incomplete without full context, the model has to generate the code without the chance to clarify questions that are necessary to ensure the correctness of the code. In the literature, the communication capability (defined below) of Code LLM (Fan et al., 2023; Zan et al., 2023) and LLM agent (Rasheed et al., 2024; Xi et al., 2023) is underrepresented and thus rarely emphasized and evaluated in the field of code generation. On the contrary, given a software engineering task in real-world enterprises, professional developers use various ways of communication, such as asking more questions in 1:1 conversations, group meetings, and Slack channels to obtain more information and reduce Ambiguity about the detailed requirements, the context of the projects, and the design alternatives. Proactive and effective communication is a critical skill in practice for top-level software developers to accomplish their software engineering tasks reliably with high quality (Whitehead, 2007; Pressman, 2005; Mistrík et al., 2010; McChesney and Gallagher, 2004; Jazayeri, 2004).

Inspired by this behavior, our motivation in this work is to study and evaluate the potential of LLMs on code generation from the dimension of effective communication skills. We argue that the evaluation of the communication capability of Code LLMs is, although underrepresented in literature, essential for the long-term success of AI systems in completing the coding and software engineering tasks (Hassan et al., 2024). Thus, we intend to fill this literature gap in this research for the code generation task. For highly specialized task of code generation, we argue that the AI system should proactively recognize which information is missing, and find these missing pieces to be able to complete the task with high quality and rigorousness, instead of just executing the given task and generating low-quality code as a result. Formally, the communication capability, also referred to as communication skills or communication competency, in this study is defined as follows: when the requirements are incomplete, inconsistent, or ambiguous in a programming problem, and the model is prompted to either generate code or ask clarifying questions, how good the model is in asking clarifying questions to recover the requirements necessary for solving the problem correctly. We use the terms ‘LLMs’ and ‘Code LLMs’ interchangeably to represent LLMs for code generation tasks in this paper.

In this research, we conducted the first systematic empirical study on the communication skills of LLMs in code generation tasks. First, we created a benchmark dataset, HumanEvalComm, for evaluating the degree of communication skills when generating code, based on the widely studied HumanEval code generation benchmark (Chen et al., 2021). We constructed the benchmark by manually modifying the requirements in the original problem description based on concepts in Requirement Engineering (RE) (Tukur et al., 2021; Dermeval et al., 2016). To achieve this, we created a taxonomy of clarification types: Ambiguity, Inconsistency, and Incompleteness (See Section 2). Based on the taxonomy, we then changed each problem description by applying one or a combination of clarification types. Based on the new HumanEvalComm benchmark, we further evaluated different models to inspect the degree of their communication skills when certain information is manually modified to be ambiguous, inconsistent, or incomplete in the problem description. In the evaluation, we introduced an LLM-based evaluator and proposed new evaluation metrics to effectively measure the communication skills of the models. We also proposed a LLM agent approach, Code Clarification and Generation Agent (Okanagan), as an LLM-based agent with multi-round structure and customized prompt for code generation task. A key feature of Okanagan is the ability to ask clarifying questions about the input problem descriptions needed for generating correct code.

In terms of findings, for manual modifications using HumanEvalComm, more than 60% of responses from Code LLMs still generate code. Typically, the Pass@1 and Test Pass Rate of Code LLMs drop by 35% similar-to\sim 52% and by 17% similar-to\sim 35%, respectively. Among the three clarification types, the Incompleteness category results in higher communication rates and Good Question Rates, but lower Pass@1 and Test Pass Rate than the Ambiguity and Inconsistency categories for Code LLMs. Okaganan, the proposed LLM agent approach that uses ChatGPT 3.5 as LLM, effectively increased Communication Rate and Good Question Rate by an absolute 59% and 5%, respectively. This resulted in an increase in Test Pass Rate and Pass@1 by 25% and 15%, respectively. This result indicates the potential for more effective communication capability for LLM agent compared with Code LLMs.

To summarize, we have made the following contributions:

  • We created a new benchmark, HumanEvalComm, for evaluating the degree of communication skills of LLMs for code by manually modifying the requirements in the original problem description based on RE concepts: clarification types of Ambiguity, Inconsistency, Incompleteness.

  • We proposed an LLM-agent approach, Code Clarification and Generation Agent (Okanagan), to enhance the communication capability of the models, and thus lead to better code generation capability, in terms of Pass@1, based on asking clarifying questions when the problem description is ambiguous, inconsistent, or incomplete. The contribution of Okanagan is a multi-round structure with customized prompts for asking clarifying questions when needed in code generation tasks.

  • We conducted the first empirical study on the evaluation of communication competence in code generation task for both Code LLMs and Okanagan on HumanEvalComm. In the evaluation, we introduced LLM-based evaluator and proposed two new evaluation metrics, Communication Rate and Good Question Rate, to effectively measure communication skills of the models.

Our benchmark and replication package are made public at https://github.com/jie-jw-wu/human-eval-comm to support open data and open science principles.

The rest of the paper is structured as follows. Section 2 describes the benchmark construction of our research. Section 3 explains the design of our empirical study. Section 4 summarizes the results for RQs. Section 5 includes more analysis and discussions on the results. Threats to validity are explained in Section 6, followed by summarizing the related works in Section 7. Finally, Section 8 concludes this work.

2. Benchmark Construction

2.1. Benchmark Collection

Existing Benchmarks. We first start by examining the existing benchmarks for code generation. To the best of our knowledge, all of the existing benchmarks (e.g., HumanEval (Chen et al., 2021), CoNaLa (Yin et al., 2018), APPS (Hendrycks et al., 2021), and recent SWE-bench (Jimenez et al., 2024)) in code generation are tasked with letting the model generate the code directly as prediction, without giving the model the opportunity to ask for additional information. Notably, the input of these datasets is well-written and organized by professional human annotations. However, in real-world scenarios, the problem descriptions from humans could be a lack of computational thinking, unclear in the intent specification, or ambiguous in requirements (Liang et al., 2023).

HumanEvalComm Overview. To assess the communication ability of Code LLMs and LLM-based agent, we chose to hand-craft a new benchmark based on a widely used code generation dataset, HumanEval (Chen et al., 2021). Our objective is to modify the problem description based on RE concepts so that it should trigger clarifying questions, which are necessary for generating the correct code. HumanEval is composed of 164 hand-crafted coding problems in Python and was created to evaluate the coding capabilities of Codex. Each problem has a function signature, docstring, body, and unit tests. The average number of ground-truth test cases per problem is 7.77. HumanEval is chosen as it is a benchmark dataset with test cases and is widely used for evaluating LLMs (Ouyang et al., 2023; Min et al., 2023; Rasheed et al., 2024; Zan et al., 2023). Using HumanEval, we changed each problem description manually to develop HumanEvalComm, which we will use for evaluation in our work. This is done using a taxonomy of clarification types as described below.

Besides manual modification, it should be noted that we also tried to use LLM to modify the problem description, but we found that the modification by LLMs did not meet our standard. Specifically, the modification from using LLMs cannot guarantee that the modification will trigger clarifying questions. Similar limitations on using LLMs for Requirement Engineering have been also reported in (Arvidsson and Axell, 2023). Hence, we chose to manually modify all of the problem descriptions to provide this guarantee for HumanEvalComm.

Taxonomy of Clarification Types. To modify the problem description in an organized way, we propose the following clarification types based on both the literature in Requirement Engineering (RE) (Tukur et al., 2021; Dermeval et al., 2016) and our understanding of how feasible can the RE concepts be applied to problems in HumanEval:

  • Ambiguity: Some statements in the problem descriptions could be ambiguous and correspond to different concepts.

  • Inconsistency: Some statements in the problem descriptions show conflict or inconsistency between each other.

  • Incompleteness: Some concepts or conditions are missing in the problem descriptions.

Clarification Category Ambiguity Inconsistency Incompleteness
1a
1c
1p
2ac
2cp
2ap
Table 1. Problem descriptions with different combinations of clarification types being applied in HumanEvalComm.
Clarification Type Problem Description
N/A (Original) {python} def incr_list(l: list): ”””Return list with elements incremented by 1. ¿¿¿ incr_list([1, 2, 3]) [2, 3, 4] ¿¿¿ incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123]) [6, 4, 6, 3, 4, 4, 10, 1, 124] ”””
Ambiguity {python} def incr_list(l: list): ”””Return list with elements incremented by a number. ¿¿¿ incr_list([1, 2, 3]) [2, 3, 4] ¿¿¿ incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123]) [6, 4, 6, 3, 4, 4, 10, 1, 124] ”””
Incompleteness {python} def incr_list(l: list): ”””Return list with elements incremented ”””
Inconsistency {python} def incr_list(l: list): ”””Return list with elements incremented by 1. ¿¿¿ incr_list([1, 2, 3]) [3, 4, 5] ¿¿¿ incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123]) [7, 5, 7, 4, 5, 5, 11, 2, 125] ”””
Inconsistency & Ambiguity {python} def incr_list(l: list): ”””Return list with elements incremented by a number. ¿¿¿ incr_list([1, 2, 3]) [3, 4, 5] ¿¿¿ incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123]) [7, 5, 7, 4, 5, 5, 11, 2, 125] ”””
Table 2. Example of HumanEvalComm built upon HumanEval. The modified problem descriptions are shown in this table for problem number 42 of HumanEval. Specifically, the descriptions of the problem were modified to be inconsistent, ambiguous, or incomplete. The main goal of the HumanEvalComm dataset is to evaluate the degree of communication.

Modifying Problem Description. For each problem description, we manually change the problem description with regard to different clarification types. Modifying the problem descriptions is done manually by a software engineer with nearly a decade of experience in the industry. A second software engineer with more than 15 years of development experience reviewed the changed descriptions. The disagreements were marked and discussed among the two annotators until they reached an agreement about the changes, according to the definitions of Ambiguity, Inconsistency, and Incompleteness from RE. Each problem description was read carefully, and modifications were applied to the problem description. The definitions and examples of ambiguous, inconsistent, or incomplete requirements were reviewed by both people before conducting the manual modification of the problem descriptions, although both of them have requirements engineering expertise.

For each problem, we applied six different modifications: (1a) making the problem description ambiguous; (1c) modifying the description to be inconsistent; (1p) changing the problem description to make it incomplete. The next three modifications that we refer to as (2ac), (2cp), and (2ap) are a combination of the initial changes, being ‘ambiguous and inconsistent’, ‘inconsistent and incomplete’, and ‘ambiguous and incomplete’, respectively. For any of the above modifications, our standard is that applying the modification to the problem should trigger clarifying questions, which are necessary for generating the correct code.

Specifically, for 1a, to make the descriptions ambiguous, we tried to change the statement such that a human reader can interpret the statement in different ways. However, in practice, we found it very difficult to perform ambiguous modifications: adding Ambiguity in description only may not be enough to trigger clarifying questions, since we have additional information such as test examples, common sense reasoning, and function signatures. In other words, the description becomes ambiguous, but the test examples and function signatures are also given in the description, so that the right requirements can be inferred and thus correct code can be generated, without having to ask clarifying questions. Our solution to this issue is that we apply both Ambiguity and incorrectness to the description because it’s much easier and safer to trigger clarifying questions using Ambiguity and a certain level of incorrectness, instead of Ambiguity only. For example, changing the description from ”sort the array descendingly” to ”sort the array” may not trigger a question, because the function signature or test cases can imply the sorting is in descending order. However, changing the description to ”sort the array (descendingly or ascendingly)” can trigger questions.

For 1c, to make the descriptions inconsistent, we mainly changed the examples such that the output of the example does not match or contradict the problem description. It should be noted that most of the problem descriptions in the HumanEval benchmark contain examples of test cases with the input and output. When applying inconsistent modification, for each problem, we changed the output of the test examples in a meaningful way rather than randomly, to enhance the contradiction between test examples and text description.

In terms of 1p, we removed some parts of the description as incomplete modification. We made sure that after applying the incomplete modification, it’s not possible to generate the correct code, without asking questions to recover the missing content.

For 2ac, 2cp, and 2ap, we directly applied a combination of two clarification types from 1a, 1c, and 1p. For these cases, we create a new modification only if applying a combination of two types leads to a new description that is different from any of the two types. Therefore, for each problem, 1a, 1c, and 1p always exist, but 2ac, 2cp, or 2ap may not exist. Overall, the process of changing the descriptions took approximately 100 hours for initial modification and 30 hours to review and discuss the disagreements and come to a consensus.

Table 2 shows an example of the original problem description and three modified versions for problem number 42 in HumanEval. In this example, for Ambiguity, ”incremented by 1” is modified to ”incremented by a number”, forming an ambiguous description. For Incompleteness, a part of the text description and example test cases are removed. For Inconsistency, the output of examples is modified so that it contradicts the text description. For Inconsistency and Ambiguity, a combination of Inconsistency and Ambiguity is applied, making it a more challenging case to generate the correct code. It is worth mentioning that before constructing HumanEvalComm, we manually verified the problem descriptions in the HumanEval dataset and verified that the original problem descriptions do not have clarification issues (Ambiguity, Inconsistency, or Incompleteness), so we chose all of the 164 problems in HumanEval dataset in our evaluation.

2.2. Evaluation Measurement

We introduce the following metrics to effectively evaluate the communication competency of the models in code generation tasks.

Communication Rate. We propose the communication rate to evaluate the degree of communication skills for a given model. The communication rate is intended to capture the percentage of responses with clarifying questions instead of code for problems in HumanEvalComm. In the experiment, the prompt we use lets the model “either generate Python3 code (Respond directly with code only with markdown), or ask clarifying questions”. Therefore, in this work, we define the communication rate as the percentage of responses with no code snippets (non-code) for the initial modified problem descriptions:

communicaton_rate=#initialmodelresponseswithoutcode#initialmodelresponses𝑐𝑜𝑚𝑚𝑢𝑛𝑖𝑐𝑎𝑡𝑜𝑛_𝑟𝑎𝑡𝑒#𝑖𝑛𝑖𝑡𝑖𝑎𝑙𝑚𝑜𝑑𝑒𝑙𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑠𝑤𝑖𝑡𝑜𝑢𝑡𝑐𝑜𝑑𝑒#𝑖𝑛𝑖𝑡𝑖𝑎𝑙𝑚𝑜𝑑𝑒𝑙𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑠communicaton\_rate=\dfrac{\#initial\>model\>responses\>without\>code}{\#% initial\>model\>responses}italic_c italic_o italic_m italic_m italic_u italic_n italic_i italic_c italic_a italic_t italic_o italic_n _ italic_r italic_a italic_t italic_e = divide start_ARG # italic_i italic_n italic_i italic_t italic_i italic_a italic_l italic_m italic_o italic_d italic_e italic_l italic_r italic_e italic_s italic_p italic_o italic_n italic_s italic_e italic_s italic_w italic_i italic_t italic_h italic_o italic_u italic_t italic_c italic_o italic_d italic_e end_ARG start_ARG # italic_i italic_n italic_i italic_t italic_i italic_a italic_l italic_m italic_o italic_d italic_e italic_l italic_r italic_e italic_s italic_p italic_o italic_n italic_s italic_e italic_s end_ARG

In the experiment, we found that this simple metric that distinguishes whether the model returns code or non-code is already an effective approximation of communication skills.

Good Question Rate. In this research, we leverage a new LLM-based evaluator to give a question quality label for clarifying questions returned by the models. The labels are Good (The model asks insightful questions that help recover all the missing info), Fair (The model asks OK questions, but the questions do not fully cover the missing info), Bad (The model asks no questions or irrelevant questions that do not help at all to recover the missing/clarifying information). Given the question quality label, we define Good Question Rate as the percentage of responses with Good question quality labels:

good_question_rate=#initialmodelresponseswithGoodlabels#initialmodelresponses𝑔𝑜𝑜𝑑_𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛_𝑟𝑎𝑡𝑒#𝑖𝑛𝑖𝑡𝑖𝑎𝑙𝑚𝑜𝑑𝑒𝑙𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑠𝑤𝑖𝑡𝐺𝑜𝑜𝑑𝑙𝑎𝑏𝑒𝑙𝑠#𝑖𝑛𝑖𝑡𝑖𝑎𝑙𝑚𝑜𝑑𝑒𝑙𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑠{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}good\_question\_rate=% \dfrac{\#initial\>model\>responses\>with\>Good\>labels}{\#initial\>model\>% responses}}italic_g italic_o italic_o italic_d _ italic_q italic_u italic_e italic_s italic_t italic_i italic_o italic_n _ italic_r italic_a italic_t italic_e = divide start_ARG # italic_i italic_n italic_i italic_t italic_i italic_a italic_l italic_m italic_o italic_d italic_e italic_l italic_r italic_e italic_s italic_p italic_o italic_n italic_s italic_e italic_s italic_w italic_i italic_t italic_h italic_G italic_o italic_o italic_d italic_l italic_a italic_b italic_e italic_l italic_s end_ARG start_ARG # italic_i italic_n italic_i italic_t italic_i italic_a italic_l italic_m italic_o italic_d italic_e italic_l italic_r italic_e italic_s italic_p italic_o italic_n italic_s italic_e italic_s end_ARG

Pass@K. In evaluation, pass@k is a popular and widely used metric for evaluating the task of code generation (Chen et al., 2021; Min et al., 2023; Zhang et al., 2023a). Pass@k is defined as the ratio of ‘solved’ problems, in which as a problem is ‘solved‘ if any of the k𝑘kitalic_k code samples pass all the tests. Hence, we used Pass@1 in our evaluation.

Test Pass Rate. Besides the widely used pass@k, Test Pass Rate is also commonly used for evaluating code generation (Ouyang et al., 2023; Hendrycks et al., 2021). Specifically, the Test Pass Rate is defined as the proportion of successfully passed test cases in relation to the total number of test cases for LLM-generated code. This metric is useful in this work since it helps capture whether getting the right information by asking clarifying questions can indeed increase the correctness of generated code.

3. Empirical Study

3.1. Research Questions

In this section, we describe the research questions that we explore in this study.

RQ1: How do Code LLMs perform in communication competency when requirements in the problem descriptions are incomplete, inconsistent, ambiguous?
The rationale of RQ1 is centered around understanding and examining the current Code LLMs regarding their communication capabilities in code generation. The aim is to provide an initial understanding of the limitations and areas where Code LLMs may fall short in their communication skills. We evaluated different Code LLMs on carefully curated problems in the new benchmark, HumanEvalComm, where problem descriptions are manually modified to be incomplete, inconsistent, and ambiguous. We evaluated and compared the results of Code LLMs for different clarification categories, where one or two clarification types are applied to the original problems.

RQ2: How does Okanagan perform compared with Code LLMs in terms of communication skills?
Given the recent advances in LLM-based agent in addressing various applications (Xi et al., 2023), RQ2 aims to investigate the communication capabilities of our LLM agent approach, Okanagan, in comparison with Code LLMs. Therefore, we evaluated Okanagan which has a multi-round structure with customized prompts for code generation tasks. We analyzed and compared the results of Okanagan and Code LLMs.

3.2. Methodology Overview

Refer to caption
Figure 1. The visual illustration of the methodology on the evaluation of communication skills for Large Language Models of code.

Overview. Figure 1 shows the overview of our methodology for collecting the benchmark and conducting the empirical study. We first create the HumanEvalComm benchmark, by modifying 164 problem descriptions of the original HumanEval benchmark for code generation tasks using the taxonomy of clarification types, as described in Section 2. Second, we conduct evaluation and empirical studies based on HumanEvalComm to evaluate the communication competency of different models.

Figure 2 shows the flowchart for the evaluation of models, Code LLMs, and Okanagan. For each programming problem in the HumanEvalComm, there are up to six modified problem descriptions as described earlier in Table 1. For each modified problem, a prompt is used as the input of the model to either generate code or ask clarifying questions if needed. Then, if the model asks clarifying questions rather than generates code directly, the questions are sent to an LLM-based Evaluator, which evaluates the questions and generates a reply to answer the questions, based on all of the available information, including the modified problem, original problem, and the clarifying questions. Finally, the answers and the previous conversations are sent to the model to generate the code again directly.

Refer to caption
Figure 2. Flowchart for the evaluation of models, either Code LLMs or Okanagan (LLM agent), in communication capability.

LLM-based evaluator. With the advances of LLMs, a recent series of work has been proposed to use the powerful LLMs as the reference-free evaluators on Natural Language Generation (NLG) tasks (Kocmi and Federmann, 2023; Wang et al., 2023a; Kim et al., 2023; Kotonya et al., 2023; Gao et al., 2024). Given the expensive human efforts of human evaluations, we used the LLM-based evaluator to generate an answer to reply to the list of clarifying questions from the models (Kotonya et al., 2023; Gao et al., 2024). We prompted the LLM-based evaluator with the modified problem, original problem, and clarifying questions. The role of the LLM-based evaluator in this work is to 1) generate answers to the clarifying questions, and 2) calculate Good Question Rate, represented by an integer. The Good Question Rate is one of the evaluation metrics in our experiment. As for implementation, we used GPT 3.5 in the LLM-based evaluator in the evaluation. We tested both zero-shot and one-shot prompting, but in our evaluation, we found that one-shot prompting does not improve performance. This also aligns with the finding in literature (Kotonya et al., 2023; Gao et al., 2024). The detailed prompt of LLM-based evaluator is shown as follows.

Prompt for LLM-Based Evaluator:
The original description of a coding problem is modified so that the requirements become inconsistent, incomplete, or ambiguous. Given the modified description, some clarifying questions were raised to clarify the description. Given the original and modified problem description, evaluate the quality of the clarifying questions. Please provide an integer representing the quality of questions (3: Good questions that recover the modified requirements; 2: Fair questions but they cannot help recover the modified requirements; 1: No questions).

QUALITY=[your int]

Please also provide answers to the clarifying questions to recover the modified requirements in the original problem description compared to the modified one. If there are no clarifying questions at all, return empty answers.

ANSWERS=“‘[your answer]”’

Please strictly follow the format QUALITY=[the int] and ANSWERS=“‘[the answer]”’ in the response! Surround your answer with markdown!

### Questions: {clarifying_questions}

### Modified Problem Description: {problem}

### Original Description: {original_problem}

We manually checked the results of the LLM-based evaluator to see whether the output of the evaluator, the generated answers, and the evaluation of the questions, were correct. Overall, the generated answers and Good Question Rates are reasonable, but we do see some mistakes in both the generated answers and the Good Question Rates. For the Good Question Rates, there are some cases where the Good Question Rate from the LLM-based evaluator is 2 or 3 when there are no clarifying questions: For example, the ”questions” from the model are a combination of explanation and code, but no clarifying questions. This is somewhat related to the reported limitation (Gao et al., 2024) that LLM evaluators prefer to give high scores to responses that conflict with the facts in the dialogue history (Liu et al., 2023a). For generated answers, we sometimes observe that the provided answers do not recover the original requirements, due to either the evaluator itself or “no clarifying questions” mentioned above. To mitigate this issue, we have optimized the prompt for the LLM-based evaluator several times and checked the results manually. This includes adding sentences like “Please strictly follow the format QUALITY=[the int] and ANSWERS=“‘[the answer]”’ in the response!”, and “Surround your answer with markdown!” which eliminated many cases of format errors (answers and rates cannot be extracted correctly). Although the LLM-based evaluator shows effectiveness in our task, we understand that LLM-based NLG evaluation is still challenging (Gao et al., 2024; Liu et al., 2023a), and future work is required to address the errors mentioned above.

3.3. Code Large Language Models

We performed our evaluation on five widely used LLMs. This includes three open-sourced instruction-tuned Code LLMs, one open-sourced instruction-tuned LLM, and one commercial LLM. For open-source models, we used models with the largest possible model size within our limited computing resources in our evaluation.

  • CodeLlama (Instruction tuned version, 13B) (Roziere et al., 2023) is an open-source LLM released by Meta for coding, built on top of Llama 2, with foundation models and instruction models. CodeLlama was chosen because of its wide usage and top performance in HumanEval. We tested the instruction model CodeLlama-Instruct-13B in our experiment since we did not have the computing resources to run models with 34B. The same applies to the rest open-source models.

  • DeepSeek Coder (Instruction tuned version, 7B) (Guo et al., 2024) is an open-source Code LLM trained on both 87% code and 13% natural language. Each of the models was pre-trained on 2 trillion tokens. We selected this model because it achieved top 5 performance in Big Code Models Leaderboard (big, 2024) on the HuggingFace platform. The Big Code Models Leaderboard (big, 2024) evaluates the performance of base multilingual code generation models on the HumanEval benchmark and MultiPL-E. We used the model of 7 billion parameters in the evaluation.

  • DeepSeek Chat (Instruction tuned version, 7B) (Bi et al., 2024) is an open-source LLM released by DeepSeek AI, trained on datasets of 2 trillion tokens. We selected this model because we wanted to evaluate the communication skills of models trained from different sources such as natural languages, code, and a combination of both. We compared its performance with the DeepSeek Coder to understand whether more natural languages in pre-training are beneficial to communication skills. We used the model of 7 billion parameters in the evaluation.

  • CodeQwen1.5 Chat (Instruction tuned version, 7B) (Bai et al., 2023) is an open-souce Code LLM released by Qwen Team, trained on 3 trillion tokens of code data. CodeQwen1.5 Chat is the Code-Specific version of Qwen1.5. The model is a transformer-based decoder-only language model and includes group query attention (GQA) for efficient inference. We selected this model because it achieved top 5 performance in Big Code Models Leaderboard (big, 2024).

  • ChatGPT, released by OpenAI are powerful models for generation tasks. We used parameter-frozen versions of models (gpt-3.5-turbo-0125) to ensure the reproducibility of the evaluation results.

Note that all of the evaluated models above are instruction-tuned models because, in the evaluation, the ability to ask clarifying questions with the given prompts is needed for the models. Besides instruction-tuned models, there are also foundation models, but we didn’t report results for foundation models. We found that foundation models without instruction tuned are not suitable for our evaluation, because their task is only to complete code and are not capable of instructions such as ”either generate code or ask clarifying questions”.

Refer to caption
Figure 3. An illustration of the process of Okanagan, an LLM agent approach.

3.4. LLM-Agent Approach (Okanagan)

Following the recent works in LLM agent, including collaboration mechanisms for LLM agents (Zhang et al., 2023c) and self-correcting strategies of LLMs (Pan et al., 2023), we proposed and evaluated an LLM agent approach, Okanagan, that leverages multi-round structure and customized prompt format for asking clarifying questions in code generation tasks. We introduce three rounds in Okanagan:

  • Round 1: the agent generates code directly given the modified problem.

  • Round 2: the agent generates clarifying questions (if needed) given the modified problem and generated code. If no questions, directly return the code generated in Round 1.

  • Round 3: the agent generates code again, given the above conversation history (including the modified problem, clarifying questions, and their answers).

This structure is inspired by the existing LLM agents approach  (Zhang et al., 2023c), with three rounds and customized prompts for our task of code generation. In terms of actions in each round, the action in Round 1 is to generate code. The action in Round 2 is to ask clarifying questions. As mentioned above, the code in Round 1 is returned if no questions are asked. Otherwise, a reflection is conducted to generate code again with the previous conversation history that includes clarifying questions and answers provided by the LLM-based evaluator. We adopted this structure because it can be easily extended to different parameter values. For example, we stop at Round 3 (in other words, we set the total number of rounds to 3) in our evaluation, but we can set a different number of rounds in theory. Besides the number of rounds, other parameters can be changed as well. We describe the set of parameters for Okanagan as follows:

  1. (1)

    number of agents (default is 1).

  2. (2)

    number of rounds (default is 3).

  3. (3)

    Action in each round: Generate code or ask questions (default is: Round 1 - Generate code, Round 2 - Ask questions, Round 3 - Generate code with Reflection)

  4. (4)

    thinking pattern: Debate or Reflection (Zhang et al., 2023c). (default is Reflection111Debate can be used only when multiple agents are used. In the implementation of Okanagan, we use a single agent, and thus in Round 3, the single agent reflects on the generated code in Round 1 based on the additional information in Round 1 (generated code) and Round 2 (clarifying questions and their answers provided by LLM-based evaluator). )

Note that we tried to minimize the complexity of Okanagan using default parameters such as one agent and three rounds, but in future work, the structure can easily scale from single-agent to multi-agents by setting the parameter for the number of agents. If more than one agent is used, and when the thinking pattern is Debate in a given round, the agents would exchange their previous responses as a way of collaboration. Given our specific task, compared with (Zhang et al., 2023c), we added a new parameter in Okanagan: action in each round, to indicate the action for agents in a given round.

LLM Agent Implementation in Evaluation. In our evaluation, we tested Okanagan using the default parameters as mentioned. We used ChatGPT 3.5 as the LLM in each of the three rounds in Okanagan mainly for easier comparison with ChatGPT 3.5. For other LLM agent methods to compare in evaluation, we searched other publicly available LLM agent but did not find an appropriate open-sourced LLM agent implementation for code generation task with a focus and potential action to ask clarifying questions.

3.5. Experiment Setup

In experiments, we implemented our evaluation in Python 3.12. We partially used the code from  (Ouyang et al., 2023) on the Non-Determinism of ChatGPT and from (Min et al., 2023) on testing open-source models. All of the experiments for ChatGPT and Okanagan were conducted on a server with an Intel i7-6700K CPU (4.00 GHz), 32 GB RAM. The other experiments for open-source models that require GPUs were conducted on an Intel Xeon Gold 6130 CPU (2.1GHz), 44 GB RAM, and 4 GPUs (Tesla V100-SXM2-16GB). The names of the HuggingFace models we use in the experiments are deepseek-coder-6.7b-instruct, deepseek-llm-7b-chat, CodeQwen1.5-7B-Chat, and CodeLlama-13b-Instruct-hf.

4. Results and Analysis

4.1. Communication Competency of Code LLMs on HumanEvalComm (RQ1)

To answer RQ1, we conducted experiments to evaluate the communication capability for problems in HumanEvalComm. Since we focus on the results of Code LLMs in RQ1, the results of Okanagan will be discussed separately in RQ2. For each problem modified according to a category clarification type or combinations of clarification types, we followed the process in Figure 2. We calculated the following evaluation metrics: communication rate, good question rate, Pass@1, and Test Pass Rate. We compared Pass@1 and Test Pass Rate between the modified problem in HumanEvalComm and the original problem in HumanEval. Table 3 summarizes the overall results we generated for the evaluated models. Figure 4 rearranges the numbers in Table 3 in a visual illustration to facilitate a more direct comparison between different models. We first analyzed the overall results, then we looked into the results in each clarification category. For each category, we evaluated them with statistical testing using the Student’s t-test and obtained the p-value.

Let’s first look at the communication rate. From Table 3 and Figure 4, the communication rate for ChatGPT, CodeLlama, and CodeQwen1.5 Chat is below 20%, significantly less than the perfect score of 100%. This means that for a problem description in which clarifying questions are needed for generating correct code, these models raise questions with less than a 20% chance. The recently released DeepSeek Coder and DeepSeek Chat achieved higher communication rates of 30.76% and 37.93%. One hypothesis to explain this is that the general capability from DeepSeek Chat is important for a high communication rate.

Besides communication rate, Good Question Rate is also a useful metric, because it reports the percentage of questions labeled as Good questions based on the content of questions using an LLM-based evaluator. In terms of Good Question Rate, likewise, as shown Table 3 and Figure 4, ChatGPT, CodeLlama, and CodeQwen1.5 Chat have a lower average question quality than DeepSeek Coder and DeepSeek Chat. Particularly, ChatGPT has a much lower rate than other open-source models. From our manual inspection, one of the reasons is because the open-source models sometimes do not follow the instructions to return either code blocks or questions. They sometimes output code blocks together with some explanations. This type of response is not a clarifying question, but the LLM-based Evaluator sometimes labels them as “Good Question”, which we described in detail in section 3.2. Regardless, based on the numbers, there is still significant room to improve on Good Question Rate.

For pass rate measurements, ChatGPT, CodeLlama, and DeepSeek Chat achieve overall lower results than CodeQwen1.5 Chat and DeepSeek Coder for both Pass@1 and Test Pass Rate, based on Table 3 and Figure 4. The trend is similar for both Pass@1 and Test Pass Rate. One hypothesis is that this result is in part due to the higher Pass@1 and Test Pass Rate of CodeQwen1.5 Chat and DeepSeek Coder in the original HumanEval benchmark. On the relative change, we see an increase in Pass@1 and Test Pass Rate from original HumanEval to HumanEvalComm for DeepSeek Chat. According to our investigation, this is because of illegal response formats: many responses from DeepSeek Chat for the original HumanEval do not have code markup, so these responses without code markup failed all the tests. For the rest open-source models, the relative drop in the Pass@1 is between 35% and 52%. The relative drop in the Test Pass Rate is between 17% and 35%.

Model Pass@1 Test Pass Rate Comm. Good
HmEval HmEvalComm HmEval HmEvalComm Rate Question Rate
ChatGPT 65.58% 31.34% 76.42% 49.39% 14.21% 13.43%
CodeLlama 29.88% 19.35% 45.71% 37.79% 10.16% 37.55%
CodeQwen1.5 Chat 76.83% 47.61% 84.4% 62.89% 4.82% 41.68%
DeepSeek Coder 71.78% 45.68% 79.44% 62.25% 30.76% 61.42%
DeepSeek Chat 12.8% 26.32% 13.86% 44.52% 37.93% 58.71%
Okanagan 27.45% 39.62% 33.45% 56.98% 72.73% 52.24%
Table 3. Evaluation result across all clarification categories on Pass@1, Test Pass Rate, communication rate, and Good Question Rate with different models on HumanEvalComm (HmEvalComm in the table). Additionally, the Pass@1 and Test Pass Rate on the original problems in HumanEval (HmEval in the table) are also shown. Top 3 results are marked as bold.
Refer to caption
Figure 4. Comparison of the effectiveness of the models in Communication Rate, Good Question Rate (left), and Pass@1, Test Pass Rate (right). Note that in the right figure, the stars represent the original performance of the corresponding model with the same color in the HumanEval benchmark. This shows visually how the performance has changed when the problem description is modified.
Clarification Model Pass@1 Test Pass Rate Comm. Good
Category HmEval HmEvalComm (p-value) HmEval HmEvalComm (p-value) Rate Question Rate
1a ChatGPT 65.58% 33.77%*** 76.42% 54.98%*** 5.84% 4.55%
(0.000) (0.000)
CodeLlama 29.88% 16.46%*** 45.71% 36.24%** 13.64% 42.68%
(0.004) (0.037)
CodeQwen1.5 Chat 76.83% 46.34%*** 84.4% 62.62%*** 5.84% 43.29%
(0.000) (0.000)
DeepSeek Coder 71.78% 43.29%*** 79.44% 61.2%*** 25.97% 62.8%
(0.000) (0.000)
DeepSeek Chat 12.8% 21.95%** 13.86% 40.62%*** 39.61% 56.71%
(0.029) (0.000)
Okanagan 27.45% 44.81%*** 33.45% 64.22%*** 65.58% 52.60%
(0.001) (0.000)
1c ChatGPT 65.58% 53.25%** 76.42% 66.37%** 5.84% 6.49%
(0.027) (0.028)
CodeLlama 29.88% 32.93% 45.71% 52.14% 7.79% 32.32%
(0.553) (0.172)
CodeQwen1.5 Chat 76.83% 67.68%*** 84.4% 79.9%*** 7.79% 46.95%
(0.000) (0.000)
DeepSeek Coder 71.78% 61.59%* 79.44% 76.75% 15.03% 53.66%
(0.051) (0.501)
DeepSeek Chat 12.8% 39.63%*** 13.86% 56.89%*** 28.1% 61.59%
(0.000) (0.000)
Okanagan 27.45% 57.14%*** 33.45% 70.01%*** 55.19% 42.86%
(0.000) (0.000)
1p ChatGPT 65.58% 27.95%*** 76.42% 44.14%*** 31.68% 26.71%
(0.001) (0.000)
CodeLlama 29.88% 15.24%*** 45.71% 29.41%*** 7.74% 33.54%
(0.000) (0.000)
CodeQwen1.5 Chat 76.83% 46.95%*** 84.4% 59.36%*** 1.3% 38.41%
(0.000) (0.000)
DeepSeek Coder 71.78% 45.12%*** 79.44% 58.57%*** 48.7% 68.9%
(0.000) (0.000)
DeepSeek Chat 12.8% 21.95%** 13.86% 43.73%*** 37.66% 55.49%
(0.029) (0.000)
Okanagan 27.45% 36.65%* 33.45% 54.16%*** 93.17% 58.39%
(0.082) (0.000)
Table 4. Evaluation result for clarification categories 1a,1c,1p on Pass@1, Test Pass Rate for original problems in HumanEval and modified problems in HumanEvalComm, communication rate, and Good Question Rate with different models. *p¡0.1; **p¡=0.05; ***p¡0.01

Breakdown on Categories with One Clarification Type. Besides overall results, we would like to further understand the correlation between results and different clarification categories. Table 4 shows the results breakdown on the clarification categories 1a, 1c, and 1p, where only one level of clarification type (Ambiguity, Inconsistency, and Incompleteness) is applied to the problem. For ChatGPT, among the three clarification types, Incompleteness has the overall highest communication rate, suppressing the communication rates of Ambiguity and Inconsistency. This means that Incompleteness is relatively easier to detect and raise than Ambiguity and Inconsistency for models such as ChatGPT, DeepSeek Coder, and DeepSeek Chat. Inconsistency has the lowest communication rate among the three types. One hypothesis to explain that is that Inconsistency requires stronger reasoning capability to detect. Good Question Rate follows similar patterns as the communication rate, indicating that the quality of questions is proportional to the communication rate for Code LLMs.

Two exceptions to the above statements are that CodeLlama and CodeQwen1.5 Chat achieved the lowest communication rate in the Incompleteness category than in Ambiguity and Inconsistency. Similar trends can be found in Good Question Rates. This shows that some Code LLMs such as CodeLlama and CodeQwen1.5 Chat are trained and designed in a way so that they tend to complete code rather than ask questions even when requirements are incomplete. This reflects the generative nature of LLMs: given a prompt, the LLM as a generative model essentially generates and completes text (or code in our scenario) based on the statistical model (Shanahan, 2024). Thus, one hypothesis of the low result is that LLMs have disadvantages due to their generative nature when evaluating communication in coding tasks. This result also indicates that more intelligent AI agents such as LLM-based agents, where LLM as a generative model is a component, have the potential to outperform LLMs in the evaluation on communication capability (Shanahan, 2024).

For the testing performance of the generated code, interestingly, Incompleteness receives overall the lowest Pass@1 (12.8% similar-to\sim 46.95%) and Test Pass Rate (29.41% similar-to\sim 59.36%) for all of the models. One hypothesis is that if no clarifying questions were asked for problems with Incompleteness, the generated codes would be typically incorrect due to lack of information. Inconsistency has the highest Pass@1 and Test Pass Rate, because, for problems with Inconsistency, LLMs are sometimes able to generate correct code without asking clarifying questions. For 1a, 1c, and 1p categories, all except 3 changes in Pass@1 and Test Pass Rate are statistically significant, with p-values less than 0.1.

Breakdown on Categories with Two Clarification Types. Table 5 shows the results breakdown on the clarification category of 2ac, 2ap, and 2cp, where a combination of two clarification types is applied to the problem. Compared with applying one clarification type in Table 4, two clarification types have on average slightly higher communication rates than one clarification type. This makes sense as a combination of two clarification types naturally triggers more questions than one type. Consequently, we see a similar trend for the Good Question metric.

In terms of testing performance for the combination of two clarification types, both the Pass@1 and Test Pass Rate decreased significantly from one clarification type to two types. Therefore, compared with one clarification type, a combination of two clarification types further reduces the Test Pass Rate significantly, but only slightly enlarges the communication rate and the quality of clarifying questions on average. The slight increase in communication rate is reasonable given the increased clarification difficulty. The decreased pass rates show that it is hard for the models to get the necessary requirements for solving the task given the challenging situation for combinations of clarification types. 2cp, with a combination of both Inconsistency and Incompleteness, results in lower Pass@1 and Test Pass Rate compared with 2ap and 2ac. For 2ac, 2cp, and 2ap categories, 75% of the changes in Pass@1 and Test Pass Rate are statistically significant since the p-values are less than 0.05 in these changes.

Answer to RQ1: More than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. Typically, the Pass@1 and Test Pass Rate of Code LLMs drop by 35% similar-to\sim 52% and by 17% similar-to\sim 35% respectively, with statistical significance in each category for over 75% numbers. Among the three clarification types, the Incompleteness category results in higher communication rates and Good Question Rates, but lower Pass@1 and Test Pass Rate than the Ambiguity and Inconsistency categories for Code LLMs. A combination of two clarification types leads to slightly higher communication rates but much lower Test Pass Rates than one clarification type.
Clarification Model Pass@1 Test Pass Rate Comm. Good
Category HmEval HmEvalComm (p-value) HmEval HmEvalComm (p-value) Rate Question Rate
2ac ChatGPT 65.79% 20.39%*** 76.77% 42.66%*** 5.26% 7.90%
(0.000) (0.000)
CodeLlama 29.63% 14.2%*** 45.65% 36.95%* 12.5% 42.59%
(0.001) (0.054)
CodeQwen1.5 Chat 77.16% 40.12%*** 84.28% 59.56%*** 7.24% 47.53%
(0.000) (0.000)
DeepSeek Coder 71.43% 40.74%*** 79.18% 61.72%*** 26.97% 58.64%
(0.000) (0.000)
DeepSeek Chat 12.96% 20.99%* 14.03% 39.09%*** 44.08% 64.2%
(0.055) (0.000)
Okanagan 27.15% 25.66% 33.23% 47.37%*** 64.47% 45.39%
(0.769) (0.004)
2cp ChatGPT 77.42% 15.63%*** 84.91% 34.79%*** 6.25% 9.38%
(0.000) (0.000)
CodeLlama 38.24% 14.71%** 60.9% 33.04%*** 9.68% 29.41%
(0.028) (0.003)
CodeQwen1.5 Chat 73.53% 38.24%*** 83.57% 55.82%*** 0% 29.41%
(0.003) (0.004)
DeepSeek Coder 70.59% 29.41%*** 80.93% 50.97%*** 12.9% 52.94%
(0.000) (0.002)
DeepSeek Chat 11.76% 26.47% 11.76% 48.38%*** 22.58% 52.94%
(0.127) (0.000)
Okanagan 32.26% 28.13% 34.98% 44.97% 84.38% 59.38%
(0.726) (0.375)
2ap ChatGPT 59.42% 16.67%*** 71.09% 32.39%*** 37.50% 29.17%
(0.000) (0.000)
CodeLlama 28.38% 17.57% 41.59% 31.27% 8.57% 39.19%
(0.120) (0.135)
CodeQwen1.5 Chat 74.32% 28.38%*** 82.71% 44.35%*** 1.45% 28.38%
(0.000) (0.000)
DeepSeek Coder 71.23% 36.49%*** 81.36% 48.17%*** 56.52% 70.27%
(0.000) (0.000)
DeepSeek Chat 9.46% 24.32%** 10.9% 35.99%*** 52.17% 56.76%
(0.016) (0.000)
Okanagan 27.94% 29.17% 34.52% 43.94% 94.44% 66.67%
(0.874) (0.196)
Table 5. Evaluation result for clarification categories 2ac,2cp,2ap on Pass@1, Test Pass Rate for original problems in HumanEval and modified problems in HumanEvalComm, communication rate, and Good Question Rate with different models. *p¡0.1; **p¡=0.05; ***p¡0.01

4.2. Comparing Okanagan with Code LLMs in communication skills (RQ2)

Overview. RQ2 aims to compare LLM agent approach, Okanagan, with the current Code LLMs in communication skills. From Table 3, while the communication rate for ChatGPT is below 20%, the communication rate of Okanagan with ChatGPT is over 70%, much higher than ChatGPT and all the other models. This shows that changing from LLM to LLM agent significantly increases the communication rate. For testing performance, Okanagan achieves better results than all models except CodeQwen1.5 Chat and DeepSeek Coder in both Pass@1 and Test Pass Rate. The trend is similar for both Pass@1 and Test Pass Rate as mentioned previously. This shows the effectiveness of Okanagan in obtaining the necessary information by asking clarifying questions.

However, one drawback is that Okanagan achieves a much lower Pass@1 and Test Pass Rate than ChatGPT. This is because the multi-round structure sometimes asks questions as an initial response even for original problems, but in original HumanEval, it is expected to directly return code in the initial responses, and evaluation is conducted on the code in the initial responses. This means for original problems that do not need asking questions, Okanagan sometimes still asks questions that appear to be unnecessary, since the original problem is known as complete and has no requirement issue. This is a valid limitation and future work is required (e.g., multi-agent debate (Du et al., 2023)) to address this limitation of asking unnecessary questions. If addressed, it indicates much stronger communication capability as LLM agent knows intelligently when to stop asking (Hassan et al., 2024). On the other hand, on HumanEvalComm, Okanagan shows that with LLM agent on top of base LLM (ChatGPT), we can get much better results in all metrics than ChatGPT. This shows the advantage in communication capability of LLM agent over LLM, as LLM agent can obtain needed info by asking questions to increase pass rates.

Breakdown on Categories with One Clarification Type. In terms of the results breakdown on the clarification category of 1a, 1c, and 1p, Okanagan shows a much higher communication rate than any other Code LLMs for 1a, 1c, and 1p. Incompleteness again has the highest communication rate among the three categories, with a rate of more than 90%. A similar trend holds for Good Question Rate, but DeepSeek Coder and DeepSeek Chat get higher Good Question Rate than Okanagan in 1a and 1c. Consequently, Okanagan achieves the top three performance for both the highest Pass@1 and Test Pass Rate among all the models with statistical significance.

Breakdown on Categories with Two Clarification Types. For the results breakdown on the clarification category of 2ac, 2ap, and 2cp, Okanagan has the highest communication rate among all the models, reaching a high 85% in 2cp and 94% in 2ap. A similar trend still holds in Good Question Rate. For Pass@1 and Test Pass Rate, Okanagan still ranks at the top 3 in most cases, but the statistical testing shows that most of the p-values are not small enough to show strong evidence of statistical significance, due to the relative smaller sample size of the categories with two types.

Answer to RQ2: Okanagan, as a LLM agent approach that uses LLM (specifically ChatGPT), effectively increases Communication Rate and Good Question Rate by an absolute 58% and 38% respectively, and thus boosts Pass@1 and Test Pass Rate by an absolute 8% and 7% respectively, when the problem descriptions are modified based on given clarification categories. This indicates headroom for achieving more effective communication capability using LLM agent.
1st Model Input You are an expert software developer who writes high-quality code. With below information, please either generate Python3 code (Respond directly with code only with markdown), or ask clarifying questions: {python} def how_many_times(string: str, substring: str) -¿ int: ””” Find how many times a given substring can be found in the original string. ”””
1st Model Response Sure, I can help you with that. However, I need more information to provide a precise solution. Could you please provide the following details? 1. What should be the behavior if the substring is not found in the string? 2. What should be the behavior if the substring is an empty string? 3. Are there any specific constraints or requirements you would like to consider? With these details, I can provide a more accurate solution.
LLM-based Evaluator Output Answer to Questions: 1. If the substring is not found in the string, the function should return 0. 2. If the substring is an empty string, the function should return 0. 3. There are no specific constraints or requirements other than counting the overlapping cases.
2nd Model Input [1st Model Input][1st Model Response][Answer to Questions]… Given the above conversations, generate Python code directly (Markdown) to solve the coding problem:
2nd Model Response Sure, based on the provided information, here is a Python function that should solve the problem: {python} def how_many_times(string: str, substring: str) -¿ int: ””” Count the number of non-overlapping occurrences of a substring in a string. Parameters: string (str): The string to search within. substring (str): The substring to search for. Returns: int: The number of times the substring occurs in the string, or 0 if the substring is empty. ””” if not substring: return 0 count = 0 start = 0 while start ¡ len(string): pos = string.find(substring, start) if pos != -1: count += 1 # Ensure non-overlapping occurrences start = pos + len(substring) else: break return count
Table 6. Example case for DeepSeek Coder in problem 18 of HumanEvalComm. The clarification type of this problem is 1p (Incompleteness). The communication rate is 100%, the Good Question Rate is Good, and Test Pass Rate is 100% for this example.
1st Model Input You are an expert software developer who writes high quality code. With below information, please either generate Python3 code (Respond directly with code only with markdown), or ask clarifying questions: {python} def triangle_area(a, h): ”””Given length of a side and high return area for a triangle. ¿¿¿ triangle_area(5, 3) 15 ”””
1st Model Response Here is the Python code for the function ‘triangle_area’: {python} def triangle_area(a, h): return (a * h) / 2 This function takes two arguments, ‘a’ and ‘h’, which represent the length of a side and the height of a triangle, respectively. It returns the area of the triangle, which is calculated by multiplying the length of the side by the height and then dividing by 2. Here is an example of how to use the function: {python} print(triangle_area(5, 3)) # Output: 15 This will print the area of a triangle with a side length of 5 and a height of 3, which is 15.
LLM-based Evaluator Output Answer to Questions: N/A
2nd Model Input [1st Model Input][1st Model Response][Answer to Questions]… Given the above conversations, generate Python code directly (Markdown) to solve the coding problem:
2nd Model Response Given the conversations, here is the Python code to solve the coding problem: {python} def triangle_area(a, h): return (a * h) / 2 print(triangle_area(5, 3)) # Output: 15 This code defines a function ‘triangle_area’ that takes two arguments, ‘a’ and ‘h’, which represent the length of a side and the height of a triangle, respectively. The function returns the area of the triangle, which is calculated by multiplying the length of the side by the height and then dividing by 2. The code also includes a call to the ‘triangle_area’ function with the arguments ‘5’ and ‘3’, which prints the area of the triangle, which is ‘15’.
Table 7. Example case for CodeLlama in problem 45 of HumanEvalComm. The clarification type of this problem is 1c (Inconsistency). The communication rate is 100% given both code block and text are returned rather than code block only. The Good Question Rate is Bad, indicating no clarifying questions are detected.

5. Analysis and Discussion

In this section, we provide some example cases to understand the results of Code LLMs and Okanagan. Then we also add discussions on analyzing the influence of 1) the extent of incomplete requirements and 2) the hyperparameters.

5.1. Example Cases

Besides the numeric results, we also study two cases to better understand the evaluation steps. Table 6 illustrates the result for a successful example problem (HumanEval/18) in HumanEvalComm. The clarification type of this problem is 1p (Incompleteness), so the requirement of ”Count overlapping cases.” is missing in the problem description. Given this 1st model input, DeepSeek Coder was able to detect the incomplete requirement in the problem and ask clarifying questions rather than generating code directly. Therefore, the communication rate is 100% for this single case. With the 1st model input and response, the LLM-based evaluator outputs the Good Question Rate of Good for our evaluation and answers to the clarifying questions. This answer provides the missing requirement of ”counting overlapping cases” from the original problem.

In the 2nd model input, the model is given the previous conversations, including the 1st model input, 1st model response (clarifying questions), and the answer from the LLM-based evaluator to the clarifying questions. With the requirement of “counting overlapping cases”, the model correctly solved the problem with 100% Test Pass Rate. Note that we can again let the model either ask questions or generate code, same as the instruction in the 1st model input, but we chose to perform only one question-answer round for simplicity of our evaluation.

Another case is shown in Table 7 for another example problem (HumanEval/45) in HumanEvalComm using CodeLlama. The clarification type of this problem is 1c (Inconsistency): in the test case, triangle_area (5, 3) returns 15 instead of 7.5. This causes Inconsistency between the test case and the problem description. Given the 1st model input, CodeLlama did not return a code block only according to the instruction, but a mix of code block and text explanation. Therefore, the communication rate is 100% for this case, but the LLM-based evaluator outputs the Good Question Rate of 1 (No questions). In this case, the Inconsistency issue was not captured and no clarifying questions were asked. For the evaluation metrics, the Good Question Rate of Bad successfully punished this case, but Test Pass Rate and communication rate failed to capture and punish the issue.

% Removed in Description Example Problem Description
0% def encode_cyclic(s: str): returns encoded string by cycling groups of three characters. # split string to groups. Each of length 3. groups = [s[(3 * i):min((3 * i + 3), len(s))] for i in range((len(s) + 2) // 3)] # cycle elements in each group. Unless group has fewer elements than 3. groups = [(group[1:] + group[0]) if len(group) == 3 else group for group in groups] return ””.join(groups) def decode_cyclic(s: str): takes as input string encoded with encode_cyclic function. Returns decoded string.
30% def encode_cyclic(s: str): returns encoded string by cycling groups of three characters. split string to in each group. Unless group has fewer elements than 3. groups = [(group[1:] + group[0]) if len(group) == 3 else group for group in groups] return ””.join(groups) def decode_cyclic(s: str): takes as input string encoded with encode_cyclic function. Returns decoded string.
50% def encode _cyclic(s: str): returns encoded string by cycling groups of three characters. split string to groups. Each of length 3. groups = [s[(3 * i):min((3 * i + takes as input string encoded with encode_cyclic function. Returns decoded string.
90% def encode_cyclic(s: str): encode_cyclic function. Returns decoded string.
Table 8. Example of randomly removing parts of the problem description from problem number 38 of HumanEval.

5.2. Investigating Different Extent of Incomplete Modification (1p)

To dig deeper into the results of incomplete modification (1p), we did further investigations to understand the correlation between the ratio of removed content in the problem description for 1p and the corresponding results. We investigated the results by removing a random text block in the problem descriptions. Specifically, for each problem description in words, we randomly remove a list of consecutive words where the size of the list is X𝑋Xitalic_X percentage of the total number of words in the problem description. We empirically choose the values X=30%,50%,90%𝑋percent30percent50percent90X=30\%,50\%,90\%italic_X = 30 % , 50 % , 90 % in this experiment. Table 8 shows an example of problem descriptions with different X𝑋Xitalic_X. We can see that even when X𝑋Xitalic_X is as small as 30%, it would be ideal for LLM to ask questions to get the missing piece of information to ensure the LLM fully understands the problem and generates high-quality code. When X𝑋Xitalic_X becomes larger, such as 90%, the problem description becomes almost impossible to conduct the code generation task with high accuracy.

To mitigate the risk of randomness in this investigation, for each problem, we ran the experiment 5 times, and reported the metrics used in (Ouyang et al., 2023), including Mean, Variance, Max Diff, and Ratio of Worst. We calculated the mean and variance of the 5 Test Pass Rates and communication rates for each problem, and reported the average among all problems, as Mean and Variance. The Max Diff is the maximum value of the maximum diff among all problems. “Ratio of Worst (Cases)” is the ratio of problems with the maximum diff of test pass rate being 1. Please refer to (Ouyang et al., 2023) for complete descriptions of these metrics. We report the Test Pass Rates and communication rates in Table 9 based on the percentage of removed descriptions (0%, 30%, 50%, and 90%). We used ChatGPT 3.5 as the model in the investigation. We can see that, in terms of Test Pass Rates, as the percentage of removed information increases, there is a noticeable decrease in the mean Test Pass Rate, indicating that incomplete problem descriptions negatively affect the ability to pass tests. This trend is further supported by the variance and maximum difference metrics, which show increasing variability and differences in Test Pass Rates as information is removed. The ratio of the worst case also suggests that a higher percentage of removed information leads to a lower Test Pass Rate.

For communication rates, incomplete problem descriptions lead to an increase in the mean communication rate as the percentage of removed information increases. This is expected because when there is more missing information, LLM tends to ask more questions rather than directly generating code. The variance and maximum difference metrics also reflect higher variability and differences in communication rates with incomplete problem descriptions. The ratio of the worst case indicates that a higher percentage of removed information results in a more significant increase in communication rates. Figure 5 shows a visual comparison of these two metrics using numbers from Table 9 when different X% of content is removed in the problem description. This shows visually that LLM tends to ask more questions as more content in the description is removed, but this starts to happen only after half of the descriptions are removed. When 90% of the description is removed, LLM asks questions for only 54% of problem descriptions. To summarize, 95% of responses from Code LLMs still generate code even when 50% of problem descriptions are randomly removed. When the removed percentage of description increases to 90%, 46% of responses from Code LLMs still generate code. This shows a rather weak ability of Code LLMs to ask clarifying questions when information is randomly removed and therefore indicates plenty of research opportunities in pushing the curves of LLM or LLM agent toward the human software engineers.

Refer to caption
Figure 5. The chart of communication rate and Test Pass Rate when different X% of content is removed in the problem description.

5.3. Investigating Different Hyperparameter

To evaluate the impact of the hyperparameter in the experiments, we also investigated using different temperatures as the hyperparameter of ChatGPT. Temperature is a hyperparameter that controls the randomness of ChatGPT-generated text. The default temperature of ChatGPT is 1.0 and we tested the result of using temperature as 0, 1.0, 2.0. Same as in the previous incomplete modification investigation, we ran the experiment 5 times and checked the metrics including mean and variance. We ran with the percentage of removed information being 50%.

Table 10 presents results on the impact of the temperature hyperparameter in ChatGPT in HumanEvalComm. We found that the mean Test Pass Rate dropped from 49.6% (variance=0.088, max diff=0.523, ratio of worst=0.390) to 40.7% (variance=max diff=ratio of worst=0), and the communication rate changed from 1.8% (variance=0.011,max diff=0.067, ratio of worst=0.067) to 3.7% (variance=max diff=ratio of worst=0) when the temperature changed from 1 to 0. We also tested 2.0 as temperature, but found that most of these requests timed out due to longer processing time on the OpenAI server end due to high temperature. We can see that the variance, max diff, and ratio of worst become 0 when the temperature is 0. This means lower temperature does indicate much more deterministic and focused results. Interestingly, as the temperature dropped from 1 to 0, the mean Test Pass Rate dropped, but the communication rate increased. This could be because lower temperature leads to less creative and diverse output to “guess” the code, therefore somehow forcing the model to ask questions to seek additional information. Given the temperature of 1.0 is the default setting, and the results do not show a significant impact of varying temperature, in the evaluation, we use the temperature of 1.0 for both ChatGPT and Okanagan.

% Removed Metric Mean Variance Max Diff Ratio of Worst
0% Test Pass Rate 66.5% 0.122 0.690 0.561
0% Communication Rate 3.3% 0.019 0.110 0.110
30% Test Pass Rate 49.6% 0.088 0.532 0.390
30% Communication Rate 1.8% 0.011 0.067 0.067
50% Test Pass Rate 41.1% 0.085 0.523 0.317
50% Communication Rate 5.5% 0.028 0.165 0.165
90% Test Pass Rate 44.8% 0.124 0.684 0.543
90% Communication Rate 54.1% 0.120 0.604 0.604
Table 9. Results on the average values of Test Pass Rates and communication rates with different percentages of content removed in the problem descriptions. Due to the randomness involved, the experiment was run 5 times and metrics used in (Ouyang et al., 2023) were reported.
% Removed Temperature Category Mean Variance Max Diff Ratio of Worst
50% 1 Test Pass Rate 0.496 0.088 0.532 0.390
50% 1 Communication Rate 0.018 0.011 0.067 0.067
50% 0 Test Pass Rate 0.407 0 0 0
50% 0 Communication Rate 0.037 0 0 0
Table 10. Results on the average values of test pass rates and communication rates with different temperatures as the hyperparameter of LLM in HumanEvalComm.

5.4. Implications and Future Work

Based on the results and analysis, we summarized the following implications as suggestions for future work:

  • Go beyond generative model. In the evaluation, we found that for Incompleteness category, some Code LLMs have extremely low results, potentially due to their generative nature that prefers to generate and complete code based on a statistical model, even when the description is obviously incomplete. Future work should go beyond the generative nature of LLMs to “AI agent” or “AI assistants” (Hassan et al., 2024) to further enhance communication capabilities.

  • Stronger reasoning capability. In the evaluation, Inconsistency category has the lowest communication rate among the three types. This indicates that more future work is needed to develop models with stronger reasoning capability to address the low performance in Inconsistency category.

  • Better ability to know when to stop asking for LLM agent. Although the LLM agent approach, Okanagan, showed promising initial results in improving the metrics in the evaluation, one limitation of Okanagan is that it reduced the pass rates in the original HumanEval benchmark, due to asking unnecessary questions. Therefore, one future work in LLM agent is to address this shortcoming. This will potentially lead to much stronger communication capability as the model will know intelligently when to stop asking.

  • Stronger ability to obtain information especially for increased clarification difficulty. In the experiments of two clarification types and investigation of different incomplete modifications (1p), we noticed reduced Pass@1 and Test Pass Rates as the difficulty increases in the description, indicating a bottleneck in failing to fetch needed information for solving the coding tasks. Future work is needed in both model and evaluation setup to increase the model’s ability to get necessary information in these challenging situations.

6. Threats to Validity

Construct validity. This threat relates to the potential incorrectness in manual modifications of problems in HumanEvalComm. To mitigate this threat, we have manually checked and verified all of the problems more than three times, and each time they discussed the problems they didn’t reach a consensus. Although we have tried our best efforts, there still may be some corner cases where the modified problems do not match the definition of Ambiguity and Inconsistency.

Internal validity. This threat relates to the internal parameters such as the parameters in open-source Code LLMs and ChatGPT that could potentially affect the results. To mitigate this threat, we use most of the default parameters when running open-source Code LLMs and ChatGPT. For open-source models, we set max_new_tokens as 512 to save computing resources and used default values for other parameters. For ChatGPT, we used temperature as 1.0 and n𝑛nitalic_n as 1 in the OpenAI API. Another threat relates to the effectiveness of the LLM-based evaluator used in the evaluation. As mentioned previously, to mitigate this issue, we have optimized the prompt for the LLM-based evaluator several times and checked the results manually. Besides, the LLM-based evaluator is used equally for all models in the evaluation, so this threat does not affect the relative ranking of the results for all models.

External validity. This relates to the generality of the communication capability of the models on other benchmarks. To mitigate this issue, we extensively report and analyze the results with statistical testing that reports p-value. To reduce the risk introduced by randomness in our investigation, we also added metrics such as mean, variance, max diff in the results. Thus, these results can be potentially adapted for other datasets. However, since we have not tested this, we cannot make a sound claim regarding the communication capability of the models on another dataset. Another threat is related to the implementation of evaluated models. We directly call OpenAI API to get ChatGPT results. We implemented Okanagan in Python that calls OpenAI API. For CodeLlama and other open-source models, we downloaded the model from HuggingFace and perform model inference on UBC ARC Sockeye. From the evaluation results, we believe that our implementation reflects the original methods. To ensure the reproducibility of the evaluation results, we report the result of the case study extensively and release our complete code and dataset. This can allow other researchers to reproduce and extend our experiments in the case study.

7. Related Work

Code Generation with Large Language Models. In recent years, the field of code generation has seen a significant shift with the large language models. For example, Codex (Chen et al., 2021), fine-tuned on GPT-3 (Brown et al., 2020) on a large corpus of source code data, is capable of generating code for 47/164 problems in the HumanEval dataset in a single run, a benchmark for code generation task. Codex became the core model for the Copilot (Ziegler et al., 2022), an AI-powered coding assistant developed by GitHub. After Codex, a couple of models similar to Codex but with smaller size were then developed, including GPT-J (Wang and Komatsuzaki, 2021), CodeParrot (Face, 2023), PolyCoder (Xu et al., 2022). AlphaCode (Li et al., 2022), with size comparable to Codex, was trained on Github data and fine-tuned on competition-level programming problems. It exceeded half of the competitors in coding competitions of CodeForces, a well-known online competitive programming platform. CodeGen (Nijkamp et al., 2022) was trained on both natural language and programming language data for code generations with multi-turn prompts. Recently, newer models such as CodeLlama (Roziere et al., 2023), DeepSeek Coder (Guo et al., 2024) and CodeQwen1.5 Chat (Bai et al., 2023) continued to achieve higher performance in benchmark such as HumanEval. However, the level of communication skills of these models is not emphasized and evaluated. These models are evaluated by generating code in one or multiple attempts from one-off problem descriptions, without further information from conversations. Therefore, when the input problem description is error-prone or incomplete, the model still has to generate the code without the chance to clarify critical questions. Our work serves as an exploration to address this usability problem.

Self-Correct LLMs and LLM Agent in AI. Recently, a promising approach to improve the output efficiency of large language models is self-correction (Pan et al., 2023). In the self-correction approach, the LLM uses the feedback guided or prompted by itself to refine its results. One popular category of work uses human feedback to refine their results directly (Kreutzer et al., 2018; Glaese et al., 2022; Ouyang et al., 2022; Scheurer et al., 2023; Fernandes et al., 2023). Other studies employed different strategies to self-correct LLMs using automated feedback such as self-training (Huang et al., 2022; Bai et al., 2022), generate-then-rank (He et al., 2023; Weng et al., 2023), feedback-guided decoding (Yang et al., 2022; Xie et al., 2023), iterative post-hoc revision (Zhang et al., 2023b; Jiang et al., 2023), etc. Furthermore, the advances in LLM have also brought much progress in LLM-based agents, with different modules including Planning, Memory, Profile, and Action (Xi et al., 2023; Wang et al., 2023b), and various agent categories, such as Tool Agent, Simulation Agent, Web Agent, Game Agent, etc. Our work also includes the evaluation of the LLM agent approach, Okanagan, which has an additional round with reflection as the thinking pattern.

LLM Agent for Code Generation. Although still at an early stage, recently, there has been a rising stream of research efforts to employ LLM agents for the task of code generation. RepairAgent (Bouzenia et al., 2024) is the first to use an LLM-based agent for program repair and code generation in the field of software engineering. This work follows the previous work in augmenting LLMs with API tools. Recently, CoRE (Xu et al., 2024) has been proposed as a system that enables agent programming by using LLM as interpreters to process and execute natural language instructions. Following a similar spirit of LLM agent, TICODER (Fakhoury et al., 2024) is proposed as a test-driven interactive workflow for more accurate code generation. Similarly, De-Hallucinator (Eghbali and Pradel, 2024) is proposed as a code completion method that combines retrieval-based code generation and iterative querying of the model. Different from the above works, the proposed Okanagan in our work focuses on enhancing the communication capabilities of LLM for code generation tasks. To the best of our knowledge, we are the first to study and compare the communication capabilities of LLM agent and Code LLMs in code generation tasks.

8. Conclusions

In this paper, we showed an initial step in the empirical study of the communication skills of LLMs in evaluating code clarification and code generation. We argue that the proficiency of communication skills of LLMs is necessary for AI systems to generate code with high standards, and, in the long term, to ask questions to acquire information that is just enough to complete their tasks. We believe that elevated communication skills should be viewed as an important factor in bridging the gap between LLMs and top-notch software developers. Although it needs additional conversational inputs, we believe it is still necessary and worthwhile to evaluate this communication capability for coding tasks.

As a first step toward this effort, we created HumanEvalComm to evaluate the degree of communication skills. Based on the new benchmark, we comprehensively evaluated different Code LLMs with the communication lens, where certain information is manually modified in the original problem description. Furthermore, we proposed an LLM-based agent approach, Okanagan, to identify and ask questions in ambiguous parts of code and descriptions for further refining the generated code. We found that modifying the problem description greatly reduced Test Pass Rates and Pass@1 with statistical significance. In terms of communication skills, more than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified. We also find that, compared with LLM such as ChatGPT 3.5, Okanagan, as a LLM agent approach, can effectively increase Communication Rate and Good Question Rate, and thus boost Test Pass Rate and Pass@1 when the problem descriptions are modified based on a clarification type.

Besides benchmarks, techniques to further improve the communication skills of LLMs can be the next steps in future work. Another interesting angle is to study how to tune the model to switch between under-communicating, effective-communicating, and over-communicating. We envision that different AI programming agents in the future will have various levels and styles of communication ability. This work can be seen as the first step toward evaluating the communication skills of Code LLMs and LLM agents. Our benchmark and replication package are made public at https://github.com/jie-jw-wu/human-eval-comm.

References

  • (1)
  • big (2024) Hugging Face Accessed 2024. Big Code Models Leaderboard. Hugging Face. https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard Accessed on April 29, 2024.
  • Alon et al. (2019) U. Alon, M. Zilberstein, O. Levy, and E. Yahav. 2019. Code2Vec: Learning Distributed Representations of Code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
  • Arvidsson and Axell (2023) Simon Arvidsson and Johan Axell. 2023. Prompt engineering guidelines for LLMs in Requirements Engineering. (2023).
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. CoRR abs/2212.08073 (2022).
  • Bi et al. (2024) Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024).
  • Borji (2023) Ali Borji. 2023. A categorical archive of chatgpt failures. arXiv preprint arXiv:2302.03494 (2023).
  • Bouzenia et al. (2024) Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv preprint arXiv:2403.17134 (2024).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Bui et al. (2021) N. D. Bui, Y. Yu, and L. Jiang. 2021. InferCode: Self-supervised Learning of Code Representations by Predicting Subtrees. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 1186–1197.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Dermeval et al. (2016) Diego Dermeval, Jéssyka Vilela, Ig Ibert Bittencourt, Jaelson Castro, Seiji Isotani, Patrick Brito, and Alan Silva. 2016. Applications of ontologies in requirements engineering: a systematic review of the literature. Requirements engineering 21 (2016), 405–437.
  • Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023).
  • Eghbali and Pradel (2024) Aryaz Eghbali and Michael Pradel. 2024. De-Hallucinator: Iterative Grounding for LLM-Based Code Completion. arXiv preprint arXiv:2401.01701 (2024).
  • Face (2023) Hugging Face. 2023. Codeparrot. https://huggingface.co/codeparrot/codeparrot
  • Fakhoury et al. (2024) Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation. arXiv preprint arXiv:2404.10100 (2024).
  • Fan et al. (2023) Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533 (2023).
  • Feng et al. (2020) Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, and D. Jiang et al. 2020. CodeBERT: A Pre-trained Model for Programming and Natural Languages. arXiv preprint arXiv:2002.08155 (2020).
  • Fernandes et al. (2023) Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. de Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, and André F. T. Martins. 2023. Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. CoRR abs/2305.00955 (2023).
  • Gao et al. (2024) Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. 2024. Llm-based nlg evaluation: Current status and challenges. arXiv preprint arXiv:2402.01383 (2024).
  • Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin J. Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Sona Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving Alignment of Dialogue Agents via Targeted Human Judgements. CoRR abs/2209.14375 (2022).
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
  • Hassan et al. (2024) Ahmed E Hassan, Gustavo A Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, et al. 2024. Rethinking Software Engineering in the Foundation Model Era: From Task-Driven AI Copilots to Goal-Driven AI Pair Programmers. arXiv preprint arXiv:2404.10225 (2024).
  • He et al. (2023) Hangfeng He, Hongming Zhang, and Dan Roth. 2023. Rethinking with Retrieval: Faithful Large Language Model Inference. CoRR abs/2301.00303 (2023).
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021).
  • Huang et al. (2022) Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large Language Models Can Self-Improve. CoRR abs/2210.11610 (2022).
  • Jazayeri (2004) Mehdi Jazayeri. 2004. The education of a software engineer. In Proceedings. 19th International Conference on Automated Software Engineering, 2004. IEEE, xviii–xxvii.
  • Jiang et al. (2023) Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code Evolution Framework via Large Language Models. CoRR abs/2306.02907 (2023).
  • Jimenez et al. (2024) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66
  • Kim et al. (2023) Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han, Jiyoon Lee, and Pilsung Kang. 2023. Which is better? exploring prompting strategy for llm-based metrics. arXiv preprint arXiv:2311.03754 (2023).
  • Kocmi and Federmann (2023) Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520 (2023).
  • Kotonya et al. (2023) Neema Kotonya, Saran Krishnasamy, Joel Tetreault, and Alejandro Jaimes. 2023. Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task. arXiv preprint arXiv:2311.00686 (2023).
  • Kreutzer et al. (2018) Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. 2018. Can Neural Machine Translation Be Improved with User Feedback?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  • Liang et al. (2023) Jenny T Liang, Chenyang Yang, and Brad A Myers. 2023. Understanding the Usability of AI Programming Assistants. arXiv preprint arXiv:2303.17125 (2023).
  • Liu et al. (2023a) Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, and Hinrich Schütze. 2023a. Evaluate What You Can’t Evaluate: Unassessable Generated Responses Quality. arXiv preprint arXiv:2305.14658 (2023).
  • Liu et al. (2023b) Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D Le, and David Lo. 2023b. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. arXiv preprint arXiv:2307.12596 (2023).
  • Liu et al. (2023c) Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023c. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. arXiv preprint arXiv:2308.04838 (2023).
  • McChesney and Gallagher (2004) Ian R McChesney and Seamus Gallagher. 2004. Communication and co-ordination practices in software engineering projects. Information and Software Technology 46, 7 (2004), 473–489.
  • Min et al. (2023) Marcus J Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, and Baishakhi Ray. 2023. Beyond accuracy: Evaluating self-consistency of code large language models with identitychain. arXiv preprint arXiv:2310.14053 (2023).
  • Mistrík et al. (2010) Ivan Mistrík, John Grundy, Andre Van der Hoek, and Jim Whitehead. 2010. Collaborative software engineering: challenges and prospects. Springer.
  • Nguyen and Nadi (2022) N. Nguyen and S. Nadi. 2022. An Empirical Evaluation of GitHub Copilot’s Code Suggestions. In Proceedings of the 19th International Conference on Mining Software Repositories. 1–5.
  • Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS).
  • Ouyang et al. (2023) Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023).
  • Pan et al. (2023) Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2023. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188 (2023).
  • Pressman (2005) Roger S Pressman. 2005. Software engineering: a practitioner’s approach. Palgrave macmillan.
  • Rabinovich et al. (2017) M. Rabinovich, M. Stern, and D. Klein. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. arXiv preprint arXiv:1704.07535 (2017).
  • Rasheed et al. (2024) Zeeshan Rasheed, Muhammad Waseem, Mika Saari, Kari Systä, and Pekka Abrahamsson. 2024. Codepori: Large scale model for autonomous software development by using multi-agents. arXiv preprint arXiv:2402.01411 (2024).
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Sarkar et al. (2022) Advait Sarkar, Andrew D Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, and Ben Zorn. 2022. What is it like to program with artificial intelligence? arXiv preprint arXiv:2208.06213 (2022).
  • Scheurer et al. (2023) Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2023. Training Language Models with Language Feedback at Scale. CoRR abs/2303.16755 (2023).
  • Shanahan (2024) Murray Shanahan. 2024. Talking about large language models. Commun. ACM 67, 2 (2024), 68–79.
  • Siddiq et al. (2022) M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. Santos. 2022. An Empirical Study of Code Smells in Transformer-Based Code Generation Techniques. In 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM). 71–82.
  • Sobania et al. (2022) D. Sobania, M. Briesch, and F. Rothlauf. 2022. Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference. 1019–1027.
  • Svyatkovskiy et al. (2020) A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan. 2020. Intellicode Compose: Code Generation Using Transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1433–1443.
  • Tufano et al. (2020) M. Tufano, D. Drain, A. Svyatkovskiy, S. Deng, and N. Sundaresan. 2020. Unit Test Case Generation with Transformers and Focal Context. arXiv preprint arXiv: Software Engineering (2020).
  • Tukur et al. (2021) Muhammad Tukur, Sani Umar, and Jameleddine Hassine. 2021. Requirement engineering challenges: A systematic mapping study on the academic and the industrial perspective. Arabian Journal for Science and Engineering 46 (2021), 3723–3748.
  • Vaithilingam et al. (2022) P. Vaithilingam, T. Zhang, and E. L. Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems, Vol. 30.
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
  • Wang et al. (2023a) Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023a. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048 (2023).
  • Wang et al. (2023b) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023b. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
  • Wang et al. (2021) Y. Wang, W. Wang, S. Joty, and S. C. Hoi. 2021. CodeT5: Identifier-Aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. arXiv preprint arXiv:2109.00859 (2021).
  • Weng et al. (2023) Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Kang Liu, and Jun Zhao. 2023. Large Language Models Are Better Reasoners with Self-Verification. CoRR abs/2212.09561 (2023).
  • Whitehead (2007) Jim Whitehead. 2007. Collaboration in software engineering: A roadmap. In Future of Software Engineering (FOSE’07). IEEE, 214–225.
  • Wu (2024) Scott Wu. Accessed 2024. Introducing Devin, the first AI software engineer. Cognition. https://www.cognition-labs.com/introducing-devin Accessed on April 8, 2024.
  • Xi et al. (2023) Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864 (2023).
  • Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, MinYen Kan, Junxian He, and Qizhe Xie. 2023. Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. CoRR abs/2305.00633 (2023).
  • Xu et al. (2022) Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1–10.
  • Xu et al. (2024) Shuyuan Xu, Zelong Li, Kai Mei, and Yongfeng Zhang. 2024. CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents. arXiv preprint arXiv:2405.06907 (2024).
  • Yang et al. (2022) Kaiyu Yang, Jia Deng, and Danqi Chen. 2022. Generating Natural Language Proofs with Verifier-Guided Search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 89–105.
  • Ye et al. (2020) W. Ye, R. Xie, J. Zhang, T. Hu, X. Wang, and S. Zhang. 2020. Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning. In Proceedings of The Web Conference 2020. 2309–2319.
  • Yin et al. (2018) Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories. 476–486.
  • Zan et al. (2023) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023. Large language models meet nl2code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7443–7464.
  • Zhang et al. (2023c) Jintian Zhang, Xin Xu, and Shumin Deng. 2023c. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124 (2023).
  • Zhang et al. (2023b) Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023b. Self-Edit: Fault-Aware Code Editor for Code Generation. CoRR abs/2305.04087 (2023).
  • Zhang et al. (2023a) Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023a. Unifying the perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989 (2023).
  • Ziegler et al. (2022) Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29.