INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness

Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo
Salesforce Research
Corresponding author: hungle@salesforce.com

Abstract

Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic analyzes the given task and the corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools like web search and code interpreter. We engage the dual critic system in both the code generation stage and the code execution stage, providing preemptive and post-hoc guidance respectively to LLMs. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output code ($+10\%$ absolute improvements in all models).

1 Introduction

Extending from the natural language domain, Large Language Models (LLMs) (Koubaa, 2023; Wang and Komatsuzaki, 2021; Radford et al., 2019) have demonstrated great potential in code generation tasks (Svyatkovskiy et al., 2020; Chen et al., 2021; Hendrycks et al., 2021). However, when instructed with tasks containing malicious intentions or ambiguous requirements, LLMs are prone to generating code that could facilitate harmful attacks or code that contains obscure security problems (Khoury et al., 2023; Bhatt et al., 2023; Siddiq et al., 2022). For instance, in a study of GitHub’s Copilot, Pearce et al. (2022) observed that about $40\%$ of generated programs are vulnerable.

Despite previous efforts in addressing the safety of LLMs through finetuning (Bai et al., 2022; Korbak et al., 2023; Dai et al., 2024), this strategy alone is often neither sufficient nor scalable enough against prompts that are increasingly optimised for highly sophisticated attacks (Zhuo et al., 2023; Mazeika et al., 2024; Bhatt et al., 2024). Furthermore, in the domain of code generation, creating quality safety-related data for finetuning often incurs great costs, involving programming experts with a deep understanding of code and related cyber-security and vulnerability concerns.

Note that code itself is often not inherently malicious. For example, as noted by Bhatt et al. (2023), a program for an encryption method could be very useful to create a secure personal file system. Yet the encryption method can also be exploited for a ransomware attack. Therefore, it is important to develop an efficient method for LLMs to achieve the intricate balance between helpfulness and safety in the code domain. We introduce INDICT, Internal Dialogues of Critiques, a novel framework for LLMs to generate code that is not only helpful but also safe and secure (see Figure 1 for an example code generation task and Figure 2 for the method overview).

Figure 1: INDICT (Internal Dialogues of Critiques) enables two different critics to interact with each other autonomously and collaboratively, improving code generation by both security and helpfulness. In this example, INDICT iteratively resolves the security weakness CWE-78 (Improper Neutralization in an OS Command) and improves the code functionality with relevant supporting modules.

First, instead of a single critic for a specific code quality (Le et al., 2022; Welleck et al., 2023; Chen et al., 2023c), we consider both helpfulness-driven critic and safety-driven critic. Instead of activating these critics independently, we propose to position them in an autonomous agent system like (Huang et al., 2022; Dong et al., 2023; Li et al., 2024). Although the critics are configured with orthogonal goals, we let them interact with each other autonomously to collaboratively and simultaneously optimise both security and correctness of LLM-generated responses.

Extending from retrieval-augmented generation (Guu et al., 2020; Ram et al., 2023; Asai et al., 2024), we also equip the critics with external knowledge retrieved by relevant code snippets and natural language queries. Just like how human developers typically select and examine one small piece of code at a time, the critics use a code snippet together with a text query to call relevant tools like web search and code interpreters. The resulting outputs from the external tools are used by the critics to generate more knowledge-grounded critiques for the “actor” LLM generator.

Finally, we engage our critics during two stages: (1) preemptive critic feedback is obtained during the initial code generation stage; and (2) post-hoc critic feedback is activated after the code is observed in an execution environment. Although post-hoc feedback is more commonly used in prior work (Li et al., 2022; Chen et al., 2023c; Le et al., 2024), it alone is not proactive enough for security-sensitive tasks. In these tasks, unexpected damage may occur and create systematic impacts on execution environments in practice (Hendrycks et al., 2023; Mazeika et al., 2024). Our strategy facilitates a “preemptive” layer of protection, creating a more holistic critic framework for code generation LLMs.

We conducted a comprehensive evaluation of INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks. On LLMs ranging from 7B to 70B parameters, we observed consistent performance improvement by both safety and helpfulness of generation outputs. We found that INDICT can provide useful critiques to LLMs, leading to new SoTA performance by security measures while maintaining or improving the helpfulness of generated code. Our approach also generalises well to open-ended tasks, demonstrating the broader potential of a cooperative autonomous critic system for helpful yet responsible AI models. Code release: https://github.com/SalesforceAIResearch/indict_code_gen.

2 Related Work

Our research is broadly related to the research of large language models (LLMs) (Koubaa, 2023; Team et al., 2023; Brown et al., 2020; Radford et al., 2019; Touvron et al., 2023a). Pretrained on a massive amount of text data on very deep Transformer-based architectures, these models have shown impressive performance in many natural language tasks. Going beyond the text domain, LLMs have been extended to learn from the code data and applied to many coding tasks (Rozière et al., 2023; Li et al., 2023; Lozhkov et al., 2024; Gunasekar et al., 2023; Wang et al., 2023; Nijkamp et al., 2023; Luo et al., 2023). One major application of LLMs in the code domain is code generation, a long-standing challenge of many conventional AI models (Manna and Waldinger, 1971; Gulwani et al., 2012; Kurach et al., 2015; Devlin et al., 2017; Parisotto et al., 2016). In this task, an AI model is required to generate proper code solutions for different programming problems, ranging from basic daily code completion tasks to more advanced algorithmic problems (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021; Shinn et al., 2023; Lai et al., 2023).

In the research on code generation, we have witnessed emerging studies focusing on the security and safety aspects of AI-generated code. Pearce et al. (2021); Schuster et al. (2021); Pearce et al. (2022) found that commercially successful systems like GitHub’s Copilot still led to obscure yet major vulnerability and security issues in code. More recently, Perez et al. (2022); Zhuo et al. (2023); Khoury et al. (2023) demonstrated highly complex prompting methods that can “jailbreak” advanced LLMs like ChatGPT into generating malicious code. To benchmark LLMs against code safety and security, Siddiq and Santos (2022); Tony et al. (2023) evaluated LLMs against common coding scenarios based on CWE (Common Weakness Enumeration, a community-developed list of common software and hardware weaknesses; more details at https://cwe.mitre.org/about/index.html). More recently, Bhatt et al. (2023, 2024) introduced CyberSecEval, a large-scale benchmark containing different types of security-aware evaluations. They observed that the code outputs by powerful LLMs like Llama and GPT models are often not perfectly secure.

More relevant to our work is the research to improve the safety or helpfulness of LLMs. A common strategy is finetuning LLMs with appropriate preference data with specific reward models to differentiate among ranked data samples (Bai et al., 2022; Korbak et al., 2023; Wu et al., 2024; Sun et al., 2024; Dai et al., 2024). In the code domain, He and Vechev (2023); He et al. (2024) proposed to finetune LLMs with prompt prefixes or masking strategies conditioned by the safety of corresponding code samples. Chen et al. (2023a) require human annotators to provide natural language feedback on training samples. Different from prior approaches, we propose a more efficient method to generate code that is better in both safety and helpfulness. Our approach can complement the research of autonomous LLM agents (Huang et al., 2022; Yao et al., 2023; Dong et al., 2023) and AI-generated feedback (Bahdanau et al., 2017; Le et al., 2022; Welleck et al., 2023; Gou et al., 2024).

3 INDICT Framework

3.1 Problem Definition

Typically, in a code generation task, an LLM $\theta$ receives an input $X$, consisting of a natural language instruction and, optionally, a related code context. Treating code generation as a sequence-to-sequence task, the LLM autoregressively generates a response $\hat{Y}$ as a sequence of tokens. Each token $\hat{y}_{t}$ is sampled from the parameterized conditional distribution $p_{\theta}(.|\hat{y}_{1:t-1},X)$ where $\hat{y}_{t}\in\mathcal{V}$. The output can contain both natural language segments (e.g. explanation of the output code or refusal of the user request) and code programs (e.g. code snippets to complete the given input code context).

3.2 Safety-driven and Helpfulness-driven Critics

Pretrained with a massive amount of data, LLMs are found to be capable of providing insightful feedback to self-improve their own responses in many downstream tasks (Shinn et al., 2023; Zhang et al., 2023; Welleck et al., 2023; Madaan et al., 2023). Rather than just a single critic for a specific code attribute, we propose to engage two critics with independent goals: a safety-driven critic $\sigma$ and a helpfulness-driven critic $\omega$ . We initialize the critics as LLMs configured by specific system prompts ( $P_{s}$ and $P_{h}$ respectively) to establish the critics’ corresponding roles.

For instance, for the safety-driven critic, we instruct the model to focus solely on the security and risks of the code, and prioritise these aspects over other code qualities. Conversely, for the helpfulness-driven critic, we request the model to investigate the helpfulness of the code, i.e. whether the output aligns fully with the intentions and requirements in the given task. Denoting $\hat{C}_{s}$ and $\hat{C}_{h}$ as the complete outputs of the critics, we can define the critic output distributions (per token) as:

	$\hat{c}_{s,t}\sim p_{\sigma}(.|\hat{c}_{s,1:t-1},X,\hat{Y},P_{s})$	$\Rightarrow$ for safety-driven critic	(1)
	$\hat{c}_{h,t}\sim p_{\omega}(.|\hat{c}_{h,1:t-1},X,\hat{Y},P_{h})$	$\Rightarrow$ for helpfulness-driven critic	(2)

Subsequently, we let the code generation LLM (“actor”) revise its solutions conditioned by the generated critiques: $\hat{y}_{s,t}\sim p_{\theta}(.|\hat{y}_{s,1:t-1},X,\hat{Y},\hat{C}_{s})$ for safety-conditioned solutions and $\hat{y}_{h,t}\sim p_{\theta}(.|\hat{y}_{h,1:t-1},X,\hat{Y},\hat{C}_{h})$ for helpfulness-conditioned solutions. Refer to Appendix F for the detailed instruction prompts we used on our critics to assess safety or helpfulness of model outputs.
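The critic and actor steps in this subsection can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the `LLM` type, the prompt templates, and the placeholder system prompts `P_S` and `P_H` are assumptions made here for readability, and the actual prompts are the ones given in Appendix F.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]

# Placeholder system prompts (assumed wording, not the prompts from Appendix F).
P_S = "You are a safety critic. Focus solely on security risks in the code."
P_H = "You are a helpfulness critic. Check whether the code meets the task requirements."


def critique(critic: LLM, system_prompt: str, task: str, response: str) -> str:
    """Sample a critique conditioned on the task X, the response Y_hat and a role prompt P."""
    prompt = f"{system_prompt}\n\nTask:\n{task}\n\nResponse:\n{response}\n\nCritique:"
    return critic(prompt)


def revise(actor: LLM, task: str, response: str, critique_text: str) -> str:
    """Ask the actor to regenerate its solution conditioned on X, Y_hat and a critique C_hat."""
    prompt = (
        f"Task:\n{task}\n\nPrevious response:\n{response}\n\n"
        f"Critique:\n{critique_text}\n\nRevised response:"
    )
    return actor(prompt)
```

Passing the same model checkpoint as both `critic` and `actor` would match the setup where actors and critics share one base model, while the two role prompts keep the critics' goals separate.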

3.3 Autonomous Collaboration between Critics

LLMs are often finetuned to follow natural language instructions (Ouyang et al., 2022; Korbak et al., 2023; Dai et al., 2024) and subsequently, can engage in natural language interactions with humans or even among other LLMs. In the latter, recent studies (Huang et al., 2022; Dong et al., 2023; Li et al., 2024; Chen et al., 2024) observed significant performance gains when enabling LLMs to interact autonomously to solve complex tasks. We are motivated by this observation and propose an autonomous agent system of critic models to generate helpfulness-and-safety-aware critiques.

Note that an alternative strategy is to use a single critic model for both helpfulness and safety. However, such a critic model often needs complex alignment finetuning or prompt engineering to generate critiques that are not significantly biased towards a single code property. In our approach, from Eqs. (1) and (2), given an interaction turn $r$ between critics, we can redefine the output distributions as:

	$\hat{c}^{r}_{s,t}\sim p_{\sigma}(.|\hat{c}_{s,1:t-1},X,\hat{Y},P_{s},\hat{I}_{1:r-1})$	$\Rightarrow$ for safety-driven critic	(3)
	$\hat{c}^{r}_{h,t}\sim p_{\omega}(.|\hat{c}_{h,1:t-1},X,\hat{Y},P_{h},\hat{I}_{1:r-1}\oplus\hat{C}^{r}_{s})$	$\Rightarrow$ for helpfulness-driven critic	(4)

where $\oplus$ denotes concatenation and $\hat{I}_{1:r-1}=\hat{C}^{1}_{s}\oplus\hat{C}^{1}_{h}\oplus\ldots\oplus\hat{C}^{r-1}_{s}\oplus\hat{C}^{r-1}_{h}$ contains all the past interactions between the safety-driven and helpfulness-driven critics.

Practically, to avoid computation overhead, we can limit $\hat{I}$ to only the last few turns of interactions. Alternatively, in this work, we summarize the critic dialogue after each turn of interactions and only use the corresponding summary in each turn: $\hat{\mathcal{I}}_{r}=f(\hat{I}_{1:r})$ where $f(.)$ is parameterized as an LLM-based summarizer model. To revise the solutions from “actor” LLM by both safety and helpfulness, we can then conveniently reuse the summary in the last interaction turn $R$ between the critics (thus, also reducing the computation cost on the “actor” LLM). To generate safety-and-helpfulness-aware outputs, we revise the output distributions of the LLM code generator as:

	$\hat{y}_{s+h,t}\sim p_{\theta}(.|\hat{y}_{s+h,1:t-1},X,\hat{Y},\hat{\mathcal{I}}_{R})$	(5)
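The turn-by-turn interaction and the summary-based revision described above might be sketched as follows. The `LLM` type, the prompt wording, and the function names are hypothetical. The summarizer here is also simplified to update the previous summary incrementally with the newest pair of critiques, which is one possible way to realise $f(.)$ and is an assumption rather than the paper's stated procedure.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


def critic_dialogue(
    safety_critic: LLM,
    helpful_critic: LLM,
    summarizer: LLM,
    task: str,
    response: str,
    num_turns: int = 2,
) -> str:
    """Run R turns of critic interaction and return the final dialogue summary."""
    summary = ""  # summary of earlier turns; empty before the first turn
    for _ in range(num_turns):
        shared = f"Task:\n{task}\n\nResponse:\n{response}\n\nDialogue so far:\n{summary}"
        # Safety critic sees the past interactions (cf. Eq. 3).
        c_s = safety_critic(shared)
        # Helpfulness critic additionally sees this turn's safety critique (cf. Eq. 4).
        c_h = helpful_critic(f"{shared}\n\nSafety critique this turn:\n{c_s}")
        # Compress the dialogue so the context does not grow with the number of turns.
        summary = summarizer(
            f"Previous summary:\n{summary}\n\nSafety critique:\n{c_s}\n\n"
            f"Helpfulness critique:\n{c_h}"
        )
    return summary


def revise_with_summary(actor: LLM, task: str, response: str, summary: str) -> str:
    """Regenerate the solution conditioned on the final summary (cf. Eq. 5)."""
    return actor(
        f"Task:\n{task}\n\nPrevious response:\n{response}\n\n"
        f"Critic summary:\n{summary}\n\nRevised response:"
    )
```

Because the actor only receives the final summary, its prompt length stays roughly constant regardless of how many turns the critics exchange.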

3.4 Knowledge-grounded Critics with External Tools

Depending on how well LLMs can perceive and resurface relevant knowledge from pretraining data, these models might still cause serious hallucination problems by generating factually incorrect responses (McKenna et al., 2023; Rawte et al., 2023; Xu et al., 2024). These hallucination problems are exacerbated when LLMs play the critic roles, required to provide reliable and grounded responses against code generation outputs. In this work, we extend prior tool-enhanced LLMs like (Yao et al., 2023; Peng et al., 2023; Lu et al., 2024) and retrieval-augmented generation strategies (Guu et al., 2020; Ram et al., 2023; Asai et al., 2024) to improve our critics.

Specifically, we equip our critics with access to external tools and incorporate the tools’ query results as additional knowledge to generate more grounded critiques (see Figure 3 for an overview). For instance, for the safety-driven critic, from Eq. (3), we decompose the critic generation process into the following steps:

1. Critic’s thought $\hat{W}^{r}_{s}$:	$\hat{w}^{r}_{s,t}\sim p_{\sigma}(.|\hat{w}^{r}_{s,1:t-1},X,\hat{Y},P_{s},\hat{\mathcal{I}}_{r-1})$	(6)
2. Critic’s action $\hat{Q}^{r}_{s}$:	$\hat{Q}^{r}_{s}\sim p_{\sigma}(\langle\hat{Q}^{r}_{s,\text{text}},\{\emptyset,\hat{Q}^{r}_{s,\text{code}}\}\rangle|\hat{Y},P_{s},\hat{W}^{r}_{s})$	(7)
3. Critic’s observation $\hat{O}^{r}_{s}$:	$\hat{O}^{r}_{s}=g(\hat{Q}^{r}_{s})$	(8)

First, we obtain the critic’s initial thought $\hat{W}^{r}_{s}$, following the same formulation as in Eq. (3). In the critic’s action step, we parameterize critic “actions” as the generation of unique textual keywords $\hat{Q}^{r}_{s,\text{text}}$, optionally accompanied by code snippets $\hat{Q}^{r}_{s,\text{code}}$. These are used subsequently as search queries to call external tools and obtain search results in the critic’s observation step. Denoting function $g(.)$ as the tool calling functions, we introduce two types of functions: code search and code review. Refer to Figure 3 for the specifications and examples of these functions and Figure 1 for demonstrations.

Note that the above extension can be applied identically to the helpfulness-driven critic. We also then revise $\hat{\mathcal{I}}$ as the summary of all past critics’ initial thoughts concatenated with corresponding observations: $\hat{\mathcal{I}}_{r}=f(\{\hat{W}\oplus\hat{O}\}^{1:r-1}_{s}\oplus\{\hat{W}\oplus\hat{O}\}^{1:r-1}_{h})$.
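The thought, action, and observation steps above can be illustrated with a small Python sketch. Everything named here (`Query`, `call_tool`, `grounded_step`, and the two tool callables) is hypothetical. In particular, the rule that a query without a code snippet goes to code search and a query with a snippet goes to code review is an assumption made for the example, since the tool specifications are only given in Figure 3.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class Query:
    """A critic action: textual keywords, optionally paired with a code snippet."""
    text: str
    code: Optional[str] = None


def call_tool(
    query: Query,
    code_search: Callable[[str], str],
    code_review: Callable[[str, str], str],
) -> str:
    """Tool-calling function g(.); the routing rule below is an assumption."""
    if query.code is None:
        return code_search(query.text)
    return code_review(query.text, query.code)


def grounded_step(
    critic: LLM,
    propose_query: Callable[[str, str], Query],
    tool: Callable[[Query], str],
    task: str,
    response: str,
    summary: str,
) -> Tuple[str, str]:
    """One thought -> action -> observation step for a single critic."""
    # 1. Thought: initial critique given the task, response and dialogue summary.
    thought = critic(
        f"Task:\n{task}\n\nResponse:\n{response}\n\n"
        f"Dialogue summary:\n{summary}\n\nInitial thought:"
    )
    # 2. Action: turn the thought into a text query, optionally with a snippet.
    query = propose_query(thought, response)
    # 3. Observation: result returned by the external tool.
    observation = tool(query)
    return thought, observation
```

The returned thought and observation are the pieces that would be concatenated and summarized for the next turn.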

3.5 Preemptive and Post-hoc Critic Feedback

Different from the text domain, code generation outputs can additionally be observed or interpreted in relevant environments, e.g. through code interpreters (“executor”). Shi et al. (2022); Le et al. (2022); Chen et al. (2023b, c) demonstrated the benefits of execution-based feedback to improve the functional correctness of code. However, in security-sensitive scenarios, directly engaging the executing environment might cause unintentional systematic damage, e.g. deleted data directories or modified access to privileged user accounts.

We propose to deploy our critic system for both preemptive feedback (after the initial code generation step) and post-hoc feedback (after the generated code is observed by the executor). To obtain posthoc critic feedback, we simply incorporate the execution results (e.g. error messages, unit test outcomes) as the conditioning factors in Eqs. (1), (2), (3), (4), and (6). Note that we maintain a persistent dialogue context between safety and helpfulness critics throughout preemptive and post-hoc iterations. We can define the output distributions of the LLM code generator conditioned by the posthoc feedback as:

	$\hat{y}^{\text{posthoc}}_{s+h,t}\sim p_{\theta}(.|\hat{y}^{\text{posthoc}}_{s+h,1:t-1},X,\hat{Y}^{\text{preempt}}_{s+h},\hat{\mathcal{I}}^{\text{posthoc}}_{R})$	(9)

where $\hat{\mathcal{I}}^{\text{posthoc}}_{r}=f(\hat{\mathcal{I}}^{\text{preempt}}_{1:R}\oplus\hat{I}^{\text{posthoc}}_{1:r-1})$ is the summarized posthoc critic feedback.
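A rough sketch of the two-stage flow is given below, assuming the generated code is Python and that the whole critic system can be wrapped in a single hypothetical callable `critics(task, response, extra_context)` that returns a summary. The `execute` helper uses a plain child process purely for illustration; it provides no isolation, so a real deployment would need a sandboxed executor for the reasons discussed above.

```python
import subprocess
import sys
from typing import Callable

# Hypothetical interfaces used only in this sketch.
LLM = Callable[[str], str]
CriticSystem = Callable[[str, str, str], str]  # (task, response, extra context) -> summary


def execute(code: str, timeout_s: float = 5.0) -> str:
    """Run Python code in a child process and return a textual execution report.

    Illustration only: a child process is NOT a sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"timeout after {timeout_s}s"
    return f"exit_code={proc.returncode}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"


def two_stage_revision(actor: LLM, critics: CriticSystem, task: str, draft: str) -> str:
    """Preemptive revision before execution, then post-hoc revision after it."""
    # Stage 1 (preemptive): critics see only the task and the draft.
    pre_summary = critics(task, draft, "")
    y_pre = actor(
        f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritic summary:\n{pre_summary}\n\nRevised code:"
    )
    # Stage 2 (post-hoc): critics also see the earlier summary and the execution report.
    report = execute(y_pre)
    post_summary = critics(task, y_pre, f"{pre_summary}\n\nExecution report:\n{report}")
    return actor(
        f"Task:\n{task}\n\nDraft:\n{y_pre}\n\nCritic summary:\n{post_summary}\n\nRevised code:"
    )
```

Only the preemptively revised code reaches the executor, which is the ordering that gives the preemptive stage its protective role.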

4 Experiments

Base Language Models. We applied INDICT on CommandR (Cohere, 2024) which was specifically optimized for external tool augmentation, making the model suitable for our framework. In challenging adversarial tests like red-teaming attacks, we additionally employed popular preference-tuning models from the Llama and Codellama families (Touvron et al., 2023b; Rozière et al., 2023; Meta, 2024), ranging from 7B to 70B parameters. All models were designed for long-context tasks as well as conversational interactions, making them suitable for experiments with INDICT. To fairly compare the performance across models, given a model choice, we initialized our actors and critics with the same model checkpoint. For all base LLMs, we utilized the Huggingface-hosted model parameters (Wolf et al., 2019) and vLLM (Kwon et al., 2023) to generate the responses.

Configurations. To fairly compare between base models, given a task, we kept the instruction prompts as similar as possible across all models. Models such as CommandR (Cohere, 2024), which is already finetuned for tool enhancement, are prompted according to their own prompting strategies. We adopted a maximum output length of up to $2048$ tokens on actor or critic models. We also fixed the generation budget to $1$ sample in each generation by actor or critic models. For a given actor-generated sample, we applied our INDICT framework for up to $5$ rounds to improve this sample iteratively. Please refer to Appendix D and F for more detailed experimental setups e.g. external tools, model and generation configurations, compute resources, and example prompt instructions.
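As a compact restatement of these settings, the snippet below collects the three stated limits in a small config object and shows an outer refinement loop bounded by the round limit. The class and function names are hypothetical, and the early exit when an output stops changing is an assumption added for the example rather than a rule stated in the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GenerationConfig:
    max_output_tokens: int = 2048  # cap on actor or critic output length
    samples_per_call: int = 1      # generation budget per call
    max_rounds: int = 5            # upper bound on refinement rounds


def refine(
    draft: str,
    one_round: Callable[[str], str],
    config: GenerationConfig = GenerationConfig(),
) -> str:
    """Apply one refinement round repeatedly, at most `max_rounds` times."""
    current = draft
    for _ in range(config.max_rounds):
        revised = one_round(current)
        if revised == current:  # assumed early-stopping rule, not from the paper
            break
        current = revised
    return current
```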

4.1 Insecure coding practice tasks

Benchmarks. We first evaluated our approach on insecure code generation tasks in which LLMs were found to generate outputs with significant security concerns. We considered the Insecure Coding Practice test from CyberSecEval-1 (Bhatt et al., 2023), which includes two sub-tasks: “Autocomplete” where LLMs are provided a code context and predict subsequent code segments to complete this code context; and “Instruct” where LLMs fulfill natural language instructions of coding problems. Additionally, following an instruction-following setup, the CVS benchmark (Code Vulnerability and Security) (CyberNative, 2024) provides a pair of ground-truth secure and insecure code outputs given a coding problem. Please refer to Appendix D for more details of the benchmarks.

Evaluation. To measure the safety of model outputs, we followed Bhatt et al. (2023) by using their detector model which contains comprehensive rules defined in weggli (weg, 2023) and semgrep (sem, 2023) to detect more than 180 patterns related to 50 Common Weakness Enumerations (CWEs). The safety metric is defined as the percentage of test samples where output codes do not contain any insecurities. To measure the helpfulness, we followed prior work like Bai et al. (2022); Zheng et al. (2024); Li et al. (2024) to adopt GPT3.5 as the AI evaluator (Achiam et al., 2023) to rank the helpfulness of model outputs. In our experiments, given a test problem, we computed the winning rate of a model output against the output of a known SoTA model (e.g. Llama2-7b-chat in CyberSecEval-1) or the corresponding ground-truth outputs (for the CVS benchmark).
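The two metrics can be written down concretely as below. This is a sketch under assumptions: the detector and the AI evaluator are treated as already-run black boxes whose outputs arrive as lists, and counting a tie as half a win is a convention chosen here for illustration, since the treatment of ties is not specified in this section.

```python
from typing import List


def safety_rate(insecure_flags: List[bool]) -> float:
    """Percentage of samples in which the detector found no insecure pattern.

    `insecure_flags[i]` is True when sample i was flagged as insecure.
    """
    if not insecure_flags:
        raise ValueError("insecure_flags must be non-empty")
    secure = sum(1 for flagged in insecure_flags if not flagged)
    return 100.0 * secure / len(insecure_flags)


def win_rate(judgments: List[str]) -> float:
    """Percentage of pairwise comparisons won against the reference output.

    Each judgment is 'win', 'tie' or 'loss'; ties count as half a win
    (an assumption made for this sketch).
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(score[j] for j in judgments) / len(judgments)
```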

Results. From Figure 4, we observed consistent performance improvements of our approach, outperforming prior strong LLM baselines such as Llama and GPT models (Touvron et al., 2023b; Achiam et al., 2023). Specifically, by applying INDICT with CommandR and Llama3 models (Meta, 2024; Cohere, 2024), we obtained SoTA performance by safety (more than $80\%$ and $90\%$ output codes are safe on CyberSecEval-1 and CVS respectively) as well as helpfulness (up to $70\%$ output codes are more helpful than the prior SoTA model or ground-truth outputs). Figure 4 also demonstrates the consistency of our approach by both safety and helpfulness across different programming languages. There are only a few exceptional cases of helpfulness performance (specifically with JavaScript in the CyberSecEval benchmark and C++ in the CVS benchmark).

4.2 Security attack tasks

Benchmarks. We also evaluated our approach against malicious coding tasks in which the instruction prompts contain obscure yet dangerous intentions to perform security attacks. We considered three major tasks: the Cyberattack Helpfulness test from CyberSecEval-1 (Bhatt et al., 2023), and the Interpreter Abuse and Prompt Injection tests from CyberSecEval-2 (Bhatt et al., 2024). The first task contains test samples of attack methods that are well studied in the industry-standard MITRE ATT&CK ontology (https://attack.mitre.org/). The second task was proposed recently to instruct LLMs to abuse a code interpreter to carry out unauthorized actions, e.g. data overriding. Finally, the last task is designed to simulate injection attacks by synthetically injecting harmful rules into prompts, e.g. disclosing a given password in the generation output. Please refer to Appendix D for more details of the benchmarks.

Table 1: HarmBench safety results by red-teaming optimization method.

Model	Direct	ZS	PAP	JB	TAP	PAIR	Avg.
CommandR	33.1	23.4	25.0	23.1	18.4	18.4	23.6
CommandR+INDICT	65.3	52.5	63.1	37.5	46.9	43.4	51.5
Llama3-8b-instruct	77.5	63.4	67.8	83.1	60.6	58.1	68.4
Llama3-8b-instruct+INDICT	90.6	79.4	81.9	89.1	75.9	77.8	82.4
Llama3-70b-instruct	68.4	60.0	68.1	90.9	61.9	57.5	67.8
Llama3-70b-instruct+INDICT	85.9	75.3	74.7	90.0	75.9	75.3	79.5

Evaluation. In these tasks, we focused on measuring safety by determining whether the model outputs assist the given instructions, e.g. by suggesting supporting code snippets or by providing natural language explanation for a solution. Following Bhatt et al. (2023, 2024), we used GPT3.5 (Achiam et al., 2023) and adopted the expansion-then-judge evaluation pipeline: first, expand the generation output with reasoning against safety criteria, and subsequently, judge if the output is indeed benign. The safety metric is the percentage of outputs that are considered benign.
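The expansion-then-judge pipeline can be outlined as two chained calls to a judge model followed by a simple aggregate. The sketch below assumes a generic `LLM` callable and uses illustrative prompt wording and a one-word verdict format; none of these are the prompts or output formats used by CyberSecEval.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


def expand_then_judge(judge: LLM, model_output: str) -> bool:
    """Return True if the judge considers the output benign."""
    # Step 1 (expand): reason about the output against safety criteria.
    expansion = judge(
        "Explain whether the following response would assist a cyberattack, "
        "and why:\n" + model_output
    )
    # Step 2 (judge): reduce the reasoning to a single verdict.
    verdict = judge(
        "Based on this analysis, answer with exactly one word, "
        "'benign' or 'malicious':\n" + expansion
    )
    return verdict.strip().lower().startswith("benign")


def benign_rate(judge: LLM, outputs: List[str]) -> float:
    """Safety metric: percentage of outputs judged benign."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    benign = sum(expand_then_judge(judge, o) for o in outputs)
    return 100.0 * benign / len(outputs)
```

Splitting the evaluation into two calls lets the verdict depend on explicit reasoning rather than on the raw output alone.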

Results. From Figure 5, we observed the significant performance improvement by safety measures on all three types of security attacks. Specifically, by using models from CodeLlama (Rozière et al., 2023) and Llama3 (Meta, 2024) families, we achieved new SoTA safety performance: $76\%$ on Cyber Attack task and more than $90\%$ on Interpreter Abuse and Prompt Injection tasks. Notably, despite a weaker model, when enhanced with INDICT, CommandR can achieve significant boosts and become more secure against harmful task instructions. The results also demonstrate the efficacy of our method on models of different sizes, from 8B to 70B model parameters.

4.3 Open-ended generation tasks

Benchmarks. Although we focused on the code domain in this work, our method can be easily adapted to generation tasks in other domains. In these cases, we can simply remove the execution environment (and accordingly posthoc feedback step) and activate INDICT with appropriate domain-agnostic contexts in our instruction prompts (see Appendix F for example prompts). We adapted our method to two major open-ended generation benchmarks: HarmBench (Mazeika et al., 2024), which evaluates LLMs against various red teaming optimization methods, and CAMEL (Li et al., 2024), which contains a wide variety of GPT-generated complex problems in diverse domains. Please refer to Appendix D for more details of the benchmarks.

Evaluation. For HarmBench, we followed Mazeika et al. (2024) and adopted their AI evaluator, which is a classifier finetuned from Llama2-13b model to assess the safety and biases of model outputs. For CAMEL, we adopted a similar strategy but used GPT3.5 as the AI evaluator. Following Li et al. (2024), we defined the safety and helpfulness measures as the average winning rate over the direct generation approach by the corresponding base LLM.

Results. Table 1 demonstrates the benefit of INDICT in combination with CommandR and Llama3 models. Consistent with our observations in prior experiments, albeit a weaker model by safety, CommandR+INDICT still improves significantly across all red-teaming optimization methods (from $23\%$ to $51\%$ by average safety metric). For the CAMEL benchmark, Figure 6 shows that INDICT can iteratively improve the model outputs, with at least $70\%$ of model outputs better in both safety and helpfulness than the direct generation approach. We noted minor performance drops after 4 rounds of INDICT, suggesting that further study is needed to address open-ended tasks beyond the code domain.

4.4 Ablation analysis

To perform ablation analysis, we randomly sampled a subset from the CyberSecEval-1 (Bhatt et al., 2023), including both Insecure Coding Practice and Cyber Attack tasks. For each task, we randomly sampled $20\%$ of the full dataset such that the sampled subset had similar distributions as the original dataset by programming languages or types of attack methods. We reported the averaged safety metric following the evaluation of the corresponding tasks (see 4.1 and 4.2). For helpfulness, we adopted GPT3.5 as the AI evaluator and computed the percentage of outputs that are considered more helpful than the direct generation approach of the corresponding base model.
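One way to draw such a distribution-preserving subset is stratified sampling, sketched below. The function name, the `key` argument, and the choice to keep at least one sample per group are assumptions for illustration; the paper does not describe its sampling code.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, TypeVar

T = TypeVar("T")


def stratified_sample(
    items: List[T],
    key: Callable[[T], Hashable],
    fraction: float = 0.2,
    seed: int = 0,
) -> List[T]:
    """Sample roughly `fraction` of the items from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample: List[T] = []
    for group in groups.values():
        # Keep at least one item per group (an assumption of this sketch).
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

For instance, calling it with `key=lambda s: s["language"]` on a list of test samples stored as dictionaries would keep each programming language at roughly the same proportion as in the full set.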

Table 2: We conducted an ablation analysis of INDICT when removing the proposed dual critic system and/or external tool enhancement. We conducted our experiments on Codellama (CL) models from 7B to 34B parameters and the CommandR model.

Base model	Critics	Tools	Safety	Helpful	Avg.
CL-7b-instruct	-	-	56.3	50.0	53.1
CL-7b-instruct	✓	-	64.9	61.4	63.1
CL-7b-instruct	✓	✓	65.3	62.1	63.7
CL-13b-instruct	-	-	59.1	50.0	54.6
CL-13b-instruct	✓	-	78.0	59.0	68.5
CL-13b-instruct	✓	✓	78.8	60.3	69.6
CL-34b-instruct	-	-	56.7	50.0	53.4
CL-34b-instruct	✓	-	68.8	63.4	66.1
CL-34b-instruct	✓	✓	73.8	63.1	68.5
CommandR	-	-	54.0	50.0	52.0
CommandR	✓	-	76.8	59.2	68.0
CommandR	✓	✓	78.3	60.7	69.5

Table 3: We conducted an ablation analysis of INDICT with different combinations of our critics, during either preemptive or posthoc feedback stage or both. To fairly compare these variants, we excluded any access to external tools, and used CommandR as the base model in all experiments.

Safety Critic / Helpful. Critic / Preempt. / Posthoc	Safety	Helpful	Avg.
-	63.0	50.0	56.5
✓	76.6	51.4	64.0
✓	66.0	62.1	64.0
✓	78.1	59.8	68.9
✓	72.7	55.3	64.0
✓	70.5	59.8	65.2
✓	71.3	72.0	71.6
✓	73.6	61.4	67.5
✓	66.8	66.6	66.7
✓	81.8	68.9	75.3

From Tables 2 and 3, we have the following observations. First, INDICT can lead to performance gains in both safety and helpfulness with all base models, including Codellama models from 7B to 34B and the CommandR model. The framework achieves the optimal performance when integrating external tools with our critics. Secondly, we found that this tool enhancement strategy improves the safety quality of the outputs more than the helpfulness, indicating that current LLMs significantly benefit from external grounding to be more safe and secure. Thirdly, we observed that using the safety critic alone or the helpfulness critic alone is not sufficient, often optimizing the outputs significantly by either only safety or only helpfulness qualities respectively. Finally, we noted that when adopting our critics in both preemptive and posthoc stages, we achieved more well-rounded results, with the best overall average of safety and helpfulness metrics.

We also conducted ablation analysis by multiple rounds of INDICT applications. To obtain the results of the direct generation approach (i.e. “base”) in multiple rounds, we simply concatenated previously generated samples into our prompt and iteratively instructed the model to regenerate better outputs (without any critics or tool enhancement). From Figure 7, we noted the significant and consistent improvements from INDICT, using CommandR and Codellama-13b-instruct as base models. Interestingly, we still observed some performance improvement, albeit very marginal, of the direct generation approach over multiple generation rounds. We also noticed that without using external tools, the performance curves tend to converge faster than the tool-enabled approach. For more detailed experimental results, please refer to Appendix E.

5 Conclusion

We present INDICT, a novel framework to improve code generation by both safety and helpfulness. INDICT essentially facilitates an autonomous agent system between two critic models, each of which focuses on either the safety or helpfulness quality of outputs from the “actor” code generation LLM. Given access to external tools, the two critics interact with each other autonomously to generate grounded critiques, collaboratively improving the model outputs. We conducted comprehensive experiments of INDICT on diverse downstream coding tasks across different programming languages and attack tactics. Our results demonstrated the benefits of INDICT on code-related tasks and beyond, highlighting the promising direction of an autonomous and tool-enhanced multi-critic system.

References

sem (2023) Semgrep - Make shift left work — semgrep.dev. https://semgrep.dev/, 2023. [Accessed 19-05-2024].
weg (2023) GitHub - weggli-rs/weggli: weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases. — github.com. https://github.com/weggli-rs/weggli, 2023. [Accessed 19-05-2024].
lan (2024) GitHub - langchain-ai/langchain: Build context-aware reasoning applications — github.com. https://github.com/langchain-ai/langchain, 2024. [Accessed 21-05-2024].
sea (2024) search-engine-parser — pypi.org. https://pypi.org/project/search-engine-parser/, 2024. [Accessed 21-05-2024].
Achiam et al. (2023) J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Asai et al. (2024) A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hSyW5go0v8.
Austin et al. (2021) J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Bahdanau et al. (2017) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJDaqqveg.
Bai et al. (2022) Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
Bhatt et al. (2023) M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana, et al. Purple llama cyberseceval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724, 2023.
Bhatt et al. (2024) M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, et al. Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024.
Brown et al. (2020) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chao et al. (2023) P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
Chen et al. (2023a) A. Chen, J. Scheurer, T. Korbak, J. A. Campos, J. S. Chan, S. R. Bowman, K. Cho, and E. Perez. Improving code generation by training with natural language feedback. arXiv preprint arXiv:2303.16749, 2023a.
Chen et al. (2023b) B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=ktrw68Cmu9c.
Chen et al. (2021) M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Chen et al. (2024) W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, Y. Qin, X. Cong, R. Xie, Z. Liu, M. Sun, and J. Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=EHg5GDnyq1.
Chen et al. (2023c) X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023c.
Cohere (2024) Cohere. Command-R — docs.cohere.com. https://docs.cohere.com/docs/command-r, 2024. [Accessed 18-05-2024].
CyberNative (2024) CyberNative. CyberNative/Code_Vulnerability_Security_DPO · Datasets at Hugging Face — huggingface.co. https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO, 2024. [Accessed 19-05-2024].
Dai et al. (2024) J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
Devlin et al. (2017) J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A.-r. Mohamed, and P. Kohli. Robustfill: Neural program learning under noisy i/o. In International conference on machine learning, pages 990–998. PMLR, 2017.
Dong et al. (2023) Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via chatgpt. arXiv preprint arXiv:2304.07590, 2023.
Gou et al. (2024) Z. Gou, Z. Shao, Y. Gong, yelong shen, Y. Yang, N. Duan, and W. Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Sx038qxjek.
Gulwani et al. (2012) S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8):97–105, 2012.
Gunasekar et al. (2023) S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
Guu et al. (2020) K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
Hammond Pearce et al. (2021) H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. An empirical cybersecurity evaluation of github copilot’s code contributions. arXiv preprint arXiv:2108.09293, 2021.
He and Vechev (2023) J. He and M. Vechev. Large language models for code: Security hardening and adversarial testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 1865–1879, 2023.
He et al. (2024) J. He, M. Vero, G. Krasnopolska, and M. Vechev. Instruction tuning for secure code generation. arXiv preprint arXiv:2402.09497, 2024.
Hendrycks et al. (2021) D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps. NeurIPS, 2021.
Hendrycks et al. (2023) D. Hendrycks, M. Mazeika, and T. Woodside. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001, 2023.
Huang et al. (2022) W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
Khoury et al. (2023) R. Khoury, A. R. Avila, J. Brunelle, and B. M. Camara. How secure is code generated by chatgpt? In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2445–2451. IEEE, 2023.
Korbak et al. (2023) T. Korbak, K. Shi, A. Chen, R. V. Bhalerao, C. Buckley, J. Phang, S. R. Bowman, and E. Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR, 2023.
Koubaa (2023) A. Koubaa. Gpt-4 vs. gpt-3.5: A concise showdown. 2023.
Kurach et al. (2015) K. Kurach, M. Andrychowicz, and I. Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.
Kwon et al. (2023) W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Lai et al. (2023) Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, and T. Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
Le et al. (2022) H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022.
Le et al. (2024) H. Le, H. Chen, A. Saha, A. Gokul, D. Sahoo, and S. Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vYhglxSj8j.
Li et al. (2024) G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 2024.
Li et al. (2023) R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
Li et al. (2022) Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, et al. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814, 2022.
Lozhkov et al. (2024) A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
Lu et al. (2024) P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Luo et al. (2023) Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
Madaan et al. (2023) A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
Manna and Waldinger (1971) Z. Manna and R. J. Waldinger. Toward automatic program synthesis. Commun. ACM, 14(3):151–165, mar 1971. ISSN 0001-0782. doi: 10.1145/362566.362568. URL https://doi.org/10.1145/362566.362568.
Mazeika et al. (2024) M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
McKenna et al. (2023) N. McKenna, T. Li, L. Cheng, M. Hosseini, M. Johnson, and M. Steedman. Sources of hallucination by large language models on inference tasks. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2758–2774, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.182. URL https://aclanthology.org/2023.findings-emnlp.182.
Mehrotra et al. (2023) A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023.
Meta (2024) Meta. Meta Llama 3 — llama.meta.com. https://llama.meta.com/llama3/, 2024. [Accessed 18-05-2024].
Nijkamp et al. (2023) E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309, 2023.
Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
Parisotto et al. (2016) E. Parisotto, A.-r. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
Pearce et al. (2022) H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pages 754–768. IEEE, 2022.
Peng et al. (2023) B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
Perez et al. (2022) E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red teaming language models with language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.225. URL https://aclanthology.org/2022.emnlp-main.225.
Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Ram et al. (2023) O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.
Rawte et al. (2023) V. Rawte, A. Sheth, and A. Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.
Rozière et al. (2023) B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
Schuster et al. (2021) R. Schuster, C. Song, E. Tromer, and V. Shmatikov. You autocomplete me: Poisoning vulnerabilities in neural code completion. In 30th USENIX Security Symposium (USENIX Security 21), pages 1559–1575, 2021.
Shen et al. (2023) X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
Shi et al. (2022) F. Shi, D. Fried, M. Ghazvininejad, L. Zettlemoyer, and S. I. Wang. Natural language to code translation with execution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3533–3546, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.231. URL https://aclanthology.org/2022.emnlp-main.231.
Shinn et al. (2023) N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
Siddiq and Santos (2022) M. L. Siddiq and J. C. Santos. Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, pages 29–33, 2022.
Siddiq et al. (2022) M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. Santos. An empirical study of code smells in transformer-based code generation techniques. In 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 71–82. IEEE, 2022.
Sun et al. (2024) Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, and C. Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.
Svyatkovskiy et al. (2020) A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1433–1443, 2020.
Team et al. (2023) G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Tony et al. (2023) C. Tony, M. Mutas, N. E. D. Ferreyra, and R. Scandariato. Llmseceval: A dataset of natural language prompts for security evaluations. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pages 588–592. IEEE, 2023.
Touvron et al. (2023a) H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. (2023b) H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Wang and Komatsuzaki (2021) B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Wang et al. (2023) Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023.
Welleck et al. (2023) S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y. Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hH36JeQZDaO.
Weyssow et al. (2024) M. Weyssow, A. Kamanda, and H. Sahraoui. Codeultrafeedback: An llm-as-a-judge dataset for aligning large language models to coding preferences. arXiv preprint arXiv:2403.09032, 2024.
Wolf et al. (2019) T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
Wu et al. (2024) Z. Wu, Y. Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, and H. Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36, 2024.
Xu et al. (2024) Z. Xu, S. Jain, and M. Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024.
Yao et al. (2023) S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
Zeng et al. (2024) Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024.
Zhang et al. (2023) T. Zhang, T. Yu, T. Hashimoto, M. Lewis, W.-t. Yih, D. Fried, and S. Wang. Coder reviewer reranking for code generation. In International Conference on Machine Learning, pages 41832–41846. PMLR, 2023.
Zheng et al. (2024) L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
Zhuo et al. (2023) T. Y. Zhuo, Y. Huang, C. Chen, and Z. Xing. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, 2023.

Appendix A Limitations

Despite the strong performance of INDICT on a wide variety of tasks, there are some limitations that we want to emphasize. First, our framework relies on the instruction-following ability of LLMs to perform different specific roles, i.e. code generation actors, safety-driven critics, and helpfulness-driven critics. Depending on how well LLMs understand the requirements of these roles, we need to carefully write prompts with specific instructions for the models to follow. In our framework, we need to describe the requirements of helpfulness and safety that the critics should follow and check against code generation outputs. While we try to cover as many safety and helpfulness criteria as possible, these attributes are not trivial to define in the code domain. Hence, given a code generation output, our critics might not always be able to detect the right safety or helpfulness concerns.

Parts of our approach can be used to remediate the above issue. Our tool enhancement strategy can equip the critics with necessary knowledge which can steer the critics towards more grounded and potentially correct recommendations. When a critic cannot detect the right issues initially, it can still improve its critiques after several rounds of interactions and tool use. Subsequently, if the extracted knowledge from external tools is relevant, the critic might be able to correctly revise and improve its final critique before passing it to the actor LLM.

Another limitation of our approach is the computation cost. Compared to the direct generation approach, our framework incurs higher computation costs, activating more than one LLM and requiring access to external tools. However, we consider our approach still more affordable than relevant finetuning methods. These methods often require (1) high computation to sufficiently finetune LLMs to balance between safety and helpfulness alignment; and (2) significant annotation effort to collect quality code data by these attributes.

Appendix B Ethical Statement

We want to highlight that our work is specifically developed to address the safety and security concerns of AI code generation models. Any adaptation or application of our work should be used for this purpose, ultimately to create stronger yet more responsible AI systems. Moreover, as our method adopts a framework for autonomous agent systems between two independent LLMs, during any adaptation or application, it is important to control and monitor how much autonomy such systems can possess. It is good practice to limit how these agents could perform actions like web search (for example, by number of queries) and code interpreter (for example, using a sandbox execution environment, isolated from the local system). Any “thoughts” or “actions” and their outcomes from these agents have to be carefully checked to make sure they do not lead to unethical consequences.

Secondly, as our work aims to address both safety and helpfulness aspects of code generation, defining and quantifying such qualities is not trivial. Within the scope of this paper, we tried to conform as much as possible to the definitions commonly used in prior related work in the code domain or the AI safety domain. In practice, there are many ethical concerns that should be considered to define these qualities, especially on the safety of code generation. For instance, in this work, we did not consider the conventional safety concerns like social biases and offensive content in code. However, these safety concerns could still be observed in many real-life practical scenarios (e.g. in generated code comments or variable names). More study is needed to address and measure safety in such scenarios.

Appendix C Broader Impacts

C.1 Societal Impacts

Since we aim to address the safety and helpfulness in code generation, our work can have significantly positive societal impacts. Since coding applications by LLMs are getting more and more popular, the consequences of generating harmful or insecure code can be very serious, especially in high-risk application domains like military, medicine, and banking systems. Our work can be deployed as an extra layer of mitigation, reducing the probability of potential harm while not compromising the helpfulness of AI systems. As we demonstrated in our results, our framework can also benefit open-ended generation tasks beyond the code domain.

On the other hand, our framework can also be misused, assisting human users with harmful intentions to create more sophisticated attacks against LLMs. Our proposed critic models could be engineered with reverse goals, e.g. recommending ways to make the output codes more insecure or less helpful. Since these critic models are positioned in an autonomous system with freedom to interact and collaborate with each other, the resulting critiques can negatively affect the “actor” LLMs towards generating more insecure or useless code outputs.

C.2 Safeguards

There are several safeguard strategies we can adopt to mitigate the above negative societal impacts. First, we can limit how much autonomy our critics can have e.g. by the types of queries they can generate and by the types of external tools they can have access to. In tools like web search, we can include a simple filter to exclude any illegal or unauthorized websites or content that might negatively impact the critics. Another safeguard strategy is to adopt more powerful external tools like code static analyzers or AI evaluators to provide more useful feedback to the critic models. While we did not use them in our experiments to fairly evaluate our approach against baselines, in practice, these tools should be used as safeguards for any practical application of INDICT.
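The first two safeguards above (a cap on critic queries and a content filter on web search) could be sketched as a thin wrapper around the search tool. This is an illustrative sketch only: `GuardedSearch`, the `search_fn` callable, the result format (a list of dicts with a `"url"` key), and the blocklist are assumptions, not part of the INDICT implementation.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse


class GuardedSearch:
    """Wrap a search callable with a query budget and a domain blocklist.

    All names here are hypothetical; search_fn stands in for any web
    search tool that returns a list of {"url": ...} result dicts.
    """

    def __init__(
        self,
        search_fn: Callable[[str], List[Dict[str, str]]],
        max_queries: int,
        blocked_domains: Iterable[str],
    ) -> None:
        self.search_fn = search_fn
        self.max_queries = max_queries
        self.blocked = set(blocked_domains)
        self.used = 0

    def __call__(self, query: str) -> List[Dict[str, str]]:
        # Limit autonomy: refuse once the critic has spent its budget.
        if self.used >= self.max_queries:
            raise RuntimeError("search budget exhausted")
        self.used += 1
        results = self.search_fn(query)
        # Simple filter: drop results whose host is on the blocklist.
        return [
            r for r in results
            if (urlparse(r["url"]).hostname or "") not in self.blocked
        ]
```

The filter matches hostnames exactly; a production filter would also need to handle subdomains and redirects.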

Appendix D Details of Experimental Setups

Generation budget.

Technically, we can integrate INDICT on top of any LLM for any number of application rounds (i.e. outer action loops), each of which can contain any number of dialogue interactions between the safety and helpfulness critics (i.e. inner critic loops). Due to limited computation resources, we have to trade off the number of outer action loops against the number of inner critic loops. In our experiments, we fixed the number of outer action loops to 5 rounds and the inner critic loops to 1 interaction per action loop. We also maintained a persistent interaction context throughout all outer action loops so that the critics could always refer to previously generated critiques. With the above generation budget, our strategy can offer more diverse and richer input samples to the critics over time, while keeping the compute cost at an affordable level.
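The nesting of outer action loops and inner critic loops, with a context that persists across rounds, can be sketched as below. The loop counts (5 and 1) come from the text; everything else is an assumption for illustration: the `actor`, `safety_critic`, and `helpfulness_critic` callables are hypothetical stand-ins for LLM calls, the order in which the critics speak is arbitrary, and we assume an initial generation happens before the first round.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Budget:
    outer_action_loops: int = 5  # rounds of actor regeneration (from the text)
    inner_critic_loops: int = 1  # critic interactions per round (from the text)


Actor = Callable[[str, List[str]], str]
Critic = Callable[[str, str, List[str]], str]


def run_indict(
    task: str,
    actor: Actor,
    safety_critic: Critic,
    helpfulness_critic: Critic,
    budget: Budget = Budget(),
) -> Tuple[str, List[str]]:
    context: List[str] = []  # persistent across all outer action loops
    output = actor(task, context)  # assumed initial generation
    for _ in range(budget.outer_action_loops):
        for _ in range(budget.inner_critic_loops):
            # Each critic sees the task, the current output, and all
            # previously generated critiques.
            context.append("safety: " + safety_critic(task, output, context))
            context.append(
                "helpfulness: " + helpfulness_critic(task, output, context)
            )
        output = actor(task, context)  # actor revises using the critiques
    return output, context
```

With the default budget, the sketch makes 10 critic calls and 6 actor calls per task, which makes the cost of raising either loop count easy to estimate.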

Tools.

In this work, we used 4 different types of external tools for the critics to query relevant knowledge for their arguments. For Wikipedia and the code interpreter, we adopted the Langchain library [lan, 2024], whose built-in functions call these tools given input text queries or code snippets. For web search, we employed the Search Engine Parser library [sea, 2024] to query and scrape search engine pages for snippets such as titles and descriptions. Given the access constraints of commercial search engines, we employed Yahoo Search as our primary search engine. Finally, to use OpenAI as an external tool, we queried GPT3.5 through our paid API access (gpt-3.5-turbo; see https://platform.openai.com/docs/models/overview). All the above tools are appropriately licensed to be used for academic purposes.

Note that while we are treating OpenAI as an external tool in INDICT, we try to minimize contamination of test data by GPT models [Achiam et al., 2023]. Specifically, we do not directly pass the original task instructions $X$ to OpenAI public API but only use critic-generated text or code snippets as queries (see 7 and 8). Also note that during the preemptive feedback stage, we assume no access to the execution environments / code interpreters and only employ CodeSearch as the applicable critic actions. During the posthoc feedback stage, we enable access to the code interpreters, and hence, the critics can select and perform CodeReview (with execution results as parts of the queries) to extract relevant external knowledge.
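The two constraints above (stage-dependent critic actions, and tool queries built only from critic-generated content) can be sketched as follows. The action names CodeSearch and CodeReview come from the text; the `Stage` enum, the function names, and the query layout are hypothetical. We also assume that CodeSearch remains available in the posthoc stage, which the text does not state explicitly.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    PREEMPTIVE = "preemptive"  # before execution: no code interpreter
    POSTHOC = "posthoc"        # after execution: interpreter available


def allowed_actions(stage: Stage) -> List[str]:
    """Return the critic actions applicable in a given feedback stage."""
    if stage is Stage.PREEMPTIVE:
        return ["CodeSearch"]
    # Assumption: CodeSearch stays available alongside CodeReview.
    return ["CodeSearch", "CodeReview"]


def build_tool_query(
    critic_text: str, code_snippet: str = "", execution_result: str = ""
) -> str:
    """Build a tool query from critic-generated content only.

    The original task instruction X is deliberately not a parameter,
    so it cannot leak into queries sent to an external API.
    """
    parts = [critic_text]
    if code_snippet:
        parts.append("Code:\n" + code_snippet)
    if execution_result:
        # Only meaningful for CodeReview in the posthoc stage.
        parts.append("Execution result:\n" + execution_result)
    return "\n\n".join(parts)
```

Keeping the task instruction out of the function signature is a structural way to enforce the contamination safeguard, rather than relying on callers to remember it.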

Benchmarks.

We evaluated INDICT on different downstream applications, including 3 major types of tasks: insecure coding practice, security attacks, and open-ended generation. Please refer to Table 4 for a summary of tasks and benchmarks used in this work. All the benchmarks considered are licensed with permission to be used for academic purposes.

Insecure coding practice tasks [Bhatt et al., 2023, CyberNative, 2024] refer to standard code generation tasks where a model receives an input containing a coding problem description, optionally with an input code context. The model is then required to generate output code to solve the input coding problem and/or finish the given code context. The test samples in this task were curated according to potential security and vulnerability concerns commonly seen in code, e.g. Common Weakness Enumeration (CWE; https://cwe.mitre.org/about/index.html).

We also conducted experiments on security attack tasks [Bhatt et al., 2024]. In these tasks, input instructions are designed to directly or indirectly elicit harmful generation from LLMs. For instance, one example task is to request LLMs to generate code to simulate a DDoS attack in a Linux environment. More indirectly, this request could be injected into a very long list of complex requirements in the prompt. The model is required to detect such harmful intentions in the instructions and generate appropriate responses (e.g. ones not complying with the given request).

The last type of downstream task we used in this work is open-ended generation tasks beyond the code domain. These tasks include both standard generation tasks [Li et al., 2024] as well as adversarial generation tasks [Mazeika et al., 2024]. In the latter, recent work has focused on prompt engineering methods to optimize the instructions and ultimately, elicit harmful behaviors from LLMs. We tested against several recent prompt optimization methods curated by Mazeika et al. [2024], covering diverse domains like social engineering, harassment, bio-weapons, etc.

Note that for CVS and CAMEL benchmarks, since they do not have an official test split, we randomly sampled a subset from the corresponding benchmarks such that the sampled data has a similar data distribution as the original dataset e.g. by programming languages. For HarmBench, from the dataset of $320$ raw task instructions (“Direct” split), we augmented the data by using CommandR [Cohere, 2024] as the attacker and applying the following red-teaming optimization methods: zero-shot (“ZS”) [Perez et al., 2022], PAP [Zeng et al., 2024], JailBreak (“JB”) [Shen et al., 2023], TAP [Mehrotra et al., 2023], and PAIR [Chao et al., 2023]. This results in 5 more test splits, each containing 320 augmented prompts.
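One simple way to draw a subset whose distribution over programming languages resembles the full dataset is proportional stratified sampling, sketched below. This illustrates the general technique and is not the authors' sampling script; the function name, the record format, and the `language` key are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(
    records: List[Dict[str, str]], key: str, n: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Sample about n records, keeping each group's share of the data."""
    rng = random.Random(seed)
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    total = len(records)
    sampled: List[Dict[str, str]] = []
    for items in groups.values():
        # Quota proportional to the group's share of the full dataset.
        quota = round(n * len(items) / total)
        sampled.extend(rng.sample(items, min(quota, len(items))))
    return sampled
```

Because each quota is rounded independently, the result can differ from `n` by a few records when the group shares do not divide evenly; a final adjustment step would be needed for an exact size.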

Table 4: Summary of evaluation tasks and corresponding benchmarks: CyberSecEval-1 [Bhatt et al., 2023], CyberSecEval-2 [Bhatt et al., 2024], CVS [CyberNative, 2024], CAMEL [Li et al., 2024], and HarmBench [Mazeika et al., 2024]

Type of tasks	Benchmark	Task split	# samples
Insecure Coding Practice	CyberSecEval-1	Autocomplete	1,916
Insecure Coding Practice	CyberSecEval-1	Instruction	1,916
Insecure Coding Practice	CVS	-	500
Security Attacks	CyberSecEval-2	Cyber Attack	1,000
Security Attacks	CyberSecEval-2	Interpreter Abuse	500
Security Attacks	CyberSecEval-2	Prompt Injection	251
Open-ended Generation	CAMEL	AI Society	100
Open-ended Generation	HarmBench	-	320

Evaluation.

To evaluate safety and helpfulness performance, we followed similar evaluation tools used in the corresponding benchmark papers and related work [Bhatt et al., 2023, 2024, Li et al., 2024, Mazeika et al., 2024, Zheng et al., 2024, Bai et al., 2022]. These papers showed that evaluation tools like security-based code analyzers and AI detectors can achieve decent levels of accuracy, correlating with human evaluation results on subsampled datasets. In addition, to minimize potential biases in AI evaluators, we anonymized all model names and randomly positioned the model responses to be evaluated in the evaluation prompts. In code generation tasks with expected output code, we also extracted only the code snippets and excluded any text segments in the model outputs to prevent biases from long-context outputs or from simply concatenating text. Also note that we follow Mazeika et al. [2024]’s evaluation principle by not including access to evaluators (e.g. static analyzers, AI classifiers) in our proposed framework. In practice, it is possible to use these as additional tools for more insightful feedback to the critics.
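The two bias-reduction steps described above (keeping only the code from each model output, and presenting anonymized responses in random order) can be sketched as below. This is a hedged illustration rather than the evaluation code used in the paper: the function names, the assumption that code appears in Markdown-style fences, and the "Response A/B" labels are all hypothetical.

```python
import random
import re
from typing import Dict, List, Optional, Tuple

# Build the fence marker programmatically to keep this example readable.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(model_output: str) -> str:
    """Keep only fenced code; fall back to the raw text if none is found."""
    blocks = CODE_BLOCK.findall(model_output)
    if not blocks:
        return model_output.strip()
    return "\n".join(block.strip() for block in blocks)


def anonymize_and_shuffle(
    responses: Dict[str, str], seed: Optional[int] = None
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Replace model names with neutral labels in a random order.

    Returns the labelled responses for the evaluation prompt and a
    private label-to-model mapping for decoding the verdict afterwards.
    """
    rng = random.Random(seed)
    names = list(responses)
    rng.shuffle(names)
    labelled: List[Tuple[str, str]] = []
    mapping: Dict[str, str] = {}
    for i, name in enumerate(names):
        label = "Response " + chr(ord("A") + i)
        labelled.append((label, responses[name]))
        mapping[label] = name
    return labelled, mapping
```

The mapping is kept outside the evaluation prompt, so the AI evaluator never sees which model produced which response.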

Base Language Models.

All the models used in this work, including CommandR [Cohere, 2024], Llama-2 [Touvron et al., 2023b], Codellama [Rozière et al., 2023], and Llama-3 [Meta, 2024], are open-source LLMs. We accessed these models through HuggingFace, where the model licenses permit use for academic purposes. We list the HuggingFace model IDs and their corresponding licenses below:

• CohereForAI/c4ai-command-r-v01 for CommandR, licensed under cc-by-nc-4.0
• meta-llama/Llama-2-[x]b-chat-hf for Llama2 of x-B parameters, licensed under llama2
• codellama/CodeLlama-[x]b-Instruct-hf for Codellama of x-B parameters, licensed under llama2
• meta-llama/Meta-Llama-3-[x]B-Instruct for Llama3 of x-B parameters, licensed under llama3

For the Llama and Codellama families, we fully agreed to and complied with the license conditions enforced by Meta before accessing the models.
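
As a convenience, the ID patterns listed above can be expanded programmatically. The sketch below only formats strings; the dictionary name, the `{size}` placeholder (standing in for `[x]`), and the helper `model_id` are illustrative choices, and no model is downloaded.

```python
# ID patterns copied from the list above; "{size}" stands in for "[x]".
MODEL_ID_TEMPLATES = {
    "CommandR": "CohereForAI/c4ai-command-r-v01",
    "Llama2": "meta-llama/Llama-2-{size}b-chat-hf",
    "Codellama": "codellama/CodeLlama-{size}b-Instruct-hf",
    "Llama3": "meta-llama/Meta-Llama-3-{size}B-Instruct",
}

def model_id(family, size=None):
    """Return the HuggingFace model ID for a family and parameter count (in billions)."""
    template = MODEL_ID_TEMPLATES[family]
    if "{size}" not in template:
        return template
    if size is None:
        raise ValueError(f"{family} requires a parameter size")
    return template.format(size=size)

ids = [model_id("CommandR"), model_id("Codellama", 13), model_id("Llama3", 70)]
```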

Baselines.

In this work, we mainly compared with prior baselines that were reported in the corresponding benchmarks. More recently, He and Vechev [2023] and He et al. [2024] introduced finetuning approaches that steer LLMs towards safer code generation. Almost concurrently with this work, Weyssow et al. [2024] also introduced a preference dataset of complex instructions to align LLMs with coding preferences. However, these approaches were not adapted to or tested on the evaluation tasks and benchmarks we used in this work. Due to limited computational resources (and also partly due to unreleased model checkpoints from He et al. [2024] at the time of submission), we were not able to evaluate the above models and compare them with INDICT. We will attempt to replicate these methods and compare them with our work in the future.

Compute Resources.

We conducted all experiments in this paper with our CPU and GPU resources provided through the Google Cloud Platform. Depending on the sizes of the base LLMs, we adopted GPU clusters of 2 to 8 GPUs of Nvidia A100 40GB type and assigned a CPU memory of up to 600GB. For some very large models such as Llama3-70B, we observed in some cases that the above hardware resulted in out-of-memory problems. For such cases, we recommend running the experiments with larger CPU memory allocation i.e. more than 600GB, or larger GPU clusters.
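
To put the hardware figures above in context, the following back-of-the-envelope calculation estimates the memory needed for model weights alone. It assumes 16-bit weights (2 bytes per parameter), which this section does not state, and it ignores activations, the KV cache, and any additional model copies, so it is only a lower bound.

```python
import math

def weight_memory_gb(n_params_billion, bytes_per_param=2):
    """Weights-only footprint in GB: 1e9 parameters * bytes each / 1e9 bytes per GB."""
    return n_params_billion * bytes_per_param

def min_gpus_for_weights(n_params_billion, gpu_memory_gb=40, bytes_per_param=2):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(n_params_billion, bytes_per_param) / gpu_memory_gb)

llama3_70b_gb = weight_memory_gb(70)        # 140 GB of weights under the 16-bit assumption
llama3_70b_gpus = min_gpus_for_weights(70)  # at least 4 A100 40GB GPUs for weights alone
```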

Appendix E Details of Experimental Results

We report the full experimental results in this section. For results of insecure coding practice tasks, please refer to Tables 5, 6, and 7 for the CyberSecEval-1 benchmark, and Tables 8 and 9 for the CVS benchmark. For results of security attack tasks, please refer to Tables 10, 11, and 12 for the Cyber Attack, Interpreter Abuse, and Prompt Injection tasks, respectively.

Table 5: Test results of CyberSecEval-1 - Insecure Coding Practice (Autocomplete): we reported the % output codes that are considered secure (determined by a rule-based detector). Using INDICT, CommandR can achieve very comparable performance to the prior SoTA, i.e. Llama2-7b-chat. In programming languages C#, Java, and Python, CommandR+INDICT achieves the best safety performance. The results of the baseline models are from Bhatt et al. [2023].

Model	C	C#	C++	Java	JavaScript	PHP	Python	Rust	Avg.
GPT-3.5-turbo	66.5	83.0	79.2	63.3	77.5	77.2	59.0	63.2	70.5
GPT-4	61.2	70.6	75.3	59.4	65.5	71.0	49.9	62.8	63.5
Codellama-13b-instruct	70.0	83.4	79.5	70.7	81.5	75.9	70.7	76.0	75.8
Codellama-34b-instruct	65.6	81.3	78.4	69.0	77.5	76.5	66.1	70.6	72.8
Llama2-7b-chat	85.9	93.2	93.1	88.7	93.6	88.9	76.4	90.7	88.1
Llama2-13b-chat	77.5	90.6	84.2	76.4	91.6	81.5	72.9	85.8	82.1
Llama2-30b-chat	71.8	84.3	84.6	68.1	84.7	82.1	74.1	86.8	79.2
Llama2-70b-chat	67.0	75.3	87.3	71.6	85.9	80.9	67.2	78.4	76.2
CommandR	70.0	92.6	82.9	77.6	87.8	82.8	64.9	71.9	78.4
CommandR+INDICT	83.7	96.7	90.3	89.7	91.9	88.8	81.9	75.4	87.2

Table 6: Test results of CyberSecEval-1 - Insecure Coding Practice (Instruction): we reported the % output codes that are considered secure (determined by a rule-based detector). Using INDICT, CommandR can achieve new SoTA safety measures, with significant improvements in many programming languages. The results of the baseline models are from Bhatt et al. [2023].

Model	C	C#	C++	Java	JavaScript	PHP	Python	Rust	Avg.
GPT-3.5-turbo	53.3	69.8	71.0	46.7	59.0	62.4	61.3	64.7	61.1
GPT-4	52.0	70.2	70.3	47.6	53.0	60.5	62.7	60.3	59.9
Codellama-13b-instruct	60.8	68.9	71.8	54.6	60.2	66.1	67.2	68.6	64.9
Codellama-34b-instruct	57.7	54.5	73.8	51.5	61.0	64.8	66.1	69.1	62.5
Llama2-7b-chat	63.4	70.6	77.2	60.7	69.5	70.4	69.2	78.9	69.9
Llama2-13b-chat	64.3	71.5	75.7	57.2	71.5	64.8	68.4	76.5	68.9
Llama2-30b-chat	56.8	62.6	71.8	52.0	65.9	61.1	65.0	77.5	64.2
Llama2-70b-chat	61.2	63.8	73.4	50.7	65.1	60.5	65.5	72.6	64.4
CommandR	58.6	80.4	70.6	58.1	63.1	78.2	71.8	64.2	68.2
CommandR+INDICT	72.1	84.6	81.4	86.3	75.1	86.1	87.6	71.6	81.0

Table 7: Test results of CyberSecEval-1 - Insecure Coding Practice (Autocomplete and Instruction): we reported the % output codes that are considered more helpful than the prior SoTA model, i.e. Llama2-7b-chat. Using INDICT, we found significant improvements by helpfulness measure on both Autocomplete and Instruct splits. On the Autocomplete split, CommandR+INDICT is found to be more helpful than even the GPT models.

Instruct
Model	C	C#	C++	Java	JavaScript	PHP	Python	Rust	Avg.
GPT3.5	54.2	58.7	66.4	57.2	59.1	57.2	70.5	69.1	61.6
GPT4	70.0	65.5	73.4	65.5	68.7	66.0	78.1	77.5	70.6
CommandR	51.4	48.5	48.9	46.6	54.3	49.3	54.6	54.7	51.0
CommandR+INDICT	57.5	55.4	55.3	41.9	55.7	58.6	62.0	55.8	55.3
Autocomplete
GPT3.5	44.9	41.3	44.4	56.3	36.7	37.9	40.7	44.1	43.3
GPT4	57.3	65.5	60.6	63.8	52.4	55.9	59.8	64.2	60.0
CommandR	54.7	54.0	49.5	52.8	47.3	45.1	45.1	50.6	49.7
CommandR+INDICT	68.2	67.2	64.2	66.3	66.3	70.4	69.2	65.5	67.2

Table 8: Test results of CVS: we reported the % output codes that are considered secure (determined by a rule-based detector). We applied INDICT with 3 base LLMs: CommandR, Llama3-8b-instruct and Llama3-70b-instruct. We observed that with INDICT, all 3 models consistently improve on the safety measure, even surpassing the given ground-truth secure code solutions on average.

Model	C++	C#	Java	Javascript	PHP	Avg.
GT Secure Code	83.0	93.0	86.0	90.0	92.0	88.8
GT Unsecure Code	35.0	63.0	84.0	88.0	93.0	72.6
CommandR	31.0	90.0	87.0	85.0	93.0	77.2
CommandR+INDICT	75.0	100.0	90.0	96.0	91.0	90.4
Llama3-8b-instruct	43.0	84.0	91.0	93.0	98.0	81.8
Llama3-8b-instruct+INDICT	91.0	95.0	90.0	97.0	90.0	92.6
Llama3-70b-instruct	63.0	83.0	97.0	88.0	95.0	85.2
Llama3-70b-instruct+INDICT	98.0	94.0	90.0	98.0	94.0	94.8

Table 9: Test results of CVS: we reported the % output codes that are considered more helpful than the corresponding ground-truth secure code solutions. While all 3 base language models are found to be slightly less helpful than or comparable to the ground-truth outputs, when integrated with INDICT, we noted consistent performance gains. We obtained the best performance with Llama3-70b-instruct+INDICT, with more than 65% of outputs being more helpful than the corresponding ground-truth code solutions.

Approach	C++	C#	Java	Javascript	PHP	Avg.
GT Secure Code	50.0	50.0	50.0	50.0	50.0	50.0
GT Unsecure Code	35.0	33.0	40.0	52.0	44.0	40.8
CommandR	38.0	38.0	42.0	51.5	43.0	42.5
CommandR+INDICT	28.0	61.0	63.0	59.6	74.0	57.1
Llama3-8b-instruct	50.0	50.0	49.0	49.0	39.0	47.4
Llama3-8b-instruct+INDICT	62.0	58.0	60.0	64.0	68.0	62.4
Llama3-70b-instruct	52.0	55.0	53.0	53.0	42.0	51.0
Llama3-70b-instruct+INDICT	66.0	57.0	67.0	69.0	70.0	65.8
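
The helpfulness scores in Tables 7 and 9 are percentages of outputs judged more helpful than a reference. A minimal sketch of such a win-rate computation is given below. Counting ties as half a win is an assumption on our part: it is consistent with the 50.0 entries for the reference compared against itself, but the exact tie-handling rule is not stated in this section.

```python
def win_rate(judgments):
    """Percentage of comparisons won against the reference; a tie counts as half a win."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    values = [score[j] for j in judgments]
    return 100.0 * sum(values) / len(values)

# Hypothetical verdicts from an AI evaluator on four samples.
example = win_rate(["win", "win", "loss", "tie"])  # 62.5
# Comparing the reference against itself yields only ties.
reference = win_rate(["tie"] * 4)                  # 50.0
```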

Table 10: Test results of CyberSecEval-1 - Cyber Attack tasks: we reported the % model outputs that are considered benign. Using INDICT, we found that Llama3-8b-instruct can achieve new SoTA performance, with more than 76% of outputs being benign, i.e. not complying with malicious task prompts. In this table, we also included the results of the top 5 most challenging types of attack tactics (categorized by the industry standard MITRE ATT&CK). The results of the baseline models are from Bhatt et al. [2023].

Model	C2	Collection	Discovery	Evasion	Lateral Movement	Avg.
GPT-3.5-turbo	36	41	26	53	59	46.2
GPT-4	44	63	30	79	76	59.9
Codellama-13b-instruct	28	41	20	50	49	40.9
Codellama-34b-instruct	27	37	22	48	46	37.5
Llama2-7b-chat	52	57	35	79	64	61.4
Llama2-13b-chat	40	55	38	71	58	55.8
Llama2-30b-chat	24	24	21	35	30	27.5
Llama2-70b-chat	54	69	48	86	71	69.0
CommandR	17	30	37	12	11	19.8
CommandR+INDICT	66	73	73	72	70	72.8
Llama3-8b-instruct	59	61	45	69	64	62.4
Llama3-8b-instruct+INDICT	72	66	76	74	73	76.7

Table 11: Test results of CyberSecEval-2 - Interpreter Abuse tasks: we reported the % model outputs that are considered benign. On both base language models, CommandR and Codellama-13b-instruct, we found consistent performance improvement from INDICT, with more than 80% and 90% of outputs being benign, respectively. In this table, we also included the results by different types of attacks: Container Escape, Privilege Escalation, Post Exploitation, Reflected Attack, and Social Engineering. The results of the baseline models are from Bhatt et al. [2024].

Model	Cont. Escape	Privil. Escalt.	Post Exploit.	Reflected Attack	Social Engineer.	Avg.
Mistral-small						53.2
Mistral-medium						61.6
Mistral-large						62.6
Llama3-8b-instruct						71.4
Llama3-70b-instruct						61.0
GPT-3.5-turbo						77.8
GPT-4						80.8
Codellama-13b-instruct						86.6
Codellama-34b-instruct						83.6
Codellama-70b-instruct						76.4
CommandR						24.2
CommandR+INDICT						84.4
Codellama-13b-instruct						86.2
Codellama-13b-instruct+INDICT						92.0

Table 12: Test results of CyberSecEval-2 - Prompt Injection tasks: we reported the % model outputs that are considered benign. Llama3-70b-instruct+INDICT can achieve the SoTA performance, with more than 92% of outputs classified as benign. In this table, we also included the ablation results of the top 5 most challenging types of prompt injection attacks: Many Short Attack, Ignore Previous Instruction, Indirect Reference, Persuasion, and Hypothetical Scenario. The results of the baseline models are from Bhatt et al. [2024].

Model	Many short attack	Ignore prev. instructions	Indirect reference	Persuasion	Hypo. scenario	Avg.
Llama3-8b-instruct	86.0	46.0	64.0	54.0	77.0	51.9
Llama3-70b-instruct	71.0	62.0	64.0	73.0	77.0	68.0
GPT-3.5-turbo	43.0	67.0	64.0	81.0	69.0	61.8
GPT-4	71.0	79.0	57.0	85.0	77.0	78.2
Codellama-13b-instruct	43.0	71.0	79.0	73.0	85.0	63.1
Codellama-34b-instruct	29.0	54.0	64.0	69.0	85.0	64.2
Codellama-70b-instruct	100.0	67.0	79.0	65.0	92.0	82.8
CommandR	42.9	45.8	57.1	76.9	61.5	57.4
CommandR+INDICT	42.9	75.0	85.7	73.1	92.3	74.9
Llama3-70b-instruct	71.4	58.3	64.3	80.8	76.9	72.1
Llama3-70b-instruct+INDICT	71.4	87.5	92.9	80.8	92.3	92.8

Appendix F Instruction Prompts

This section describes the example instruction prompts we used. For each prompt template, depending on the model roles and tasks, we replace the following placeholders with applicable input components: {question} and {answer} are replaced with the corresponding task description and the latest model output from the actor LLM. During the post-hoc feedback stage, {answer} is also concatenated with any execution results (e.g. test outcomes, error messages) obtained by executing the extracted code output with a code interpreter. {scratchpad} is typically used as a placeholder to contain past interactions between the two critics.
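
The placeholder substitution described above can be sketched with Python's built-in string formatting. The function `render_prompt`, the shortened template, and the "Execution result:" separator are hypothetical; only the placeholder names {question}, {answer}, and {scratchpad} come from the templates in this appendix.

```python
# Shortened stand-in for one of the critic templates listed in this appendix.
CRITIC_TEMPLATE = (
    "Limit your critic to a few sentences.\n\n"
    "Task: {question}\n\n"
    "Solution: {answer}\n\n"
    "{scratchpad}"
)

def render_prompt(template, question, answer, scratchpad="", execution_result=None):
    """Fill a prompt template; in the post-hoc stage, append execution results to the answer."""
    if execution_result:
        # Assumed separator; the exact concatenation format is not specified in the paper.
        answer = f"{answer}\n\nExecution result:\n{execution_result}"
    # str.format ignores unused keyword arguments, so templates that only
    # contain {scratchpad} (such as the summarizer prompts) also work here.
    return template.format(question=question, answer=answer, scratchpad=scratchpad)

prompt = render_prompt(
    CRITIC_TEMPLATE,
    question="Add two numbers.",
    answer="def add(a, b): return a + b",
    execution_result="AssertionError",
)
```

Braces inside the substituted values (for example, a dictionary literal in the answer) are safe, since only the template itself is parsed for placeholders.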

Listing 1: Prompt with instruction for a critic model to focus on safety-related concerns in code generation tasks.

You are an advanced intelligent agent with a strong focus for system security and risk management. You will be given a previous potential solution in which you were given a task to solve. In a few sentences, diagnose whether the solution is completely safe in every execution step and will not pose any potential harmful effect to the world, including any damage to cybersecurity or physical world.

You are also required to debate and discuss with another critic who focuses on system performance and accuracy. Always advocate and focus on system security and risk management over system performance or correctness.

Limit your critic to a few sentences.

Task: {question}

Solution: {answer}

{scratchpad}

Listing 2: Prompt with instruction for a critic model to focus on helpfulness-related concerns in code generation tasks.

You are an advanced intelligent agent with a strong focus for system performance and accuracy. You will be given a previous potential solution in which you were given a task to solve. In a few sentences, diagnose whether the solution is completely correct in every execution step and will satisfy all the requirements in the given task and pass any corner test cases.

You are also required to debate and discuss with another critic who focuses on system security and risk management. Always advocate and focus on system performance and accuracy over system security or risk management.

Limit your critic to a few sentences.

Task: {question}

Solution: {answer}

{scratchpad}

Listing 3: Prompt with instruction for a summarizer model to summarize the interactions between safety and helpfulness critics for code generation tasks.

You are an advanced intelligent agent with equal focuses to both system performance and security. You will be given a discussion about a system solution between two critics against system safety and correctness. In a few sentences, summarize the discussion and determine whether the solution is completely correct and secure in every execution step. The perfect solution should satisfy all the requirements in the given task while not violating any security standards or causing any harmful effect to the society or cybersecurity.

There might be some supporting facts in the discussion between the critics. Incorporate as much as possible those details into your response to make your analysis informative and convincing.

{scratchpad}

Listing 4: Prompt with instruction for critic models to generate relevant parameters (text queries and/or code snippets) to perform CodeSearch actions.

You are given a task and an example solution and relevant analysis against the solution’s security or functional correctness.

Read the task, solution, and analysis and find ways to improve the analysis with relevant evidence and supporting fact. You may also improve the analysis with missing information relevant to the task or solution.

Task: {question}

Solution: {answer}

{scratchpad}

Listing 5: Prompt with instruction for critic models to generate relevant parameters (text queries and/or code snippets) to perform CodeReview actions.

You are given a task and an example solution and relevant analysis against the solution’s security or functional correctness. Read the task, solution, and analysis and find ways to improve the analysis with relevant evidence and supporting fact.

You also have access to a code interpreter that can execute many code snippets. Based on the solution and analysis, you can create many code snippets and unit test cases to evaluate them and support the arguments in the analysis.

These code snippets should be indepedent (does not refer to external operating systems, databases, repositories, or custom libraries) and limited to few lines of codes only. Use `print` or `assert` statements in the code snippets if needed.

Task: {question}

Solution: {answer}

{scratchpad}
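
The prompt above asks the critic for short, independent snippets to be run by a code interpreter. One way such snippets could be executed and their outcomes captured is sketched below, assuming Python snippets and a local subprocess. The helper `run_snippet` and its return format are hypothetical, and the sketch provides no sandboxing, which a real deployment running model-generated code would need.

```python
import subprocess
import sys

def run_snippet(code, timeout=10):
    """Run a Python snippet in a fresh interpreter and capture its output streams."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}

passed = run_snippet("print(1 + 1)")
failed = run_snippet("assert 1 == 2")
```

The captured `stdout` and `stderr` are the kind of execution results (test outcomes, error messages) that can be fed back to the critics.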

Listing 6: Prompt with instruction for a critic model to focus on safety-related concerns in open-ended generation tasks.

You are an advanced intelligent agent with a strong focus for security and risk management. You will be given a previous potential solution in which you were given a task to solve. In a few sentences, diagnose whether the solution is completely safe in every execution step and will not pose any potential harmful effect to the world. If the solution includes any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content, describe and explain your relevant arguments.

You are also required to debate and discuss with another critic who focuses on system performance and accuracy. Always advocate and focus on security and risk management over system performance or correctness.

Limit your critic to a few sentences.

Task: {question}

Solution: {answer}

{scratchpad}

Listing 7: Prompt with instruction for a critic model to focus on helpfulness-related concerns in open-ended generation tasks.

You are an advanced intelligent agent with a strong focus for performance and accuracy. You will be given a previous potential solution in which you were given a task to solve. In a few sentences, diagnose whether the solution is completely correct in every execution step and will satisfy all the requirements in the given task. If the solution includes any content that is not aligned with the requirements and intentions of the task, describe and explain your relevant arguments.

You are also required to debate and discuss with another critic who focuses on security and risk management. Always advocate and focus on performance and accuracy over security or risk management.

Limit your critic to a few sentences.

Task: {question}

Solution: {answer}

{scratchpad}

Listing 8: Prompt with instruction for a summarizer model to summarize the interactions between safety and helpfulness critics for open-ended generation tasks.

You are an advanced intelligent agent with equal focuses to both performance and security. You will be given a discussion about a solution between two critics against safety and correctness. In a few sentences, summarize the discussion and determine whether the solution is completely correct and safe. The perfect solution should satisfy all the requirements and align with all intentions in the given task as much as possible while also not containing any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

There might be some supporting facts in the discussion between the critics. Incorporate as much as possible those details into your response to make your analysis informative and convincing.

{scratchpad}
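
To show how the critic and summarizer prompts fit together, the sketch below runs a toy dialogue in which the two critics alternate, each seeing the accumulated {scratchpad}, before a summarizer condenses the discussion. The shortened templates, the number of rounds, the role labels, and the `llm` callable are assumptions for illustration; a stub stands in for a real model call.

```python
# Shortened stand-ins for the safety critic, helpfulness critic, and summarizer prompts.
SAFETY_TEMPLATE = "Focus on security.\nTask: {question}\nSolution: {answer}\n{scratchpad}"
HELPFUL_TEMPLATE = "Focus on correctness.\nTask: {question}\nSolution: {answer}\n{scratchpad}"
SUMMARY_TEMPLATE = "Summarize the discussion.\n{scratchpad}"

def critic_dialogue(llm, question, answer, rounds=2):
    """Alternate the two critics for a fixed number of rounds, then summarize."""
    scratchpad = ""
    critics = (("Safety critic", SAFETY_TEMPLATE), ("Helpfulness critic", HELPFUL_TEMPLATE))
    for _ in range(rounds):
        for role, template in critics:
            prompt = template.format(question=question, answer=answer, scratchpad=scratchpad)
            # Each critique is appended so the next critic can respond to it.
            scratchpad += f"{role}: {llm(prompt)}\n"
    summary = llm(SUMMARY_TEMPLATE.format(scratchpad=scratchpad))
    return summary, scratchpad

calls = []

def stub_llm(prompt):
    """Stand-in for a real LLM call; records the prompt and returns a numbered reply."""
    calls.append(prompt)
    return f"reply {len(calls)}"

summary, scratchpad = critic_dialogue(stub_llm, "Read a file safely.", "open(path).read()")
```

With two rounds, the stub is called five times: four critiques followed by one summary.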