Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?):
A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah Boston University
saadu@bu.edu
   Mingji Han Boston University
mjhan@bu.edu
   Saurabh Pujar IBM Research
saurabh.pujar@
ibm.com
   Hammond Pearce UNSW Sydney
hammond.pearce@
unsw.edu.au
   Ayse Coskun Boston University
acoskun@bu.edu
   Gianluca Stringhini Boston University
gian@bu.edu
Abstract

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like ‘PaLM2’ and ‘GPT-4’: by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26%percent2626\%26 % and 17%percent1717\%17 % of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.

1 Introduction

Large Language Models (LLMs), such as OpenAI’s Codex [1], Google’s PaLM2 [2], Meta’s Codellama [3], and StarCoder [4], etc., have demonstrated great potential in performing programming-language related tasks such as code generation, code documentation , and debugging. In 2022, around 1.2 million developers used Copilot, and since then we have witnessed the release of increasingly capable LLM models at a quick pace [2, 5, 4, 3]. LLMs could be particularly useful to help developers with their cybersecurity needs, as humans typically produce and miss many security relevant bugs. This issue was highlighted in the 2022 GitLab Survey [6], noting that “developers do not find enough bugs early enough” and “do not prioritize the bug remediation” when developing. It is pertinent then to investigate if LLMs could be an aid towards early identification of security problems, especially as LLMs have already been suggested for use in automated bug repair [7].

In this paper, we aim to answer the following question: Can LLMs be used as helpful security assistants for vulnerability detection? This is an important question, especially as LLMs are not infallible in security-related tasks, for example introducing vulnerabilities into source code [8, 9] and software testing [10]. Unfortunately, there is no standardized and automated approach to evaluate the performance of LLMs at identifying vulnerable code. We fill this gap by introducing SecLLMHolmes, a generalized, fully automated, and scalable framework to systematically evaluate the performance (i.e., accuracy and reasoning capabilities) of LLMs for vulnerability detection. Our framework tests the capabilities of a given LLM as a security assistant across eight distinct dimensions: (1) deterministic response, (2) performance over range of parameters, (3) diversity of prompts, (4) faithful reasoning, (5) evaluation over variety of vulnerabilities, (6) assessment of various code difficulty levels, (7) robustness to code augmentations, and (8) use in real-world projects.

We apply our framework to eight of the most capable LLMs across 228 code scenarios spanning over 8 most critical vulnerabilities in C and Python, and show that: (a) LLM performance varies widely depending on the model and the prompting technique used, however all models analyzed have a high false positive rate (FPR), and flag code where vulnerabilities have been patched as still vulnerable. (b) the output of LLMs is non-deterministic, with all models changing their answers over multiple runs for one or more of our tests. (c) even when they correctly identify a vulnerability, the reasoning that LLMs provide for this decision is often incorrect, questioning their trustworthiness. (d) LLM chain-of-thought reasoning [11] is not robust, and can be ‘confused’ by even simple code augmentations such as whitespace modification, changing function names, or using different but related library functions. Also, (e) LLMs fail at detecting vulnerabilities in real-world projects. Our study provides significant evidence that LLMs are not yet ready to be used for automated vulnerability detection, and the successful usage of our framework as a benchmark suite by future models would demonstrate meaningful progress in this space.

This paper makes the following contributions:

  • We develop SecLLMHolmes, a comprehensive framework to test LLMs for their ability to identify and reason about software vulnerabilities. Our framework is fully automated and includes a set of 228 code scenarios, and 17 prompting techniques. We publicly release our framework and dataset111https://github.com/ai4cloudops/SecLLMHolmes, allowing the community to test newly developed LLMs and easily keep track of their progress in being able to identify vulnerabilities.

  • We use our framework to test eight state-of-the-art LLMs for the task of vulnerability detection, showing that as of today no LLM achieves satisfacory performance at it.

  • We identify and enumerate a set of shortcomings that current LLMs show (as outlined above). Our observations provide a checklist for researchers working in this space, showing aspects that need to be addressed before LLMs can be considered ready to be used in the wild for the task of vulnerability detection.

2 Background and Related Work

Large Language Models (LLMs). All language models work on the basic principle of next word (token) prediction; i.e., given a sequence of words (tokens) x1,x2,,xn1subscript𝑥1subscript𝑥2subscript𝑥𝑛1x_{1},x_{2},...,x_{n-1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT select a word (token) xnsubscript𝑥𝑛x_{n}italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT with the highest probability to appear next in the sequence

xn=argmaxwVP(w|x1,x2,,xn1),subscript𝑥𝑛subscript𝑤𝑉𝑃conditional𝑤subscript𝑥1subscript𝑥2subscript𝑥𝑛1x_{n}=\arg\max_{w\in V}P(w|x_{1},x_{2},\dots,x_{n-1}),italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_w ∈ italic_V end_POSTSUBSCRIPT italic_P ( italic_w | italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ) ,

where V𝑉Vitalic_V is the vocabulary of the model. Language models learn to perform this task by training on a large amount of text data (i.e., natural language text or code) and use various techniques (e.g., attention mechanism [12]) to learn to focus on certain parts of the input for better output prediction. Language models have shown excellent proficiency in NLP tasks, as well as good results for programming language tasks such as code generation, code suggestion, natural language querying for code, etc.

Refer to caption
Figure 1: LLM chat input format. LLMs operate on a three-part input format: (1) a system prompt, (2) few-shot examples presented as chat history to guide the model’s learning, and (3) the specific user input/task to be processed.

The recent drastic increase in the number of parameters of models has enabled several remarkable capabilities, the most prominent of which being zero-shot and few-shot learning [13, 14]. LLMs are typically prompted (i.e., queried) by the user and provide a response—these advances enable the prompt to provide new knowledge or instructions that the model was not trained over.

Approaches like instruction-tuning ‘teach’ LLMs how to follow instructions in their prompt responses, and reinforcement learning from human feedback is used to ‘teach’ them how to answer, converse, and reason like humans. This has led to the creation of several chat-based LLMs, which can interact conversationally with human inputs (see Figure 1). The chat-based LLMs can be prompted using various techniques:

  • In Zero-Shot (ZS) scenario, the user asks a model to perform a task that the model might not have observed during pre-training.

  • In Few-Shot (FS) scenario, the user adds a few examples demonstrating input space and expected output to perform a specific task (as shown in Figure 1).

  • Task-Oriented (TO) scenarios explicitly assign a task to the model (either in ‘system’ or ‘user’ prompt) in the form of a statement or a question, which encourages the model to generate a task-specific response.

  • Role-Oriented (RO) scenarios assign a role to the model, e.g., helpful assistant, security expert, etc., and the model implicitly understands the expected behavior. This role is mostly assigned in the ‘system’ prompt.

Table I shows the details of the currently most capable chat-based LLMs that we investigate in this paper.

TABLE I: Studied LLMs. We select a number of capable chat-based LLMs, both Remote and Local, for our study, with diverse ranges of number of parameters, max. input tokens limits, and different training knowledge cut-offs. Oct. 7, 2023 is the date of access for all LLMs.
Model API Base Model # Params Max. Tokens Type Knowledge Cut Off
gpt-4 GPT-4 1.76T 8,192 Rem 09/2021
gpt-3.5-turbo-16k GPT-3.5 175B 16,385 Rem 09/2021
codechat-bison@001 PaLM2 340B 6,144 Rem mid-2021
chat-bison@001 PaLM2 340B 6,144 Rem mid-2021
codellama-7b-instruct Llama2 7B 100k Loc 01/2022
codellama-13b-instruct Llama2 13B 100k Loc 01/2022
codellama-34b-instruct Llama2 34B 100k Loc 01/2022
starchat-beta StarCoder+ 15.5B 8,192 Loc 09/2022

Vulnerability Detection. One of the biggest challenges during the development and maintenance of software systems is the process of detecting security bugs and identifying their root cause [6]. OWASP [15] presents a list of static and run-time analysis tools and techniques that work in this space. Most of these tools transform source code into a specific representation, e.g., abstract syntax tree, program dependency graph, code property graph, etc., and scan them to identify pre-defined insecure patterns. For instance, Yamaguchi et al. introduce Joern, a tool that uses code property graphs [16] to identify several vulnerability patterns in C/C++ code.

Unfortunately, these techniques require considerable manual effort and curation of security bug datasets, especially if they are based on trained supervised Machine Learning (ML) models. Examples of methods that use supervised ML include VulChecker [17], VulDeePecker [18], Poster [19], and SySeVR [20]. As the ratio of vulnerable to non-vulnerable examples is very low in the real world, the vulnerable examples in these datasets are not sufficient for ML models to learn the necessary information. Suggested improvements in this category include using pre-trained LLMs for code and ‘fine-tuning’ them on security bug datasets for the downstream task of vulnerability detection. Notable approaches here are UniXCoder [21], VulBERTa [22], and CoText [23].

Evaluating for Code Security. Evaluating approaches for identifying security weaknesses in code requires testing their performance on multiple axes. For instance, what are their accuracy and false positive rate (FPR)? Are reasons/root causes for a vulnerability provided by the tool, and if so, at what quality? Is the tool robust to noise in testing data? How many types of vulnerabilities can it detect? Static analysis tools have always struggled with the trade-off between accuracy and coverage. Tools either focus on high accuracy and low coverage (e.g., Pysa [24] by Meta, which identifies data-flow issues in Python applications), or (b) low accuracy (false positives) and high coverage (e.g., [25, 26, 27]), which can lead to developer frustration and wasted time.

ML-based static analysis tools [28, 18] not only face the accuracy-coverage trade-off but also struggle with robustness. Such tools have demonstrated good performance on research datasets but can fail to generalize and perform well over real-world datasets [29]. Risse et al. [30] identify the same issue of non-robustness in LLM-based vulnerability detection tools [23, 22, 31], and showcase the drop in accuracy with even trivial code augmentations. Moreover, Microsoft’s leader-board for LLM-based vulnerability detection tools [32] shows that the even the best tool has an accuracy of less than 70%percent7070\%70 % and 30%+limit-frompercent3030\%+30 % + FPR, which shows that these systems cannot be trusted in real-world cases.

TABLE II: Evaluation studies for code security. This table describes if the evaluation is ( semi or fully) Automated, Evaluates Reasoning (root cause) or only performs binary evaluation , tests Code-level Robustness, evaluates both or just one of the Vulnerable and Patched code scenarios, evaluates on Real-World CVEs or github mined potential defect commits [33] , and # of Code Scenarios included in the study, which are not generated by synthetic AI methods or labeled using out-dated research tools.
Study Auto Eval Reas. Code Robust Vuln- Patch Real World # Scen.
Vuln. Code Gen. [8] 89
Vuln. Code Repair [7] syn.
Limits of ML [30] res.
Transf. Vuln. Det. [34] res.
SecLLMHolmes 228

Recent works have evaluated LLMs for vulnerable code generation [8], code repair [7], and vulnerability detection [34]. However, these approaches are either limited by the number of LLMs, coverage of vulnerabilities, diversity of prompts, range of code complexities, robustness testing, or require manual labor for evaluating LLMs. Most importantly, these studies only evaluate binary responses, checking whether the LLM gives the right label to code snippets (e.g., ‘vulnerable’ or ‘not vulnerable’). In this paper, we present the first comprehensive evaluation framework to evaluate LLMs on the task of vulnerability detection, providing a multi-faceted analysis of the capabilities of LLMs, including going beyond binary decisions and evaluating their reasoning abilities. Table II summarizes evaluation studies of LLMs, highlights their short-comings, and compares against our framework.

In addition to presenting a new state-of-the-art framework, our study uncovers previously unidentified challenges associated with LLMs, including their inability to understand complex code data-flows, the fragility of their COT or step-by-step reasoning to even trivial augmentations, a bias towards security-related functions and variable names that overlooks actual vulnerabilities, a reasoning process that diverges from the methodology of real-world human security experts, particularly in accurately identifying correct root cause of a vulnerability.

3 SecLLMHolmes

Figure 2 presents an overview of our fully automated framework, SecLLMHolmes, which is designed to be applicable to any chat-based LLM. Section 3.1 describes how users can configure an LLM for integration. The core of our framework consists of five pre-defined key components: (i) a set of parameters (Section 3.2), (ii) an extensive set of prompt templates (Section 3.3), (iii) datasets (Section 3.4), and (iv) code augmentations (Section V), all of which facilitate the generation of test prompts. Each prompt is then passed to the LLM to generate a response, and the quality of the response is then analyzed by the (v) ‘Evaluator’ module (Section 3.6).

Refer to caption
Figure 2: SecLLMHolmes Architecture.

3.1 LLM configuration

To scale SecLLMHolmes for any chat-based LLM, we have established three configurable user inputs:

LLM-Specific Best ‘Prompting Practices’ and Rules. As each LLM may be tuned differently with respect to instruction following, this configurable input enables users to configure optimal prompting practices for the specific LLM they are integrating. This may be done following the LLM’s documentation. For example, OpenAI’s GPT documentation recommends the use of three quotes before and after the given content to separate it from the instruction [35], while Google’s PaLM2 documentation recommends the use of keywords before the content that describe its semantics such as ‘Code,’ ‘Text,’ ‘Question’ [36].

LLM ‘Initialization and Configuration.’ All LLMs require an initial setup; for example, remote models may require an API key (e.g. OpenAI) or to initialize a session with a specific project in the cloud (e.g. Google). Local models require loading the model, tokenizer checkpoints, etc. This input allows user to specify such configuration details.

LLM ‘Chat Structure and Inference Function.’ To generate a response, each chat-based LLM receives three inputs: ‘system,’ ‘few-shot examples,’ and ‘task’ prompts (as shown in Figure 1). In this function, users specify how these three inputs are passed to the configured LLM (via API for remote models or ‘text-generation’ pipeline for local models).

3.2 LLM Parameters

LLM responses are significantly impacted by two parameters which control token sampling: (1) temperature controls the determinism of an LLM’s output. A higher value of temperature generates more ‘creative’/‘random’ text by adding noise to potential token scores (2) top_p controls the nucleus sampling, where the LLM considers the results of the tokens with top_p probability mass. We discuss how we select the values of these parameters in Section 4.

3.3 Prompt Templates

TABLE III: Prompt templates.
ID Type Description
S1 ZS-TO Code snippet is added to the input prompt with a question about a specific Common Weakness Enumeration (CWE) (e.g., out-of-bound write, path traversal).
S2 ZS-RO Same as S1, but the LLM is assigned the role of a ‘helpful assistant’.
S3 ZS-RO Similar to S1, with the LLM acting as a ‘security expert’.
S4 ZS-RO The LLM is defined as a ‘security expert’ who analyzes a specified security vulnerability, without the question being added to the input prompt.
S5 FS-TO Similar to S1, but includes a vulnerable example, its patch, and standard reasoning from the same CWE.
S6 FS-RO Like S4, but also includes a vulnerable example, its patch, and standard reasoning from the same CWE.
R1 ZS-TO Similar to S1, but begins with ”Lets think step by step” [37] to encourage a methodical approach.
R2 ZS-RO The LLM plays the role of a security expert with a multi-step approach to vulnerability detection, following a chain-of-thought reasoning.
R3 ZS-TO A multi-round conversation with the LLM, starting with a code snippet and progressively analyzing sub-components for a security vulnerability like human security-experts.
R4 FS-RO Similar to S6, but the reasoning for answers involves step-by-step analysis developed by the first author.
R5 FS-RO Like R2, but includes few-shot examples (from the same CWE) with step-by-step reasoning for detecting vulnerabilities.
R6 FS-TO Similar to R5, but does not assign a specific role to the LLM in the system prompt.
D1 ZS-TO Adds the definition of a security vulnerability to the input prompt, followed by a related question.
D2 ZS-RO The LLM is a security expert analyzing code for a specific vulnerability, with the vulnerability’s definition included.
D3 FS-RO Similar to S6, but includes the definition of the security vulnerability in the system prompt.
D4 FS-RO Like R4, with the addition of the security vulnerability’s definition in the system prompt.
D5 FS-TO Similar to D4, but does not assign a specific role to the LLM in the system prompt.

SecLLMHolmes explores four techniques for LLM prompting: (1) zero-shot task-oriented (ZS - TO), (2) zero-shot role-oriented (ZS - RO), (3) few-shot task-oriented (FS - TO), and (4) few-shot role-oriented (FS - RO). Moreover, we divide our set of prompts in the following three categories (all prompts are described in Table III):

Standard (S) prompting asks the model to directly give an answer to the problem.

Step-by-Step Reasoning-based (R) prompting asks the model to solve the problem in a step-by-step manner using chain-of-thought (COT) reasoning [11, 38, 37]. In addition to evaluating the intrinsic step-by-step reasoning process of LLMs, we also create prompts that emulate the multi-step vulnerability detection method followed by human security-experts, as identified in prior qualitative studies [39, 40]. These studies observed that security experts follow a general multi-step vulnerability detection approach, i.e., (1) get an overview of the code, (2) based on the overview, identify the critical sub-components that can lead to a security vulnerability in code, e.g., copying user provided information into a buffer, etc., (3) perform a detailed analysis of these sub-components, e.g., if a user input is being copied into a buffer the security experts will check if in any scenario the user input can overflow the buffer, and then, (4) based on the detailed experiments, provide the final answer on whether the given code contains any instances of the given security vulnerability.

Definition-based (D) prompting provides additional information like the definition of a security vulnerability from MITRE’s official website [41] to the model, while asking the model to detect that vulnerability in the given code.

3.4 Datasets

We design 228 code scenarios (48 hand-crafted, 30 real-world, and 150 with code augmentations) to test various aspects of the capabilities of LLMs to detect software vulnerabilities in code. We use these scenarios to craft prompts by including code, examples, definitions, and step-by-step reasoning as shown in Table III. In the following, we describe how we built our code scenarios in detail.

Hand-Crafted CWE Scenarios. We curate a dataset of 48 hand-crafted code scenarios, containing vulnerable and patched pairs from 8 most critical and diverse Common Weakness Enumerations (CWEs) from the MITRE Top 25 Most Dangerous Software Weaknesses for the year 2023 [41], as shown in Table IV. To investigate the ability of LLMs to analyze multiple programming languages, we include examples from both C and Python.

TABLE IV: Hand-crafted dataset.
CWE ID Description MITRE Rank Lang.
787 Out-of-bounds Write 1 C
79 Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’) 2 Py
89 Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’) 3 Py
416 Use After Free 4 C
22 Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’) 8 C
476 NULL Pointer Dereference 12 C
190 Integer Overflow or Wraparound 14 C
77 Improper Neutralization of Special Elements used in a Command (‘Command Injection’) 16 C

Similar to previous work [8], we create six code scenarios (three pairs of vulnerable and patched scenarios) for each CWE. Moreover, we design our code scenarios with three difficulty levels, (1) easy, (2) medium, (3) hard. Code scenario ‘2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT’ in a specific CWE represents the vulnerable scenario with ‘medium’ difficulty level and ‘2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT’ its patch. The difficulty levels assess how LLMs interact with code of increasing complexity. Easy scenarios consist of simple programs containing only one function and less than 30 lines of code. Medium level scenarios increase the complexity by making the program longer, using different library functions, and adding more than one user input. Hard level introduces scenarios with multiple functions in which functions can be safe on an individual level but when they work together they make the program vulnerable.222Appendix A shows examples of all difficulty levels for ‘CWE-22’.

Real-World CVE Senarios. We leverage a set of real-world Common Vulnerabilities and Exposures (CVEs) from public open source projects to investigate if LLMs are able to identify vulnerabilities in them. Note that existing benchmarks for vulnerability detection [33, 18, 42, 43], cannot be used for this project, as they were released before the cut-off training date of current LLMs, and it is therefore likely that the models saw that data during training. To avoid this potential confounder, we curate 30 code scenarios containing vulnerable and patched versions of 15 CVEs from four open source projects, all published and fixed in 2023, after current LLMs were trained (see Table V).

TABLE V: Real-world CVEs and their details including Original and Truncated Lines of Code (LoC).
Project CVE ID CWE Description (ID) Orig. LoC Trun. LoC Pub. Date (2023) Fix Date (2023)
gpac 2023-1452 Out-of-Bound Write (787) 4.5k 243 Mar May
2023-3012 NULL Pointer Deref (476) 2.5k 398 May Nov
2023-23143 Out-of-Bound Write (787) 12.3k 117 Jan May
2023-23144 Integer Overflow (190) 439 389 Jan May
libtiff 2023-2908 NULL Pointer Deref (476) 2.3k 629 Jan Nov
2023-3316 NULL Pointer Deref (476) 159 159 Jan Nov
2023-26966 Out-of-Bound Write (787) 1.8k 238 Jun Nov
2023-40745 Integer Overflow (190) 2.2k 757 Oct Nov
2023-41175 Integer Overflow (190) 779 748 Oct Nov
linux 2023-40283 Use-After-Free (416) 1.9k 515 Aug Nov
2023-42753 Integer Overflow (190) 628 623 Sept Nov
2023-42754 NULL Pointer Deref (476) 3.7k 177 Oct Nov
2023-45863 Out-of-Bound Write (787) 1.1k 565 Oct Nov
2023-45871 Out-of-Bound Write (787) 10.1k 386 Oct Nov
pjsip 2023-27585 Out-of-Bound Write (787) 784 737 Mar Aug

As the length of the code increases significantly for the real-world code scenarios, it can exceed the maximum number of tokens a model can take as an input. To solve this problem, we shorten code files by removing comments and functions that are neither called by nor call the vulnerable (or patched) function. Also to maintain fairness among the LLMs, we make sure that all CVEs after truncation have a number of tokens less than or equal to 6,144, which is the maximum token limit supported by Palm2, and the lowest among all LLMs in our study.

Code Augmentations. While standard frameworks exist to evaluate the robustness of LLMs for NLP tasks [44, 45], there is no standard framework to evaluate the robustness on code security related tasks. To fill this gap, we design a set of 150 augmented code scenarios, meticulously crafted and reviewed to preserve the ability of human security experts to identify vulnerabilities. These augmentations are organized in two distinct categories:

  1. 1.

    Trivial Augmentations. This class of code augmentations measure the robustness of LLMs to random noise. We select 12 CWE scenarios from two classes, i.e., CWE-787 (C) (#1 MITRE) and CWE-89 (Py) (#3 MITRE) of our hand-crafted dataset, and apply seven trivial augmentations (Table VI) on them and create a total of 84 different code scenarios (12 per augmentation). We choose these two CWEs as (1) they can lead to the most catastrophic impacts like root privilege escalation, and data loss, etc, (2) they belong to two completely different levels of abstractions i.e., “lower-level” (C) and “higher-level” (Python) languages, and (3) the presence of their instances can be determined directly from code without any need for external information.

  2. 2.

    Non-Trivial Augmentations. We design non-trivial code augmentations to perform stress-tests on LLMs to measure their robustness and bias towards semantics of function or variable names, specific library functions, or code security practices. We use combinations of all CWEs defined in Table IV and the six non-trivial code augmentations from Table VI and design 66 code scenarios (12 per NT1-NT4 and 9 per NT5 and NT6) 333See Appendix LABEL:sec:app-robust for details on creation of non-trivial augmented code scenarios.

TABLE VI: Code Augmentations.
Trivial Non-Trivial
ID Description ID Description
T1 Rename function parameters randomly NT1 Change variable names to vulnerability related keywords
T2 Rename function randomly NT2 Change the name of a safe function to ‘vulnerable’ function
T3 Add random unreachable code NT3 Change the name of an unsafe function to ‘non_vulnerable’ function
T4 Add random code in comments NT4 Add a potentially dangerous library function (e.g., ‘strcpy’ or ‘strcat’) but use it in a safe way
T5 Insert whitespaces NT5 Use sanitizing functions (e.g., ‘realpath’) in vulnerable code but in a way that it does not resolve the vulnerability
T6 Add a useless function NT6 Add hash-defined expressions for safe functions names (e.g., ‘fgets’) but add vulnerable library functions in its body (e.g., ‘gets’)
T7 Add next-line character

3.5 Ground-Truth Reasoning Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT

In addition to ground truth labels indicating if a code snippet contains a vulnerability, we also need explanations for these vulnerabilities, as we aim to evaluate the reasoning capabilities of LLMs and assess whether they can justify their decisions. To this end, we randomly sample 48 code scenarios out of the 228 total scenarios, and have three security experts, including the first author of the paper equally divide the sampled code scenarios amongst each other and create a 100-word ground truth reason for each scenario using MITRE’s official CWE documentation [41] as a guide. The experts then compare and discuss each others’ reasoning and develop consensus for each ground-truth reasoning (Fleiss’ kappa with K=0.93𝐾0.93K=0.93italic_K = 0.93, meaning almost perfect agreement [46]). After establishing that the criteria for ground truth reasoning are well laid out and understood, the first author proceeds to develop the remaining 180 vulnerability explanations. This ground truth reasoning Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT is then used by the ‘Evaluator’ module in the next step.

3.6 Evaluator

The output of an LLM for a specific test is passed to the Evaluator (see Figure 2). As SecLLMHolmes is fully automated, we leverage GPT-4 to analyze the response. First, the 1 response is passed to GPT-4 444Our manual evaluation shows that GPT-4 performs the best for this extraction process., with an additional role-based instruction prompt Pesubscript𝑃𝑒P_{e}italic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT (shown in Figure 7 in the Appendix) in the ‘system’ input to extract two pieces of information from the raw response. The first one is the 2 binary answer, which is “yes/no” based on whether the LLM found a vulnerability in the given code or not. We find that in some cases the LLM does not provide a definite answer, therefore we include a third verdict, i.e., “n/a.”555For example, in some cases LLMs provide some variation of the following response: “As an AI model I cannot answer this question.” The second part of the information extracted is the 3 textual reasoning provided by the LLM (Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT). To extract it, we ask GPT-4 to summarize the reason described by the model output on why a vulnerability is present in the code or not, in 100 words. We extract the summary of root cause of vulnerability from raw responses of LLMs to maintain consistency for further evaluation methods, and to avoid contents like suggestions, fixed code, etc. to be analyzed. In the rest of this section we provide more details on the evaluation metrics used to 4 evaluate the final answer, and 5 its reasoning provided by LLM for the vulnerability detection task.

Accuracy Score. To evaluate the accurate detection of a vulnerability in source code, we compare the answer extracted from the response of the LLM (i.e., “yes/no/n/a”) with the ground truth labels and use the “accuracy” metric, i.e., if LLM’s answer (binary) matches the ground-truth or not, to assess the correctness of the final answer.

Reasoning Score. Automatically evaluating if the textual reasoning provided by the LLM on whether a vulnerability is present is a challenging task. To solve this task we use GPT-4 to summarize the reason (Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT) provided by an LLM and compare it with the ground-truth reasoning (Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT) generated by the authors, using the following three metrics:

  1. 1.

    Rouge [47] score is traditionally used in NLP to measure the similarity between a machine-translated summary and reference summaries using overlapping n-grams. In our case, we use it to measure similarity between the summaries Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT. We first sample 50505050 pairs of Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and their corresponding Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, as our reasoning score validation set (Rvalsubscript𝑅𝑣𝑎𝑙R_{val}italic_R start_POSTSUBSCRIPT italic_v italic_a italic_l end_POSTSUBSCRIPT), and manually check the consistency and alignment of their reasonings. We find that at the optimal threshold Rougethres𝑅𝑜𝑢𝑔subscript𝑒𝑡𝑟𝑒𝑠Rouge_{thres}italic_R italic_o italic_u italic_g italic_e start_POSTSUBSCRIPT italic_t italic_h italic_r italic_e italic_s end_POSTSUBSCRIPT of 0.340.340.340.34, 43434343 out of 50505050 Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT are consistent with Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT. We therefore use this threshold and mark two summaries as similar if their Rouge score exceeds it.

  2. 2.

    Cosine Similarity is a metric commonly used in NLP to measure how similar two documents are irrespective of their sizes. We first convert the summaries Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT into fixed length vectors using OpenAI’s embedding model ‘text-similarity-davinci-001’ and calculate the cosine similarity between them. Similar to Rougethres𝑅𝑜𝑢𝑔subscript𝑒𝑡𝑟𝑒𝑠Rouge_{thres}italic_R italic_o italic_u italic_g italic_e start_POSTSUBSCRIPT italic_t italic_h italic_r italic_e italic_s end_POSTSUBSCRIPT, we find that the optimal threshold Costhres𝐶𝑜subscript𝑠𝑡𝑟𝑒𝑠Cos_{thres}italic_C italic_o italic_s start_POSTSUBSCRIPT italic_t italic_h italic_r italic_e italic_s end_POSTSUBSCRIPT is 0.840.840.840.84, and consider two summaries similar if their cosine similarity exceeds this value.

  3. 3.

    GPT-4 is prompted to evaluate if the reasoning in Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT align. If the reasonings are similar and they align with each other, GPT-4 responds ‘yes’ and we assign a reasoning score of 1111, otherwise we assign 00. We find that GPT-4 successfully classifies 48484848 out of 50505050 Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT correctly to their corresponding Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, when validated on Rvalsubscript𝑅𝑣𝑎𝑙R_{val}italic_R start_POSTSUBSCRIPT italic_v italic_a italic_l end_POSTSUBSCRIPT.

We then determine whether the reasoning by the LLM is correct or not by majority vote. That is, if two or more of the above criteria match, we consider the reasoning as similar to the ground truth reasoning Grsubscript𝐺𝑟G_{r}italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT.

4 Experimental Investigation

In this section, we use our framework to investigate all LLMs listed in Table I.666We only report the results for the five best performing LLMs in the main body of the paper. The ones for the remaining three LLMs can be found in the Appendix. We first investigate which values of the LLM parameters are most likely to produce consistent (Section 4.1) and best performing (Section 4.2) output. We then perform the rest of our investigations using the most suitable parameter values.

4.1 Evaluation for Deterministic Responses

To perform a rigorous comparison between LLMs and assess their capabilities, it is of critical importance that their responses are consistent, meaning that running the same test multiple times under identical parameters should provide the same final verdict. Therefore, we first investigate whether this consistency is achievable at all, and what LLM parameters deliver the most consistent results. OpenAI’s documentation [48] recommends a temperature of 0.20.20.20.2 and ‘top_p’ of 0.10.10.10.1 to achieve the most deterministic output for code related tasks. Similarly, the recommended ‘temperature’ value for all LLMs in our evaluation is 0.20.20.20.2. When experimenting with modifying these values, both the OpenAI documentation [49] and previous research [7] recommend to keep the value of ‘top_p’ constant and modify the value of ‘temperature.’ We therefore fix ‘top_p’ to default value specific to an LLM, and perform experiments using two different ‘temperature’ values: 0.20.20.20.2 (‘default’) and 0.00.00.00.0. We perform this experiment on two vulnerable and two patched medium code difficulty level scenarios (2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT and 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT) from two distinct vulnerabilities, “out-of-bound write” (CWE-787) in C and “SQL injection” (CWE-89) in Python (for the same reasons as discussed in Section V). For the input prompts we select the set of Standard prompts (see Section 3.3). We run each experiment ten times, and record how many times the model provides the same answer. We consider a model to be consistent if it always provides the same binary answer, irregardless of whether it is correct.

TABLE VII: Evaluation Results for LLM Output Consistency at Recommended Temperature. The table shows results for each CWE scenario and every Standard prompt, in the format of # correctly answered / # total answered out of 10.
(a) CWE-787
S1 S2 S3 S4 S5 S6
Models 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT
chat-bison 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 2/10 8/10 10/10 0/10
codechat-bison 9/10 0/10 10/10 0/10 10/10 0/10 0/10 9/10 0/10 10/10 0/10 10/10
codellama34b 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 5/10 7/10
gpt-3.5 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 10/10 0/10
gpt-4 0/10 10/10 6/10 10/10 1/10 10/10 6/10 10/10 0/10 10/10 7/10 10/10
(b) CWE-89
S1 S2 S3 S4 S5 S6
Models 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT
chat-bison 10/10 10/10 10/10 8/10 10/10 10/10 10/10 0/10 10/10 10/10 0/10 10/10
codechat-bison 10/10 10/10 10/10 2/10 10/10 7/10 10/10 0/10 10/10 10/10 10/10 10/10
codellama34b 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 10/10 10/10 0/10
gpt-3.5 10/10 10/10 10/10 10/10 10/10 10/10 10/10 1/10 10/10 10/10 10/10 7/10
gpt-4 10/10 10/10 10/10 10/10 10/10 0/10 10/10 10/10 10/10 10/10 10/10 5/10
TABLE VIII: Evaluation Results for LLM Output Consistency at Temperature = 0.00.00.00.0.
(a) CWE-787
S1 S2 S3 S4 S5 S6
Models 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT
chat-bison 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 0/10 10/10 9/10 0/10
codechat-bison 10/10 0/10 10/10 0/10 10/10 0/10 0/10 10/10 0/10 10/10 0/10 10/10
codellama34b 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 0/10 10/10
gpt-3.5 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 10/10 0/10
gpt-4 0/10 10/10 8/10 10/10 2/10 10/10 4/10 9/10 0/10 10/10 4/10 10/10
(b) CWE-89
S1 S2 S3 S4 S5 S6
Models 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT 2psubscript2𝑝2_{p}2 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT
chat-bison 10/10 10/10 10/10 0/10 10/10 10/10 10/10 0/10 10/10 10/10 10/10 10/10
codechat-bison 10/10 10/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 10/10 10/10 0/10
codellama34b 10/10 0/10 10/10 0/10 10/10 0/10 10/10 0/10 10/10 10/10 10/10 0/10
gpt-3.5 10/10 10/10 10/10 10/10 10/10 10/10 10/10 0/10 10/10 10/10 10/10 10/10
gpt-4 10/10 10/10 10/10 10/10 10/10 0/10 10/10 10/10 10/10 10/10 10/10 9/10

Observations. Table VII(b) shows that all LLMs provide inconsistent responses for one or more of the tests at the recommended ‘temperature’ value of 0.20.20.20.2. ‘codechat-bison@001’ even provides a wrong answer with the most basic ‘S1’ prompt (as shown in Figure LABEL:fig:codechat-bison-inconsistent). This suggests that the default ‘temperature’ is not a good choice to evaluate LLMs for vulnerability detection. Using 0.00.00.00.0 as temperature improves consistency, as shown in Table VIII(b): ‘codechat-bison@001,’ ‘codellama34b,’ and ‘gpt-3.5-turbo-16k’ provide consistent responses for all tests at this temperature. However, two LLMs (‘chat-bison@001’ and ‘gpt-4’) still provide inconsistent results. Based on these results, we find that 0.00.00.00.0 is the best ‘temperature’ value to get consistent responses from an LLM, although we note that even at this setting some LLMs fail in delivering consistent responses.

1...
2int MAX_SIZE = 32;
3char* encode_input(char *user_supplied_string){
4 int i, dst_index;
5 char* dst_buf = (char*)malloc((4 * sizeof(char) * MAX_SIZE) + 1);
6 if ( MAX_SIZE <= strlen(user_supplied_string) )
7 exit(1);
8 dst_index = 0;
9 for ( i = 0; i < strlen(user_supplied_string); i++ ){
10 if(’&’ == user_supplied_string[i] ){
11 dst_buf[dst_index++] = ’&’;
12 dst_buf[dst_index++] = ’a’;
13 dst_buf[dst_index++] = ’m’;
14 dst_buf[dst_index++] = ’p’;
15 dst_buf[dst_index++] = ’;’;
16 } else if (’<’ == user_supplied_string[i] ){
17 /* replace with &lt; */
18 } else if (’>’ == user_supplied_string[i] ){
19 /* replace with &lt; */
20 } else dst_buf[dst_index++] = user_supplied_string[i];
21 }
22 dst_buf[dst_index] = ’\0’;
23 return dst_buf;
24}\end{lstlisting}
25 \caption{CWE-787 (Out-of-Bound Write) $2_v$: This code scenario encodes certain characters in the user input string. The program assumes that encoding expansion will only expand a given character by a factor of 4, however the ampersand encoding expands by 5. If the attacker provides a string of many ampersands, the string will over flow the destination buffer.}
26 \label{fig:cwe-exp}
27\end{subfigure}
28
29\vspace{0.3cm}
30
31\begin{subfigure}[t]{1.05\linewidth}
32 \centering
33 \includegraphics[width=\linewidth]{images/deter-out.pdf}
34 \caption{‘codechat-bison@001 responses.}
35 \label{fig:output-inconsistency}
36\end{subfigure}
37
38\caption{Non-deterministic and inconsistent responses by \textbf{‘codechat-bison@001’}.}
39\label{fig:codechat-bison-inconsistent}
40\end{figure}

4.2 Performance Over Range of Parameters

While lower temperatures increase the consistency in results, setting a higher temperature is supposed to increase the creativity in LLMs. In this section, we aim to investigate whether increasing the temperature for LLMs increases their performance in identifying vulnerabilities, both with respect to their accuracy and the correctness of their reasoning.

For this experiment, we select the same two classes of security weaknesses used in the previous section, and select two vulnerable and two patched code scenarios from these classes. We choose scenarios with the highest code difficulty level (3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT and 3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT) as they would need more creativity to be correctly solved by LLMs.

As running experiments on all prompts is not feasible due to budget constraints, we perform the experiment on only one prompt, S4 (ZS-RO), based on three reasons; (1) this prompt does not provide any additional information like definition or step-by-step reasoning instructions to the LLM, so the response will be mainly based on the intrinsic knowledge of the model from its training data and no external instruction or information will influence its reasoning or creativity at a higher temperature, (2) being a zero-shot prompt it does not limit the creativity or reasoning of the model as in few-shot prompting might influence the model to mimic the few-shot examples, and (3) this prompt is the most non-deterministic zero-shot prompt as shown in Tables VIII(b) and VII(b), allowing it to show more randomness or creativity at higher temperature. We evaluate LLMs on six temperature values: their recommended value 0.20.20.20.2, and over a range of values from 0 to 1 i.e., 00, 0.250.250.250.25, 0.50.50.50.5, 0.750.750.750.75, 1.01.01.01.0. We run each experiment ten times, and show how many times an LLM provides a correct answer (i.e., accuracy) and correct reasoning (i.e., reasoning score). The results are summarized in Tables IX(d) and X(d).

TABLE IX: Evaluation of LLMs Over a Range of Temperature Values (CWE-787). The table shows results for each temperature value in the format of # correct / # total answered out of 10.
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 2/10 10/10 0/10 4/10 4/10 4/10
codechat-bison 0/10 10/10 2/10 1/10 2/10 4/10
codellama34b 10/10 10/10 10/10 10/10 10/10 9/10
gpt-3.5 0/10 0/10 2/10 5/10 5/10 8/10
gpt-4 10/10 10/10 10/10 10/10 10/10 10/10
(a) Accuracy (3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 2/3 10/10 0/0 4/5 4/5 4/6
codechat-bison 0/1 10/10 2/2 1/2 2/3 4/4
codellama34b 10/10 10/10 10/10 10/10 10/10 9/10
gpt-3.5 0/10 0/10 2/10 5/10 6/10 8/10
gpt-4 10/10 10/10 10/10 10/10 10/10 10/10
(b) Reason (3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 6/10 0/10 6/10 10/10 5/10 7/10
codechat-bison 9/10 0/10 5/10 8/10 4/10 4/10
codellama34b 0/10 0/10 0/10 0/10 1/10 1/10
gpt-3.5 9/10 10/10 9/10 9/10 6/10 6/10
gpt-4 0/10 0/10 0/10 0/10 0/10 0/10
(c) Accuracy (3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 0/4 0/10 0/4 0/0 0/5 1/4
codechat-bison 0/1 0/10 0/5 0/2 1/7 2/8
codellama34b 0/10 0/10 0/10 1/10 0/10 2/10
gpt-3.5 4/10 3/10 4/10 3/10 2/10 2/10
gpt-4 0/10 0/10 0/10 0/10 0/10 0/10
(d) Reason (3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT)
TABLE X: Evaluation of LLMs Over a Range of Temperature Values (CWE-89).
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 10/10 10/10 10/10 10/10 6/10 7/10
codechat-bison 10/10 10/10 9/10 10/10 7/10 9/10
codellama34b 10/10 10/10 10/10 10/10 10/10 10/10
gpt-3.5 10/10 10/10 10/10 10/10 10/10 10/10
gpt-4 10/10 10/10 10/10 10/10 10/10 10/10
(a) Accuracy (3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 10/10 10/10 10/10 10/10 6/9 7/9
codechat-bison 10/10 10/10 9/10 10/10 8/9 9/10
codellama34b 10/10 10/10 10/10 10/10 10/10 10/10
gpt-3.5 10/10 10/10 10/10 10/10 10/10 10/10
gpt-4 10/10 10/10 10/10 10/10 10/10 10/10
(b) Reason (3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 0/10 0/10 1/10 0/10 5/10 7/10
codechat-bison 0/10 0/10 0/10 1/10 2/10 1/10
codellama34b 0/10 0/10 0/10 0/10 0/10 0/10
gpt-3.5 0/10 0/10 0/10 0/10 0/10 0/10
gpt-4 0/10 0/10 0/10 0/10 0/10 0/10
(c) Accuracy (3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT)
Model Rec 0.0 0.25 0.5 0.75 1.0
chat-bison 1/10 0/10 1/10 0/10 2/8 6/10
codechat-bison 0/10 0/10 0/10 0/10 1/9 1/10
codellama34b 0/10 0/10 1/10 0/10 2/10 1/10
gpt-3.5 0/10 0/10 0/10 0/10 0/10 0/10
gpt-4 0/10 1/10 1/10 1/10 1/10 1/10
(d) Reason (3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT)

Observations. Our results do not show a general trend of better performance with the increase in model temperature. Since increasing the temperature does not present a general improvement of results across our models, to prioritize result consistency we elected to use 0.00.00.00.0 as the ‘temperature’ value for the remaining of our experiments, and set ‘top_p’ to LLM specific default value.

4.3 Diversity of Prompts

In this part of investigation, we test the LLMs on their ability to detect vulnerabilities in the 48 hand-crafted code scenarios described in Section 3.4, by using the 17 prompts ranging over three categories and four prompting techniques, as described in Section 3.3. This experiment allows us to evaluate the capabilities of LLMs over a wide input spectrum and answer questions like (1) what kind of prompting techniques work best for the model, (2) whether emulating the multi-step reasoning process followed by human security experts improves performance, and (3) whether providing extra information or examples helps LLMs in decision making?. Table XI shows the results of this experiment based on the following three metrics:

(1) Response Rate: Measures how often the model provides an answer to a given input at all. E.g., for prompts ‘S5’ and ‘S6,’ ‘codechat-bison@001’ provides answers to 36363636 out of 48484848 inputs and for the rest it responds “I’m not able to help with that, as I’m only a language model. If you believe this is an error, please send us your feedback.”

ResponseRate=#InputsAnsweredTotalInputs𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑅𝑎𝑡𝑒#𝐼𝑛𝑝𝑢𝑡𝑠𝐴𝑛𝑠𝑤𝑒𝑟𝑒𝑑𝑇𝑜𝑡𝑎𝑙𝐼𝑛𝑝𝑢𝑡𝑠ResponseRate=\frac{\#InputsAnswered}{TotalInputs}italic_R italic_e italic_s italic_p italic_o italic_n italic_s italic_e italic_R italic_a italic_t italic_e = divide start_ARG # italic_I italic_n italic_p italic_u italic_t italic_s italic_A italic_n italic_s italic_w italic_e italic_r italic_e italic_d end_ARG start_ARG italic_T italic_o italic_t italic_a italic_l italic_I italic_n italic_p italic_u italic_t italic_s end_ARG

(2) Accuracy Rate: Measures the correctness of the model’s response, regardless of the provided reasoning. E.g., for prompt ‘D2,’ ‘codechat-bison@001’ provides correct answers to 24242424 inputs out of the 48484848 answered inputs.

AccuracyRate=#CorrectAnswers#InputsAnswered𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑅𝑎𝑡𝑒#𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝐴𝑛𝑠𝑤𝑒𝑟𝑠#𝐼𝑛𝑝𝑢𝑡𝑠𝐴𝑛𝑠𝑤𝑒𝑟𝑒𝑑AccuracyRate=\frac{\#CorrectAnswers}{\#InputsAnswered}italic_A italic_c italic_c italic_u italic_r italic_a italic_c italic_y italic_R italic_a italic_t italic_e = divide start_ARG # italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_A italic_n italic_s italic_w italic_e italic_r italic_s end_ARG start_ARG # italic_I italic_n italic_p italic_u italic_t italic_s italic_A italic_n italic_s italic_w italic_e italic_r italic_e italic_d end_ARG

(3) Correct Reasoning Rate (CRR): Evaluates how often the model’s correct answers also have the correct reasoning. E.g., for prompt ‘D2,’ ‘codechat-bison@001’ provides reasoning for 15151515 answers out of the 24242424 correct answers and out of those 15151515 reasonings 14141414 are correct.

CRR=#CorrectAnswerswithCorrectReasoning#ReasoningswithCorrectAnswers𝐶𝑅𝑅#𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝐴𝑛𝑠𝑤𝑒𝑟𝑠𝑤𝑖𝑡𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝑅𝑒𝑎𝑠𝑜𝑛𝑖𝑛𝑔#𝑅𝑒𝑎𝑠𝑜𝑛𝑖𝑛𝑔𝑠𝑤𝑖𝑡𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝐴𝑛𝑠𝑤𝑒𝑟𝑠CRR=\frac{\#CorrectAnswerswithCorrectReasoning}{\#ReasoningswithCorrectAnswers}italic_C italic_R italic_R = divide start_ARG # italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_A italic_n italic_s italic_w italic_e italic_r italic_s italic_w italic_i italic_t italic_h italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_R italic_e italic_a italic_s italic_o italic_n italic_i italic_n italic_g end_ARG start_ARG # italic_R italic_e italic_a italic_s italic_o italic_n italic_i italic_n italic_g italic_s italic_w italic_i italic_t italic_h italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_A italic_n italic_s italic_w italic_e italic_r italic_s end_ARG

Best Prompts: Based on the above three metrics, we choose the best overall prompts for each model from four categories of prompts (described in Section 3.3) i.e., ZS - TO, ZS - RO, FS - TO, and FS - RO. We calculate a Scoreprompt𝑆𝑐𝑜𝑟subscript𝑒𝑝𝑟𝑜𝑚𝑝𝑡Score_{prompt}italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT, which is the weighted sum of the three metrics where each metric is assigned an equal weight of 0.330.330.330.33. We then select the best prompt from each of the above defined four categories as the one that maximizes Scoreprompt𝑆𝑐𝑜𝑟subscript𝑒𝑝𝑟𝑜𝑚𝑝𝑡Score_{prompt}italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT, as shown in Table XI.

TABLE XI: Evaluation of five LLMs for detecting vulnerabilities across 48 hand-crafted code scenarios, over a range of prompting techniques. The green and red bars represent the count of scenarios with correct and incorrect responses for each LLM (i.e., AccuracyRate𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑅𝑎𝑡𝑒AccuracyRateitalic_A italic_c italic_c italic_u italic_r italic_a italic_c italic_y italic_R italic_a italic_t italic_e). A white circle marks scenarios with both correct answers and reasoning (CRR𝐶𝑅𝑅CRRitalic_C italic_R italic_R). Additionally, we highlight top-performing prompts for each technique: ZS-TO ( ), ZS-RO ( ), FS-TO ( ), and FS-RO ( ). The overall best prompt is shown with a red box.

[Uncaptioned image]

1...
2void read_file(char* file_path) {
3 char* resolved_path = realpath(file_path, NULL);
4 if (resolved_path == NULL) {
5 printf("Error: Invalid Path\n");
6 return;
7 }
8 char* dir = "/Users/user1/";
9 /* safely creates full_path by concatenating file_path with dir */
10 FILE *fp;
11 char *data = malloc(256);
12 fp = fopen(full_path, "r");
13 while (fgets(data, 256, fp) != NULL)
14 { /* processes file */ }
15 ...
16}\end{lstlisting}
17 \caption{CWE-476 (NULL-Pointer Derefence) $1_v$: This code scenario sanitizes the user provided file path, opens file at that file path, and processes it. The code does not check for the NULL value of fopen in case an error is raised.}
18 \label{fig:cwe-476-exp}
19\end{subfigure}
20
21\vspace{0.3cm}
22
23\begin{subfigure}[t]{1.05\linewidth}
24 \centering
25 \includegraphics[width=\linewidth]{images/s1-r2.pdf}
26 \caption{GPT-3.5 Responses.}
27 \label{fig:s1-r2}
28\end{subfigure}
29
30\caption{‘GPT-3.5 responses to standard ‘S1’ and security-expert like multi-step reasoning R2 for CWE-476 $1_v$ code scenario.}
31\label{fig:gpt-3.5-s1-r2}
32\end{figure}
33
34\descr{Observations.} ‘gpt-4’ performs the best among the tested LLMs, with a maximum accuracy of 89.5\%. There is no prompt for which all LLMs perform consistently better, but instead they show different success for different types of prompts.
35GPT models and codechat-bison perform better when prompted to follow a human-like step-by-step reasoning process (as shown in Figure \ref{fig:gpt-3.5-s1-r2}) , i.e., R4, R6, and R2 prompts, respectively.
36‘chat-bison’ performs best when assigned a security expert role, while ‘codellama34b’ works best with the S1 prompt, which simply asks if the code snippet contains a certain vulnerability.
37While gpt-4 and ‘codellama34b’ show an increase in accuracy when provided with a vulnerability definition, compared to standard prompts, the same trend is not found in the other LLMs.

4.4 Faithful Reasoning

Faithful reasoning is the quality of an LLM to provide the right reasoning for the right answer or vice versa [50]. The more faithful an LLM’s reasoning is to its final answer, the more a user can trust its response. Table XI shows that even when they provide the right response, LLMs sometimes provide the wrong reason for this decision. In this section, we further analyze the faithful reasoning of LLMs on their decisions from the experiment discussed in the previous section, focusing on five aspects: (1) for how many cases does the LLM provide a reasoning for the presence of a vulnerability in a code snippet at all, (2) for how many correct answers the LLM also provides a correct reasoning, (3) for how many correct answers the LLM provides an incorrect reasoning, (4) for how many incorrect answers the LLM provides an incorrect and (5) for how many answers does the LLM provide the wrong answer, but a correct reasoning. Table XII provides an overview of the results of this experiment.

TABLE XII: Faithfulness of LLMs. The Table shows the Reason Rate i.e., # scenarios for which LLM provides reasoning / # total scenarios answered by LLM (out of total 816 scenarios). Then it displays # of scenarios with correct answer and correct reasoning ( ), # correct answer but incorrect reasoning ( ), # incorrect answer and incorrect reasoning ( ), and # incorrect answer but correct reasoning ( ).

[Uncaptioned image]

Observations. While in the vast majority of cases the answer and reasoning for the tested LLMs align, every LLM presents cases where it provides the correct reasoning but this leads to a wrong answer (as shown in Figure 4). Similarly, we find cases where an LLM provides the right answer but its reasoning or root cause is not correct (as shown under ‘codechat-bison (1st Response)’ in Figure LABEL:fig:output-inconsistency). We also find that Google’s PaLM2 models have overall lower reasoning rate as they do not explain their decisions in many cases, while GPT models show comparatively higher reasoning rates, especially ‘gpt-4’ and ‘codellama34b’ provide a reason for every answer. Our findings suggest that, in certain cases, current LLMs’ responses might not fully rely on faithful and accurate reasoning.

Refer to caption
Figure 4: ‘chat-bison@001’ (PaLM2) response for CWE-77 3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT scenario (see Appendix Figure LABEL:fig:cwe-77-3) using prompt ‘D3’ shows unfaithfulness between provided reasoning and final answer.

4.5 Evaluation Over Variety of Vulnerabilities

In this section, we focus on analyzing LLMs ability to correctly identify both vulnerable and patched code for different types of vulnerabilities, based on the eight CWEs that we used to build our hand-crafted dataset. Similar to Section 4.3, we find and use the best performing prompts for each CWE using Scorecwe𝑆𝑐𝑜𝑟subscript𝑒𝑐𝑤𝑒Score_{cwe}italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_c italic_w italic_e end_POSTSUBSCRIPT, with equal weight to all factors, from four prompting categories. The results are summarized in Table XIII.

TABLE XIII: Evaluation of LLMs over a wide range of eight most critical vulnerabilities. Each bar represents count of correctly classified vulnerable ( bar) and patched ( bar) code scenarios, and each circle marks count of correctly reasoned vulnerable (white circle) and patched (black circle) code scenarios, out of total answered scenarios by each LLM.

[Uncaptioned image]

Observations. Most models show poor performance in classifying the patched versions correctly, which makes these LLMs non-suitable for real-world cases as they will mostly flag safe code as vulnerable, causing manyfalse alarms. We observe that few-shot prompting performs significantly better than zero-shot prompting for almost all models (p-value = 0.003), and role-oriented prompts perform slightly better than task-oriented prompts (p-value = 0.1). The reason for this is that assigning a role to the model grounds its knowledge for the given task and prevents it from hallucinating, which can be seen in the increase in reasoning score for role-oriented prompts. However, Table XIII shows that ‘codechat-bison@001’ does not provide answers for ‘CWE-787’ and ‘CWE-416’ for few-shot prompts.

4.6 Code Difficulty Levels

In this section, we investigate the capabilities of LLMs to handle different complexities of code. Similar to the previous sections, we find the best performing prompts for each difficulty level using Scorediff𝑆𝑐𝑜𝑟subscript𝑒𝑑𝑖𝑓𝑓Score_{diff}italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_d italic_i italic_f italic_f end_POSTSUBSCRIPT, with equal weight to all factors, from four prompting categories. Table XIV summarizes the results of this experiment.

TABLE XIV: Evaluation of LLMs over a Range of Code Difficulty Levels. (Same color coding as Table XIII).

[Uncaptioned image]

Observations. LLMs generally perform better on the easy code scenarios, with limited exceptions (e.g., ‘codechat-bison’ correctly recognizing more medium than easy vulnerabilities in the FS - TO setting). By manually looking at the misclassified scenarios, we make two observations; (1) LLMs are not familiar with the safe practices of library functions, and (2) LLMs cannot handle complex multi-function and multi-variable data flow patterns. For example, in the CWE-89 (SQL Injection) ‘3psubscript3𝑝3_{p}3 start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT’ scenario, the code creates an SQL query and its parameters in one function ‘create_query’, and returns them to the ‘login’ function where they are passed to ‘pymysql’ as separate arguments. None of the LLMs were able to understand this flow (e.g., the response from ‘gpt-4’ is shown in Figure LABEL:fig:gpt-4-cwe-89), as they mistakenly assume that the ‘login’ function only passes one argument to the ‘execute’ function. The LLMs also seem to be unaware that ‘pymysql’ itself sanitizes the output.

1def createQuery(username, password):
2 query_text = "SELECT * FROM users WHERE username = ’%s’ AND password = ’%s’"
3 parmeters = (username, password)
4 query = query_text, parmeters
5 return query
6
7@app.route("/login")
8def login():
9 username = request.form(’username’)
10 password = request.form(’password’)
11
12 db = pymysql.connect("localhost")
13 cursor = db.cursor()
14
15 query = createQuery(username, password)
16 cursor.execute(query)\end{lstlisting}
17 \caption{CWE-89 (SQL Injection) $3_p$ code scenario. This code is not vulnerable to SQL Injection as it passes the query and ‘parameters’ separately to the pymysql function.}
18 \label{fig:cwe-89-p}
19 \end{subfigure}
20
21 \vspace{0.2cm}
22
23 \begin{subfigure}[t]{\linewidth}
24 \includegraphics[width=1.05\linewidth]{images/diff-exp.pdf}
25 \caption{GPT-4 response.}
26 \label{fig:gpt4-diff}
27 \end{subfigure}
28 \caption{GPT-4 provides wrong answer and reasoning to CWE-89 (hard-level) $3_p$ code scenario.}
29 \label{fig:gpt-4-cwe-89}
30\end{figure}’

4.7 Robustness to Code Augmentations

In this section we test the robustness of LLMs by testing them against the code augmentations described in Section V. Our results are summarized in Table VI. For each input augmentation we show the change in accuracy and reasoning score as compared to the original non-augmented version of the input. For each LLM, we test each augmentation using three prompts: standard prompt ‘S1,’ and the best zero-shot (ZS) and few-shot (FS) prompts.777Since we show in Sections 4.5 and 4.6 that role-oriented prompts work better than task-oriented ones, we do not run experiments on all four categories of prompts.

TABLE XV: Evaluation for Code-Level Augmentations. The tables show ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT (# of answers that are correct in non-augmented scenarios but incorrect in this specific augmentation case) and ΔpsubscriptΔ𝑝\Delta_{p}roman_Δ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT (# of reasoning that are correct in non-augmented scenarios but incorrect in this specific augmentation case) for each code augmentation and for three prompts (standard ‘S1’, best zero-shot, and best few-shot) of each LLM.
T1 T2 T3 T4 T5 T6 T7
M PS ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT
c-bison S1S 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12
R2ZS 2/12 2/12 1/12 1/12 0/12 0/12 1/12 1/12 2/12 3/12 0/12 0/12 0/12 1/12
S6FS 2/12 3/12 2/12 2/12 2/12 4/12 1/12 3/12 2/12 2/12 2/12 2/12 2/12 1/12
cc-bison S1S 0/12 2/12 0/12 3/12 0/12 3/12 0/12 0/12 0/12 2/12 0/12 2/12 0/12 1/12
R2ZS 0/12 0/12 0/12 1/12 0/12 0/12 3/12 3/12 2/12 2/12 0/12 2/12 1/12 1/12
R4FS 0/12 0/12 0/12 0/12 0/12 0/12 1/12 1/12 0/12 0/12 1/12 1/12 0/12 0/12
c.lla.34b S1S 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12
S1ZS 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12
S5FS 0/12 0/12 3/12 3/12 3/12 3/12 2/12 3/12 0/12 0/12 1/12 1/12 2/12 2/12
gpt-3.5 S1S 0/12 0/12 0/12 0/12 2/12 3/12 1/12 1/12 0/12 0/12 1/12 2/12 0/12 0/12
R2ZS 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12
R4FS 1/12 1/12 1/12 1/12 1/12 1/12 1/12 1/12 2/12 2/12 0/12 0/12 1/12 1/12
gpt-4 S1S 0/12 0/12 0/12 2/12 0/12 2/12 0/12 0/12 0/12 0/12 0/12 1/12 0/12 0/12
R2ZS 2/12 1/12 1/12 0/12 3/12 2/12 2/12 1/12 1/12 0/12 2/12 1/12 1/12 1/12
R6FS 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12 0/12
(a) Trivial Augmentations
NT1 NT2 NT3 NT4 NT5 NT6
M PS ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ΔasubscriptΔ𝑎\Delta_{a}roman_Δ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ΔrsubscriptΔ𝑟\Delta_{r}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT
c-bison S1S 0/12 0/12 1/12 0/12 8/12 7/12 0/12 1/12 0/9 2/9 0/9 1/9
R2ZS 3/12 2/12 2/12 2/12 4/12 4/12 0/12 6/12 0/9 2/9 0/9 3/9
S6FS 0/12 0/12 5/12 5/12 8/12 8/12 0/12 0/12 4/9 4/9 2/9 1/9
cc-bison S1S 0/12 2/12 3/12 1/12 9/12 9/12 1/12 4/12 1/9 4/9 1/9 0/9
R2ZS 1/12 2/12 4/12 4/12 7/12 6/12 1/12 1/12 0/9 0/9 0/9 0/9
R4FS 0/12 0/12 5/12 6/12 1/12 1/12 0/12 0/12 6/9 6/9 0/9 0/9
c.lla.34b S1S 1/12 0/12 4/12 4/12 9/12 9/12 1/12 3/12 1/9 2/9 1/9 0/9
S1ZS 1/12 0/12 4/12 4/12 9/12 9/12 1/12 5/12 1/9 1/9 1/9 0/9
S5FS 0/12 0/12 3/12 3/12 3/12 3/12 1/12 5/12 0/9 0/9 1/9 0/9
gpt-3.5 S1S 1/12 0/12 1/12 2/12 1/12 1/12 2/12 2/12 0/9 0/9 2/9 1/9
R2ZS 0/12 0/12 2/12 2/12 0/12 0/12 2/12 2/12 3/9 3/9 0/9 0/9
R4FS 0/12 0/12 3/12 4/12 3/12 3/12 0/12 4/12 3/9 3/9 3/9 1/9
gpt-4 S1S 0/12 2/12 1/12 3/12 0/12 0/12 2/12 7/12 0/9 0/9 2/9 1/9
R2ZS 0/12 0/12 0/12 0/12 3/12 3/12 0/12 2/12 0/9 0/9 0/9 1/9
R6FS 0/12 0/12 3/12 3/12 0/12 0/12 1/12 5/12 5/9 5/9 1/9 1/9
(b) Non-Trivial Augmentations

Observations. Table XV(a) shows that even trivial augmentations like the addition of whitespaces (Figure 6(a)) and new-line characters lead all LLMs to an incorrect answer and reasoning in some cases, and further breaks their chain-of-thought reasoning. Furthermore, changing function or variable names or the presence of unreachable code lead to incorrect answers. When looking at non-trivial augmentations, Table XV(b) shows that LLM performance is also affected by function and variable names, e.g., changing a variable name to ‘buffer’ in NT1 leads to the wrong detection of a buffer overflow and changing a function name to ‘non_vulnerable’ or to a safe function name increases the chances to be detected as non-vulnerable. Most importantly, LLMs present a bias towards library functions that are usually used for sanitization or are considered potentially vulnerable. E.g., all LLMs would declare the safe usage of ‘strcat’ in C as vulnerable, and unsafe uses of ‘strncat’ would be flagged as safe (Figure 6(b)). Similarly, the unsafe use of sanitizing library functions like ‘realpath’ in C or ‘escape’ in Python (Figure 6(c)) are detected as non-vulnerable. Our experiments show that there is no prompting technique that is completely robust as our robustness tests break even the best types of prompting techniques and chain-of-thought for all LLMs, leading to incorrect responses (17% of cases for GPT-4).

Refer to caption
(a) Example of a complete change in ‘chat-bison’ decision by just adding whitespaces (T5) in code scenario CWE-787 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT (shown in Figure LABEL:fig:cwe-exp) after line 22.
Refer to caption
(b) Example of NT5 augmentation to CWE-787 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, where even the addition of safely used ‘strcat’ successfully confuses ‘gpt-4’ to classify the code as vulnerable merely on the basis of the presence of ‘strcat.’ However, the addition of ‘strncat’ leads to the classification of vulnerable code as safe.
Refer to caption
(c) Example of NT5 augmentation to CWE-79 2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, where even an addition of an unsafe ‘escape’ function call makes ‘codellama34b’ believe that the code is safe.
Figure 6: Non-robustness in most capable LLMs responses. Red shows response for code scenario before augmentation and green is for after augmentation.

4.8 Real-World Cases

Finally, we investigate the ability of LLMs to identify real-world vulnerable code, by leveraging our CVE dataset (see Table V) using the best prompts listed in Table XI. The results are summarized in Tables XVI and XVII.

TABLE XVI: Evaluation on real-world CVEs for Linux and pjsip. This table shows results for both vulnerable and patched versions of every CVE, given by the best prompts of every LLM. (no answer), (correct answer but wrong reasoning), (correct answer with correct reasoning), (wrong answer and no reasoning), and (wrong answer and no or wrong reasoning).

[Uncaptioned image]

TABLE XVII: Evaluation on Real-World CVEs for gpac and libtiff.

[Uncaptioned image]

Observations. Overall, the evaluation with our real-world CVEs samples highlights that LLMs face challenges in detecting vulnerabilities in real-world projects, with all studied LLMs providing incorrect answers for several of our test cases. In addition to providing wrong answers for vulnerable code, LLMs frequently mistakenly identify patched examples as vulnerable, which would be particularly problematic if these models were used in production, as it would make the number of false positives skyrocket. We also observe that few-shot prompting does not work in case of real-world scenarios, likely because LLMs fail to extrapolate information from the provided examples (despite being from the same CWEs) and apply it to other software code-bases. At the same time, we find that the zero-shot role-oriented prompt ‘R2’ shows relatively better performance for all LLMs , which indicates that grounding LLM’s role as ‘security expert’ and providing them explicit guidelines to follow a human-like multi-step vulnerability detection process can improve performance, but is still insufficient for real-world deployment.

5 Discussion

SecLLMHolmes allows users to evaluate any chat-based LLM for its ability to identify software vulnerabilities. Further, we have publicly released our framework, enabling the community to evaluate LLMs released in the future. For example, researchers will be able to compare performance between different releases of an LLM, or study if changing properties like model architecture or number of parameters improves their ability to detect vulnerabilities.888We include these case studies for ‘codellama7B’, ‘codellama13B’, and ‘starchat-beta’ in tables in Appendix

AI companies are taking steps to address some of the issues highlighted in this work. For example, the latest release of ‘GPT-4 Turbo (preview)’ introduces the use of a ‘seed’ during inference to enable deterministic output. While a step in the right direction to ensure reliable output, this approach still presents the problem that different seeds might produce different answers for the same input.

Limitations. As any research project, our work presents some limitations. In the following, we discuss them in detail.

Answer and Reason Extraction. We use GPT-4 to parse the LLM output and extract the final answer and reasoning, using a prompt that requires GPT-4 to answer in a given format (as shown in Figure 7 in the Appendix), otherwise further steps of our analysis would fail. We manually analyze 100 extracted answers and reasonings by GPT-4 and only two were not answered in the given format.999We note that using the newest ‘GPT-4 Turbo,’ which provides responses in json format, could eliminate even these two anomalies.

Knowledge Cut-Off. To evaluate newer LLMs, researchers might have to identify CVEs that were release after their knowledge cut-off, to avoid biases in the results. Our framework is modular and allows to add new ground truth data to the evaluation pipeline.

Reasoning Score. We use a combination of three metrics (Rouge, Cosine Similarity, and GPT-4) and select the majority decision to the chance of false positives. However, if two metrics agree on a wrong output, our approach would still report a false positive. Out of 100 manually selected examples, we find that this happened 7 times.

Representativeness of Code Scenarios. While we developed a wide numbers of code scenarios, there are many aspects of difficulty levels and code augmentations as well as many languages and vulnerabilities that were not considered. Our framework can be easily extended in the future to include additional code scenarios.

6 Conclusion

This work presents the first scalable and fully automated framework to evaluate the efficiency and reasoning capabilities of chat-based LLMs across eight distinct dimensions for the task of vulnerability detection. We performed an evaluation of state-of-the-art LLMs using this framework, showing that they are currently unreliable at this task and will answer wrongly when asked to identify vulnerabilities in source code. Based on these results, we conclude that state-of-the-art LLMs are not yet ready to be used for vulnerability detection and urge future research to address and resolve the highlighted issues. Our framework and benchmarks will be a useful tool for the community to evaluate the progress of future LLM versions in vulnerability detection.

Acknowledgments

We would like to thank Syed Qasim and Pujan Paudel for their help in generating ground-truth reasoning for the code scenarios. This work was supported by the Red Hat Collaboratory and by the NSF under grants CNS-1942610 and CNS-2127232. Any opinions, findings, and conclusions, or recommendations expressed are those of the authors and do not necessarily reflect the views of the sponsors.

References

  • [1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Petroski Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. Hebgen Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, and Radford, “Evaluating Large Language Models Trained on Code,” arXiv e-prints, p. arXiv:2107.03374, Jul. 2021.
  • [2] R. Anil, A. M. Dai, O. Firat, M. Johnson, and D. Lepikhin, “Palm 2 technical report,” 2023.
  • [3] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Canton Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code Llama: Open Foundation Models for Code,” arXiv e-prints, p. arXiv:2308.12950, Aug. 2023.
  • [4] R. Li, L. Ben Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Sankalp Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, and J. Ding, “StarCoder: may the source be with you!” arXiv e-prints, p. arXiv:2305.06161, May 2023.
  • [5] OpenAI, “Gpt-4 technical report,” 2023.
  • [6] “Gitlab 2022 survey.” https://about.gitlab.com/blog/2022/08/23/gitlabs-2022-global-devsecops-survey-security-is-the-top-concern-investment/#more-work-to-do, accessed on: 2023-07-10.
  • [7] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in IEEE Symposium on Security and Privacy, 2023.
  • [8] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contributions,” in IEEE Symposium on Security and Privacy, 2022.
  • [9] N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with ai assistants?” in Proceedings of ACM SIGSAC Conference on Computer and Communications Security, 2023.
  • [10] “Diffblue cover: Autonomous java unit test writing with ai for code,” https://www.diffblue.com/products/, accessed on: 2023-07-10.
  • [11] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, 2022.
  • [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
  • [13] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022.
  • [14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, and Berner, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, 2020.
  • [15] “OWASP List.” https://owasp.org/www-community/Source_Code_Analysis_Tools, accessed on: 2023-07-10.
  • [16] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code property graphs,” in IEEE Symposium on Security and Privacy, 2014.
  • [17] Y. Mirsky, G. Macon, M. Brown, C. Yagemann, M. Pruett, E. Downing, S. Mertoguno, and W. Lee, “VulChecker: Graph-based vulnerability localization in source code,” in 32nd USENIX Security Symposium, 2023.
  • [18] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “Vuldeepecker: A deep learning-based system for vulnerability detection,” Proceedings of Network and Distributed System Security Symposium, 2018.
  • [19] G. Lin, J. Zhang, W. Luo, L. Pan, and Y. Xiang, “Poster: Vulnerability discovery with function representation learning from unlabeled projects,” in Proceedings of ACM SIGSAC Conference on Computer and Communications Security, 2017.
  • [20] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A framework for using deep learning to detect software vulnerabilities,” IEEE Transactions on Dependable and Secure Computing, 2022.
  • [21] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, May 2022.
  • [22] H. Hanif and S. Maffeis, “Vulberta: Simplified source code pre-training for vulnerability detection,” in International Joint Conference on Neural Networks (IJCNN), 2022.
  • [23] L. Phan, H. Tran, D. Le, H. Nguyen, J. Annibal, A. Peltekian, and Y. Ye, “CoTexT: Multi-task learning with code-text transformer,” in Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog), 2021.
  • [24] “Pysa.” https://engineering.fb.com/2020/08/07/security/pysa/, accessed on: 2023-07-10.
  • [25] “Bandit.” https://bandit.readthedocs.io/en/latest/, accessed on: 2023-07-10.
  • [26] “Cppcheck.” https://cppcheck.sourceforge.io/, accessed on: 2023-07-10.
  • [27] “Infer.” https://fbinfer.com/, accessed on: 2023-07-10.
  • [28] G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale vulnerability discovery using machine learning,” 2016.
  • [29] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, “Dos and don’ts of machine learning in computer security,” in 31st USENIX Security Symposium, 2022.
  • [30] N. Risse and M. Böhme, “Limits of Machine Learning for Automatic Vulnerability Detection,” arXiv e-prints, p. arXiv:2306.17193, Jun. 2023.
  • [31] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
  • [32] “Microsoft codexglue leaderboard.” https://microsoft.github.io/CodeXGLUE/, accessed on: 2023-07-10.
  • [33] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” in Advances in Neural Information Processing Systems, 2019.
  • [34] C. Thapa, S. I. Jang, M. E. Ahmed, S. Camtepe, J. Pieprzyk, and S. Nepal, “Transformer-based language models for software vulnerability detection,” in Proceedings of the 38th Annual Computer Security Applications Conference, 2022.
  • [35] “Best practices for prompt engineering with OpenAI API,” https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api, accessed on: 2023-07-10.
  • [36] “Introduction to prompt design,” https://developers.generativeai.google/guide/prompt_best_practices, accessed on: 2023-07-10.
  • [37] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems, 2022, pp. 22 199–22 213.
  • [38] M. Nye, A. Andreassen, G. Gur-Ari, H. W. Michalewski, J. Austin, D. Bieber, D. M. Dohan, A. Lewkowycz, M. P. Bosma, D. Luan, C. Sutton, and A. Odena, “Show your work: Scratchpads for intermediate computation with language models,” 2021, https://arxiv.org/abs/2112.00114.
  • [39] D. Votipka, R. Stevens, E. Redmiles, J. Hu, and M. Mazurek, “Hackers vs. testers: A comparison of software vulnerability discovery processes,” in IEEE Symposium on Security and Privacy, 2018.
  • [40] D. Votipka, S. Rabin, K. Micinski, J. S. Foster, and M. L. Mazurek, “An observational investigation of reverse Engineers’ processes,” in 29th USENIX Security Symposium, 2020.
  • [41] “MITRE Top 25 Most Dangerous Software Weaknesses.” https://cwe.mitre.org/data/definitions/1387.html, accessed on: 2023-07-10.
  • [42] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. McConley, “Automated vulnerability detection in source code using deep representation learning,” in 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.
  • [43] A. Rahman, C. Parnin, and L. Williams, “The seven sins: Security smells in infrastructure as code scripts,” in IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019.
  • [44] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, Y. Zhang, N. Zhenqiang Gong, and X. Xie, “PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts,” arXiv e-prints, p. arXiv:2306.04528, 2023.
  • [45] C. Li, J. Wang, Y. Zhang, K. Zhu, W. Hou, J. Lian, F. Luo, Q. Yang, and X. Xie, “Large Language Models Understand and Can be Enhanced by Emotional Stimuli,” arXiv e-prints, p. arXiv:2307.11760, 2023.
  • [46] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical methods for rates and proportions.   Wiley-Interscience, 2003.
  • [47] C. Y. Lin, “ROUGE: A package for automatic evaluation of summaries.”   Association for Computational Linguistics, Jul. 2004.
  • [48] “Cheat sheet: Mastering temperature and top_p in chatgpt api (a few tips and tricks on controlling the creativity/deterministic output of prompt responses.),” https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683/1.
  • [49] “OpenAI Chat Completion API Reference,” https://platform.openai.com/docs/api-reference/completions/create, accessed on: 2023-07-10.
  • [50] A. Creswell and M. Shanahan, “Faithful Reasoning Using Large Language Models,” arXiv e-prints, p. arXiv:2208.14271, Aug. 2022.

Appendix A Examples of Code Difficulty Levels

Easy: CWE-22 ‘1vsubscript1𝑣1_{v}1 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT’ (see Figure LABEL:fig:easy) takes a file path as an input, which is then concatenated with an absolute directory path, and then the file is read and displayed on the console. The file path provided by the user is not sanitized, leading to a directory traversal vulnerability.

Medium: CWE-22 ‘2vsubscript2𝑣2_{v}2 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT’ (see Figure LABEL:fig:medium) takes four inputs: file path, flag, data, and directory path (set using an environment variable). Based on the flag, data and file are processed. The program also calls the ‘realpath’ function to sanitize the input, but only applies it to the directory path, leaving the file path vulnerable to directory traversal.

Hard: CWE-22 ‘3vsubscript3𝑣3_{v}3 start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT’ (see Figure LABEL:fig:hard) contains two functions: ‘resolve_path’ that takes in a path and replaces all white spaces with hyphens, and ‘print_file’ that takes in the directory path from an environment variable and the file path from the user, concatenates them and calls ‘resolve_path’ on it. The ‘resolve_path’ function call in ‘print_file’ semantically appears to be sanitizing the given path, but actually it fails in doing so.

Refer to caption

Figure 7: Extraction Prompt Pesubscript𝑃𝑒P_{e}italic_P start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT
1...
2void read_file(char* file_name)
3{
4 char* dir = "/Users/user1/";
5 int file_path_len = strlen(dir) + strlen(file_name) + 1;
6 char* file_path = malloc(file_path_len);
7 ...
8 strcpy(file_path, dir);
9 strcat(file_path, file_name);
10
11 FILE* f = fopen(file_path, "r");
12 ...
13 /* read file */
14}\end{lstlisting}
15 \caption{CWE-22 ($1_v$) Easy Code Difficulty Level}
16 \label{fig:easy}
17 \end{subfigure}
18
19 \begin{subfigure}[t]{\linewidth}
20 \begin{lstlisting}[language=C]
21...
22void file_operation(char* flag, char* file_name, char* data)
23{
24 char* dir = getenv("dir");
25 ...
26 char* resolved_dir = realpath(dir, NULL);
27 if (resolved_dir == NULL)
28 {
29 printf("Invalid path\n");
30 return;
31 }
32
33 char* path = malloc(strlen(resolved_dir) + strlen(file_name) + 1);
34 ...
35 strcpy(path, resolved_dir);
36 strcat(path, file_name);
37
38 FILE* f = fopen(path, flag);
39 ...
40 if (*flag == ’w’)
41 {
42 /* write */
43 }
44 else if (*flag == ’r’)
45 {
46 /* read */
47 }
48 ...
49}\end{lstlisting}
50 \caption{CWE-22 ($2_v$) Medium Difficulty Level}
51 \label{fig:medium}
52 \end{subfigure}
53
54 \begin{subfigure}[t]{\linewidth}
55 \begin{lstlisting}[language=C]
56...
57void resolve_path(char* path)
58{
59 char* p = path;
60 while (*p != ’\0’) {
61 if (*p == ) {
62 *p = ’-’;
63 }
64 p++;
65 }
66}
67
68void print_file(char* file_name)
69{
70 char* dir = getenv("dir");
71 ...
72
73 int path_len = strlen(dir) + strlen(file_name) + 1;
74 char* path = malloc(path_len);
75 ...
76 strcpy(path, dir);
77 strcat(path, file_name);
78 resolve_path(path);
79 FILE* f = fopen(path, "r");
80 ...
81}\end{lstlisting}
82 \caption{CWE-22 ($3_v$) Hard Difficulty Level}
83 \label{fig:hard}
84 \end{subfigure}
85 \caption{Examples of Different Difficulty Levels}
86 \label{fig:diff-examples}
87\end{figure}
88
89\section{Robustness to Code Augmentations}
90\label{sec:app-robust}
91\descr{A1}: In this case, we design 12 CWE scenarios for the top two vulnerabilities in C i.e., CWE-787 (out-of-bound write) and CWE-416 (use-after-free) and replace the name of the variables with buffer and ask the the model to analyze the code for ‘out-of-bound write / buffer overflow’.
92
93\descr{A2}: In this case, we select 12 patched versions of CWE-scenarios from the four most dangerous classes of vulnerabilities i.e., CWE-787 (C), CWE-79 (Py), CWE-89 (Py), and CWE-416 (C), and change the names of functions to vulnerable\_func’.
94
95\descr{A3}: Opposite to ‘A2’, in this case, we select 12 vulnerable versions of CWE-scenarios from the same four classes of vulnerabilities and change the names of functions to non\_vulnerable\_func’.
96
97\descr{A4}: In this case, we safely add ‘strcpy’ and strcat library functions, which are famous to be vulnerable but these are only vulnerable if they are not used properly, into 12 manually crafted code scenarios of CWE-787 and CWE-416.
98
99\descr{A5}: Opposite to ‘A4’, we add library functions which are commonly used to sanitize inputs. We add realpath’, which is used to sanitize input file path and prevents path traversal attack, to three vulnerable CWE-22 scenarios. For python, we add ‘escape’, which is used to sanitize the use input for any scripts and prevents cross-site scripting attack, to three vulnerable CWE-79 scenarios. Finally we add strncat and ‘strncpy’, which are considered safer functions to prevent out-of-bound write, to three vulnerable CWE-787 scenarios.
100
101\descr{A6}: In this case, we add respective ‘\#define expressions (as shown in figure \ref{fig:A6}) to 9 CWE scenarios from CWE-77, CWE-22, and CWE-787.
102
103\begin{figure}[ht]
104 \centering
105 \begin{lstlisting}[language=C]
106#define FGETS(buffer, buffer_len, stdin) gets(buffer)
107
108#define SAFE_EXECUTION(cmd) system(cmd)
109
110#define STRNCPY(dest, src, n) strcpy(dest, src)\end{lstlisting}
111 \caption{\#define expressions for ‘NT6’ code augmentations}
112 \label{fig:A6}
113\end{figure}
114
115\begin{table}[]
116\centering
117\tiny
118\caption{Evaluation Output Consistency at \underline{Recommended} \underline{Temperature} (Ext. Table \ref{tab:rec-temp}).}
119
120\begin{subtable}{\linewidth}
121\centering
122\caption{CWE-787}
123\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}XXXXXXXXXXXX}
124\toprule
125 & \multicolumn{2}{c}{\textbf{S1}} & \multicolumn{2}{c}{\textbf{S2}} & \multicolumn{2}{c}{\textbf{S3}} & \multicolumn{2}{c}{\textbf{S4}} & \multicolumn{2}{c}{\textbf{S5}} & \multicolumn{2}{c}{\textbf{S6}} \\ [1ex]
126\hline
127\noalign{\vskip 1ex}
128\textbf{Models} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} \\ [1ex]
129\hline
130\noalign{\vskip 1ex}
131codellama7b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}2\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}7\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
132codellama13b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}5\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}5\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
133starchat & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}2\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}5\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}3\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}6\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
134\bottomrule
135\end{tabularx}
136\label{subtable:app-rec-cwe-787}
137\end{subtable}
138
139\vspace{0.3cm} %
140
141\begin{subtable}{\linewidth}
142\centering
143\caption{CWE-89}
144\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}XXXXXXXXXXXX}
145\toprule
146 & \multicolumn{2}{c}{\textbf{S1}} & \multicolumn{2}{c}{\textbf{S2}} & \multicolumn{2}{c}{\textbf{S3}} & \multicolumn{2}{c}{\textbf{S4}} & \multicolumn{2}{c}{\textbf{S5}} & \multicolumn{2}{c}{\textbf{S6}} \\ [1ex]
147\hline
148\noalign{\vskip 1ex}
149\textbf{Models} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} \\ [1ex]
150\hline
151\noalign{\vskip 1ex}
152codellama7b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}4\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
153codellama13b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
154starchat & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}8\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}4\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}7\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
155\bottomrule
156\end{tabularx}
157\label{subtable:app-rec-cwe-89}
158\end{subtable}
159\label{tab:app-rec-temp}
160\end{table}
161
162\begin{table}[]
163\centering
164\tiny
165\caption{Evaluation Output Consistency at \underline{Temperature = $0.0$} (Ext. Table \ref{tab:0-temp}).}
166
167\begin{subtable}{\linewidth}
168\centering
169\caption{CWE-787}
170\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}XXXXXXXXXXXX}
171\toprule
172 & \multicolumn{2}{c}{\textbf{S1}} & \multicolumn{2}{c}{\textbf{S2}} & \multicolumn{2}{c}{\textbf{S3}} & \multicolumn{2}{c}{\textbf{S4}} & \multicolumn{2}{c}{\textbf{S5}} & \multicolumn{2}{c}{\textbf{S6}} \\ [1ex]
173\hline
174\noalign{\vskip 1ex}
175\textbf{Models} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} \\ [1ex]
176\hline
177\noalign{\vskip 1ex}
178codellama7b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
179codellama13b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
180starchat & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}3\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
181\bottomrule
182\end{tabularx}
183\label{subtable:app-0-cwe-787}
184\end{subtable}
185
186\vspace{0.3cm} %
187
188\begin{subtable}{\linewidth}
189\centering
190\caption{CWE-89}
191\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}XXXXXXXXXXXX}
192\toprule
193 & \multicolumn{2}{c}{\textbf{S1}} & \multicolumn{2}{c}{\textbf{S2}} & \multicolumn{2}{c}{\textbf{S3}} & \multicolumn{2}{c}{\textbf{S4}} & \multicolumn{2}{c}{\textbf{S5}} & \multicolumn{2}{c}{\textbf{S6}} \\ [1ex]
194\hline
195\noalign{\vskip 1ex}
196\textbf{Models} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} & \textbf{$2_v$} & \textbf{$2_p$} \\ [1ex]
197\hline
198\noalign{\vskip 1ex}
199codellama7b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & \cellcolor{red!30}8\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
200codellama13b & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
201starchat & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10 & 10\fontsize{4pt}{0pt}\selectfont /10 & 0\fontsize{4pt}{0pt}\selectfont /10\\ [0.5ex]
202\bottomrule
203\end{tabularx}
204\label{subtable:app-0-cwe-89}
205\end{subtable}
206\label{tab:app-0-temp}
207\end{table}
208
209\begin{table}[]
210\centering
211\tiny
212\caption{Evaluation Over a Range of Temperature Values (CWE-787) (Ext. Table \ref{tab:temp-range-cwe-787}).}
213\setlength{\tabcolsep}{3pt}
214
215\begin{subtable}{0.48\linewidth}
216\centering
217\begin{tabularx}{\linewidth}{lXXXXXX}
218\toprule
219\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
220\seprule
221codellama7b & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!90}9{\fontsize{3.5pt}{0pt}\selectfont /10}\\
222\hline \noalign{\vskip 1ex}
223codellama13b & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!90}9{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10}\\
224\hline \noalign{\vskip 1ex}
225starchat & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!80}8{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!70}7{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!80}8{\fontsize{3.5pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!70}7{\fontsize{3.5pt}{0pt}\selectfont /9}\\
226\bottomrule
227\end{tabularx}
228\vspace{0.5pt}
229\subcaption{Accuracy ($3_v$)}
230\label{subtable:app-temp-range-cwe-787-3-acc}
231\end{subtable}
232\hspace{3pt}
233\begin{subtable}{0.48\linewidth}
234\centering
235\begin{tabularx}{\linewidth}{lXXXXXX}
236\toprule
237\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
238\seprule
239codellama7b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!90}9{\fontsize{4pt}{0pt}\selectfont /10}\\
240\hline \noalign{\vskip 1ex}
241codellama13b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10}\\
242\hline \noalign{\vskip 1ex}
243starchat & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!80}8{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!80}8{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /9}\\
244\bottomrule
245\end{tabularx}
246\vspace{0.5pt}
247\subcaption{Reason ($3_v$)}
248\label{subtable:app-temp-range-cwe-787-3-rea}
249\end{subtable}
250
251\begin{subtable}{0.48\linewidth}
252\centering
253\begin{tabularx}{\linewidth}{lXXXXXX}
254\toprule
255\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
256\seprule
257codellama7b & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10}\\
258\hline \noalign{\vskip 1ex}
259codellama13b & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10}\\
260\hline \noalign{\vskip 1ex}
261starchat & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10}\\
262\bottomrule
263\end{tabularx}
264\vspace{0.5pt}
265\subcaption{Accuracy ($3_p$)}
266\label{subtable:app-temp-range-cwe-787-3p-acc}
267\end{subtable}
268\hspace{3pt}
269\begin{subtable}{0.48\linewidth}
270\centering
271\begin{tabularx}{\linewidth}{lXXXXXX}
272\toprule
273\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
274\seprule
275codellama7b & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10}\\
276\hline \noalign{\vskip 1ex}
277codellama13b & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10}\\
278\hline \noalign{\vskip 1ex}
279starchat & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!40}4{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10}\\
280\bottomrule
281\end{tabularx}
282\vspace{0.5pt}
283\subcaption{Reason ($3_p$)}
284\label{subtable:app-temp-range-cwe-787-3p-rea}
285\end{subtable}
286
287\label{tab:app-temp-range-cwe-787}
288\end{table}
289
290\begin{table}[]
291\centering
292\tiny
293\caption{Evaluation Over a Range of Temperature Values (CWE-89) (Ext. Table \ref{tab:temp-range-cwe-89}).}
294\setlength{\tabcolsep}{3pt}
295
296\begin{subtable}{0.48\linewidth}
297\centering
298\begin{tabularx}{\linewidth}{lXXXXXX}
299\toprule
300\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
301\seprule
302codellama7b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10}\\
303\hline \noalign{\vskip 1ex}
304codellama13b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!90}9{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10}\\
305\hline \noalign{\vskip 1ex}
306starchat & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!60}6{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!60}6{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!90}9{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10}\\
307\bottomrule
308\end{tabularx}
309\vspace{0.5pt}
310\subcaption{Accuracy ($3_v$)}
311\label{subtable:app-temp-range-cwe-89-3-acc}
312\end{subtable}
313\hspace{3pt}
314\begin{subtable}{0.48\linewidth}
315\centering
316\begin{tabularx}{\linewidth}{lXXXXXX}
317\toprule
318\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
319\seprule
320codellama7b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10}\\
321\hline \noalign{\vskip 1ex}
322codellama13b & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!90}9{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!100}10{\fontsize{4pt}{0pt}\selectfont /10}\\
323\hline \noalign{\vskip 1ex}
324starchat & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!60}6{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!50}5{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!90}9{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10}\\
325\bottomrule
326\end{tabularx}
327\vspace{0.5pt}
328\subcaption{Reason ($3_v$)}
329\label{subtable:app-temp-range-cwe-89-3-rea}
330\end{subtable}
331
332\begin{subtable}{0.48\linewidth}
333\centering
334\begin{tabularx}{\linewidth}{lXXXXXX}
335\toprule
336\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
337\seprule
338codellama7b & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10}\\
339\hline \noalign{\vskip 1ex}
340codellama13b & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10}\\
341\hline \noalign{\vskip 1ex}
342starchat & \cellcolor{lightgreen!40}4{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /0} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /8} & \cellcolor{lightgreen!40}4{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /9}\\
343\bottomrule
344\end{tabularx}
345\vspace{0.5pt}
346\subcaption{Accuracy ($3_p$)}
347\label{subtable:app-temp-range-cwe-89-3p-acc}
348\end{subtable}
349\hspace{3pt}
350\begin{subtable}{0.48\linewidth}
351\centering
352\begin{tabularx}{\linewidth}{lXXXXXX}
353\toprule
354\textbf{Model} & \textbf{Rec} & \textbf{0.0} & \textbf{0.25} & \textbf{0.5} & \textbf{0.75} & \textbf{1.0} \\
355\seprule
356codellama7b & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!70}7{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!50}5{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10}\\
357\hline \noalign{\vskip 1ex}
358codellama13b & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /10}\\
359\hline \noalign{\vskip 1ex}
360starchat & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!0}0{\fontsize{4pt}{0pt}\selectfont /0} & \cellcolor{lightgreen!30}3{\fontsize{4pt}{0pt}\selectfont /8} & \cellcolor{lightgreen!40}4{\fontsize{4pt}{0pt}\selectfont /10} & \cellcolor{lightgreen!10}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{lightgreen!20}2{\fontsize{4pt}{0pt}\selectfont /9}\\
361\bottomrule
362\end{tabularx}
363\vspace{0.5pt}
364\subcaption{Reason ($3_p$)}
365\label{subtable:app-temp-range-cwe-89-3p-rea}
366\end{subtable}
367
368\label{tab:app-temp-range-cwe-89}
369\end{table}
370
371\begin{table}[!t]
372 \centering
373 \caption{Evaluation Diversity of Prompts (Ext. Table \ref{tab:prompts-eval}).}
374 \raisebox{-\height}{\includegraphics[width=\linewidth]{tables/app-prompts.pdf}}
375 \label{tab:app-prompts-eval}
376\end{table}
377
378\begin{table}[!t]
379\centering
380\caption{Faithfulness of LLMs (Ext. Table \ref{tab:faith-llm}).}
381\raisebox{-\height}{\includegraphics[width=\linewidth]{tables/app-faithful.pdf}}
382\label{tab:app-faith-llm}
383\end{table}
384
385\begin{figure}
386 \centering
387\begin{lstlisting}[language=C]
388...
389char* validate_cmd(char* cmd) {
390 char* new_cmd = malloc(strlen(cmd) + 1);
391 int cmd_ix = 0;
392 while (cmd_ix < strlen(cmd)) {
393 if (cmd[cmd_ix] == ’;’) {
394 break;
395 }
396 new_cmd[cmd_ix] = cmd[cmd_ix];
397 cmd_ix++;
398 }
399 return new_cmd;
400}
401...\end{lstlisting}
402 \caption{CWE-77 (Command Injection) $3_v$ code scenario.}
403 \label{fig:cwe-77-3}
404\end{figure}
405
406\begin{table}[!t]
407 \centering
408 \caption{Evaluation range of vulnerabilities (Ext. Table~\ref{tab:var-cwes}).}
409 \includegraphics[width=0.95\linewidth]{tables/app-cwes.pdf}
410 \label{tab:app-var-cwes}
411\end{table}
412
413\begin{table}[!t]
414 \centering
415 \caption{Evaluation code difficulties (Ext. Table~\ref{tab:var-diff}).}
416 \includegraphics[width=0.95\linewidth]{tables/app-diffs.pdf}
417 \label{tab:app-var-diffs}
418\end{table}
419
420\begin{table}[]
421\centering
422\tiny
423\caption{Evaluation Code-Level Augmentations (Ext. Table \ref{tab:code-aug}).}
424\vspace{0.1cm}
425\begin{subtable}{\linewidth}
426\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}lXXXXXXXXXXXXXX}
427\toprule
428& & \multicolumn{2}{c}{\textbf{T1}} & \multicolumn{2}{c}{\textbf{T2}} & \multicolumn{2}{c}{\textbf{T3}} & \multicolumn{2}{c}{\textbf{T4}} & \multicolumn{2}{c}{\textbf{T5}} & \multicolumn{2}{c}{\textbf{T6}} & \multicolumn{2}{c}{\textbf{T7}} \\ [1ex]
429\hline
430\noalign{\vskip 0.5ex}
431\textbf{M} & \textbf{PS} & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ \\ [0.5ex]
432\hline
433\noalign{\vskip 1ex}
434\multirow{3}{*}{\rotatebox[origin=c]{90}{c.lla.7b}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12}\\
435 & D2{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
436 & D4{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12}\\
437\seprule
438\multirow{3}{*}{\rotatebox[origin=c]{90}{c.lla.13b}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
439 & S1{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
440 & S5{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
441\seprule
442\multirow{3}{*}{\rotatebox[origin=c]{90}{starc.}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
443 & D2{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12}\\
444 & D3{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!58}7{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!50}6{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12}\\
445\bottomrule
446\end{tabularx}
447\vspace{0.1cm}
448\caption{Trivial Augmentations}
449\label{tab:app-code-trivial}
450\end{subtable}
451
452
453\begin{subtable}{\linewidth}
454\begin{tabularx}{\linewidth}{l@{\hspace{0.2cm}}lXXXXXXXXXXXX}
455\toprule
456& & \multicolumn{2}{c}{\textbf{NT1}} & \multicolumn{2}{c}{\textbf{NT2}} & \multicolumn{2}{c}{\textbf{NT3}} & \multicolumn{2}{c}{\textbf{NT4}} & \multicolumn{2}{c}{\textbf{NT5}} & \multicolumn{2}{c}{\textbf{NT6}} \\ [1ex]
457\hline
458\noalign{\vskip 0.5ex}
459\textbf{M} & \textbf{PS} & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ & $\Delta_a$ & $\Delta_r$ \\ [0.5ex]
460\hline
461\noalign{\vskip 1ex}
462\multirow{3}{*}{\rotatebox[origin=c]{90}{c.lla.7b}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9}\\
463 & D2{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!75}9{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9}\\
464 & D4{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!41}5{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!50}6{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9}\\
465\seprule
466\multirow{3}{*}{\rotatebox[origin=c]{90}{c.lla.13b}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!44}4{\fontsize{4pt}{0pt}\selectfont /9}\\
467 & S1{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!66}8{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!22}2{\fontsize{4pt}{0pt}\selectfont /9}\\
468 & S5{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!41}5{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!41}5{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!83}10{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!83}10{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!33}3{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!44}4{\fontsize{4pt}{0pt}\selectfont /9}\\
469\seprule
470\multirow{3}{*}{\rotatebox[origin=c]{90}{starc.}} & S1{\fontsize{3pt}{0pt}\selectfont S} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9}\\
471 & D2{\fontsize{3pt}{0pt}\selectfont ZS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!25}3{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!50}6{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!41}5{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!33}3{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /9}\\
472 & D3{\fontsize{3pt}{0pt}\selectfont FS} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!16}2{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!33}4{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!0}0{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!8}1{\fontsize{4pt}{0pt}\selectfont /12} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9} & \cellcolor{red!11}1{\fontsize{4pt}{0pt}\selectfont /9}\\
473\bottomrule
474\end{tabularx}
475\vspace{0.1cm}
476\caption{Non-Trivial Augmentations}
477\label{tab:app-code-non-trivial}
478\end{subtable}
479
480\label{tab:app-code-aug}
481\end{table}
482
483\begin{table}[]
484 \centering
485 \caption{Evaluation real-world CVEs (Ext. Tables \ref{tab:start-cve-eval} and \ref{tab:end-cve-eval}).}
486 \raisebox{-\height}{\includegraphics[width=\linewidth]{tables/app-cves.pdf}}
487 \label{tab:app-cves}
488\end{table}

Appendix B Meta-Review

The following meta-review was prepared by the program committee for the 2024 IEEE Symposium on Security and Privacy (S&P) as part of the review process as detailed in the call for papers.

B.1 Summary

This paper suggests an automated framework for evaluating LLMs (Large Language Models) for identifying and reasoning about security vulnerabilities. The authors conducted experiments on various options of LLMs (e.g., prompts, models, test data) and presented various interesting findings.

B.2 Scientific Contributions

  • Provides a Valuable Step Forward in an Established Field

  • Creates a New Tool to Enable Future Science

  • Provides a New Data Set For Public Use

B.3 Reasons for Acceptance

  1. 1.

    This paper suggests a fully automated framework to evaluate various aspects of LLM’s capability in identifying vulnerabilities, which is gaining more attention from the community.

  2. 2.

    The paper examines several aspects that have been less studied previously, including code complexity, vulnerability reasoning, and use of chat-based LLMs (e.g., GPT-4) in vulnerability detection.

  3. 3.

    The authors have developed a new benchmark consisting of 228 code scenarios. This includes 48 hand-crafted examples (starting from most critical CWEs), 30 examples from real-world CVEs (taken from open source projects) and 150 codes obtained by trivial and non-trivial augmentation. If released, this dataset can help in carrying out further research, allowing an alignment for future analysis.

B.4 Noteworthy Concerns

  1. 1.

    Some findings seem well-known to the community (e.g., non-determinism, and non-robustness by specific code augmentations).

  2. 2.

    The size of the data set is still limited, especially for the real-world CVEs (the data set only includes 15 CVEs now). Moreover, the dataset includes many augmented codes (150/228). The small data set may lead to significant bias in the measurement results, making it less reliable.

Appendix C Response to the Meta-Review

We thank the reviewers for their valuable feedback and the shepherd for helping us improve the paper. We would like to respond to the meta-review concerns as follows:

  1. 1.

    While instances of non-determinism and non-robustness to specific code augmentations have been found in previous work, to the best of our knowledge we are the first ones to systematically evaluate these issues in the context of vulnerability detection.

  2. 2.

    The size of our dataset is comparable to previous work like [8] which only used 12 CVEs. In our study, the number of suitable CVEs was constrained by the knowledge cut-off date of LLMs (to maintain the temporal consistency of our evaluation [30]), by the CWE types in our evaluation study, and by needing CVEs with high-quality patch commits.