LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?):
A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah Boston University
saadu@bu.edu Mingji Han Boston University
mjhan@bu.edu Saurabh Pujar IBM Research
saurabh.pujar@
ibm.com Hammond Pearce UNSW Sydney
hammond.pearce@
unsw.edu.au Ayse Coskun Boston University
acoskun@bu.edu Gianluca Stringhini Boston University
gian@bu.edu

Abstract

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like ‘PaLM2’ and ‘GPT-4’: by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in $26\%$ and $17\%$ of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.

1 Introduction

Large Language Models (LLMs), such as OpenAI’s Codex [1], Google’s PaLM2 [2], Meta’s Codellama [3], and StarCoder [4], etc., have demonstrated great potential in performing programming-language related tasks such as code generation, code documentation , and debugging. In 2022, around 1.2 million developers used Copilot, and since then we have witnessed the release of increasingly capable LLM models at a quick pace [2, 5, 4, 3]. LLMs could be particularly useful to help developers with their cybersecurity needs, as humans typically produce and miss many security relevant bugs. This issue was highlighted in the 2022 GitLab Survey [6], noting that “developers do not find enough bugs early enough” and “do not prioritize the bug remediation” when developing. It is pertinent then to investigate if LLMs could be an aid towards early identification of security problems, especially as LLMs have already been suggested for use in automated bug repair [7].

In this paper, we aim to answer the following question: Can LLMs be used as helpful security assistants for vulnerability detection? This is an important question, especially as LLMs are not infallible in security-related tasks, for example introducing vulnerabilities into source code [8, 9] and software testing [10]. Unfortunately, there is no standardized and automated approach to evaluate the performance of LLMs at identifying vulnerable code. We fill this gap by introducing SecLLMHolmes, a generalized, fully automated, and scalable framework to systematically evaluate the performance (i.e., accuracy and reasoning capabilities) of LLMs for vulnerability detection. Our framework tests the capabilities of a given LLM as a security assistant across eight distinct dimensions: (1) deterministic response, (2) performance over range of parameters, (3) diversity of prompts, (4) faithful reasoning, (5) evaluation over variety of vulnerabilities, (6) assessment of various code difficulty levels, (7) robustness to code augmentations, and (8) use in real-world projects.

We apply our framework to eight of the most capable LLMs across 228 code scenarios spanning over 8 most critical vulnerabilities in C and Python, and show that: (a) LLM performance varies widely depending on the model and the prompting technique used, however all models analyzed have a high false positive rate (FPR), and flag code where vulnerabilities have been patched as still vulnerable. (b) the output of LLMs is non-deterministic, with all models changing their answers over multiple runs for one or more of our tests. (c) even when they correctly identify a vulnerability, the reasoning that LLMs provide for this decision is often incorrect, questioning their trustworthiness. (d) LLM chain-of-thought reasoning [11] is not robust, and can be ‘confused’ by even simple code augmentations such as whitespace modification, changing function names, or using different but related library functions. Also, (e) LLMs fail at detecting vulnerabilities in real-world projects. Our study provides significant evidence that LLMs are not yet ready to be used for automated vulnerability detection, and the successful usage of our framework as a benchmark suite by future models would demonstrate meaningful progress in this space.

This paper makes the following contributions:

•

We develop SecLLMHolmes, a comprehensive framework to test LLMs for their ability to identify and reason about software vulnerabilities. Our framework is fully automated and includes a set of 228 code scenarios, and 17 prompting techniques. We publicly release our framework and dataset¹¹1https://github.com/ai4cloudops/SecLLMHolmes, allowing the community to test newly developed LLMs and easily keep track of their progress in being able to identify vulnerabilities.
•

We use our framework to test eight state-of-the-art LLMs for the task of vulnerability detection, showing that as of today no LLM achieves satisfacory performance at it.
•

We identify and enumerate a set of shortcomings that current LLMs show (as outlined above). Our observations provide a checklist for researchers working in this space, showing aspects that need to be addressed before LLMs can be considered ready to be used in the wild for the task of vulnerability detection.

2 Background and Related Work

Large Language Models (LLMs). All language models work on the basic principle of next word (token) prediction; i.e., given a sequence of words (tokens) $x_{1},x_{2},...,x_{n-1}$ select a word (token) $x_{n}$ with the highest probability to appear next in the sequence

x_{n}=\arg\max_{w\in V}P(w|x_{1},x_{2},\dots,x_{n-1}),

where $V$ is the vocabulary of the model. Language models learn to perform this task by training on a large amount of text data (i.e., natural language text or code) and use various techniques (e.g., attention mechanism [12]) to learn to focus on certain parts of the input for better output prediction. Language models have shown excellent proficiency in NLP tasks, as well as good results for programming language tasks such as code generation, code suggestion, natural language querying for code, etc.

Refer to caption — Figure 1: LLM chat input format. LLMs operate on a three-part input format: (1) a system prompt, (2) few-shot examples presented as chat history to guide the model’s learning, and (3) the specific user input/task to be processed.

The recent drastic increase in the number of parameters of models has enabled several remarkable capabilities, the most prominent of which being zero-shot and few-shot learning [13, 14]. LLMs are typically prompted (i.e., queried) by the user and provide a response—these advances enable the prompt to provide new knowledge or instructions that the model was not trained over.

Approaches like instruction-tuning ‘teach’ LLMs how to follow instructions in their prompt responses, and reinforcement learning from human feedback is used to ‘teach’ them how to answer, converse, and reason like humans. This has led to the creation of several chat-based LLMs, which can interact conversationally with human inputs (see Figure 1). The chat-based LLMs can be prompted using various techniques:

•

In Zero-Shot (ZS) scenario, the user asks a model to perform a task that the model might not have observed during pre-training.
•

In Few-Shot (FS) scenario, the user adds a few examples demonstrating input space and expected output to perform a specific task (as shown in Figure 1).
•

Task-Oriented (TO) scenarios explicitly assign a task to the model (either in ‘system’ or ‘user’ prompt) in the form of a statement or a question, which encourages the model to generate a task-specific response.
•

Role-Oriented (RO) scenarios assign a role to the model, e.g., helpful assistant, security expert, etc., and the model implicitly understands the expected behavior. This role is mostly assigned in the ‘system’ prompt.

Table I shows the details of the currently most capable chat-based LLMs that we investigate in this paper.

TABLE I: Studied LLMs. We select a number of capable chat-based LLMs, both Remote and Local, for our study, with diverse ranges of number of parameters, max. input tokens limits, and different training knowledge cut-offs. Oct. 7, 2023 is the date of access for all LLMs.

Model API	Base Model	# Params	Max. Tokens	Type	Knowledge Cut Off
gpt-4	GPT-4	1.76T	8,192	Rem	09/2021
gpt-3.5-turbo-16k	GPT-3.5	175B	16,385	Rem	09/2021
codechat-bison@001	PaLM2	340B	6,144	Rem	mid-2021
chat-bison@001	PaLM2	340B	6,144	Rem	mid-2021
codellama-7b-instruct	Llama2	7B	100k	Loc	01/2022
codellama-13b-instruct	Llama2	13B	100k	Loc	01/2022
codellama-34b-instruct	Llama2	34B	100k	Loc	01/2022
starchat-beta	StarCoder+	15.5B	8,192	Loc	09/2022

Vulnerability Detection. One of the biggest challenges during the development and maintenance of software systems is the process of detecting security bugs and identifying their root cause [6]. OWASP [15] presents a list of static and run-time analysis tools and techniques that work in this space. Most of these tools transform source code into a specific representation, e.g., abstract syntax tree, program dependency graph, code property graph, etc., and scan them to identify pre-defined insecure patterns. For instance, Yamaguchi et al. introduce Joern, a tool that uses code property graphs [16] to identify several vulnerability patterns in C/C++ code.

Unfortunately, these techniques require considerable manual effort and curation of security bug datasets, especially if they are based on trained supervised Machine Learning (ML) models. Examples of methods that use supervised ML include VulChecker [17], VulDeePecker [18], Poster [19], and SySeVR [20]. As the ratio of vulnerable to non-vulnerable examples is very low in the real world, the vulnerable examples in these datasets are not sufficient for ML models to learn the necessary information. Suggested improvements in this category include using pre-trained LLMs for code and ‘fine-tuning’ them on security bug datasets for the downstream task of vulnerability detection. Notable approaches here are UniXCoder [21], VulBERTa [22], and CoText [23].

Evaluating for Code Security. Evaluating approaches for identifying security weaknesses in code requires testing their performance on multiple axes. For instance, what are their accuracy and false positive rate (FPR)? Are reasons/root causes for a vulnerability provided by the tool, and if so, at what quality? Is the tool robust to noise in testing data? How many types of vulnerabilities can it detect? Static analysis tools have always struggled with the trade-off between accuracy and coverage. Tools either focus on high accuracy and low coverage (e.g., Pysa [24] by Meta, which identifies data-flow issues in Python applications), or (b) low accuracy (false positives) and high coverage (e.g., [25, 26, 27]), which can lead to developer frustration and wasted time.

ML-based static analysis tools [28, 18] not only face the accuracy-coverage trade-off but also struggle with robustness. Such tools have demonstrated good performance on research datasets but can fail to generalize and perform well over real-world datasets [29]. Risse et al. [30] identify the same issue of non-robustness in LLM-based vulnerability detection tools [23, 22, 31], and showcase the drop in accuracy with even trivial code augmentations. Moreover, Microsoft’s leader-board for LLM-based vulnerability detection tools [32] shows that the even the best tool has an accuracy of less than $70\%$ and $30\%+$ FPR, which shows that these systems cannot be trusted in real-world cases.

TABLE II: Evaluation studies for code security. This table describes if the evaluation is ( semi or fully) Automated, Evaluates Reasoning (root cause) or only performs binary evaluation , tests Code-level Robustness, evaluates both or just one of the Vulnerable and Patched code scenarios, evaluates on Real-World CVEs or github mined potential defect commits [33] , and # of Code Scenarios included in the study, which are not generated by synthetic AI methods or labeled using out-dated research tools.

Study	Auto	Eval Reas.	Code Robust	Vuln- Patch	Real World	# Scen.
Vuln. Code Gen. [8]						89
Vuln. Code Repair [7]						syn.
Limits of ML [30]						res.
Transf. Vuln. Det. [34]						res.
SecLLMHolmes						228

Recent works have evaluated LLMs for vulnerable code generation [8], code repair [7], and vulnerability detection [34]. However, these approaches are either limited by the number of LLMs, coverage of vulnerabilities, diversity of prompts, range of code complexities, robustness testing, or require manual labor for evaluating LLMs. Most importantly, these studies only evaluate binary responses, checking whether the LLM gives the right label to code snippets (e.g., ‘vulnerable’ or ‘not vulnerable’). In this paper, we present the first comprehensive evaluation framework to evaluate LLMs on the task of vulnerability detection, providing a multi-faceted analysis of the capabilities of LLMs, including going beyond binary decisions and evaluating their reasoning abilities. Table II summarizes evaluation studies of LLMs, highlights their short-comings, and compares against our framework.

In addition to presenting a new state-of-the-art framework, our study uncovers previously unidentified challenges associated with LLMs, including their inability to understand complex code data-flows, the fragility of their COT or step-by-step reasoning to even trivial augmentations, a bias towards security-related functions and variable names that overlooks actual vulnerabilities, a reasoning process that diverges from the methodology of real-world human security experts, particularly in accurately identifying correct root cause of a vulnerability.

3 SecLLMHolmes

Figure 2 presents an overview of our fully automated framework, SecLLMHolmes, which is designed to be applicable to any chat-based LLM. Section 3.1 describes how users can configure an LLM for integration. The core of our framework consists of five pre-defined key components: (i) a set of parameters (Section 3.2), (ii) an extensive set of prompt templates (Section 3.3), (iii) datasets (Section 3.4), and (iv) code augmentations (Section V), all of which facilitate the generation of test prompts. Each prompt is then passed to the LLM to generate a response, and the quality of the response is then analyzed by the (v) ‘Evaluator’ module (Section 3.6).

3.1 LLM configuration

To scale SecLLMHolmes for any chat-based LLM, we have established three configurable user inputs:

LLM-Specific Best ‘Prompting Practices’ and Rules. As each LLM may be tuned differently with respect to instruction following, this configurable input enables users to configure optimal prompting practices for the specific LLM they are integrating. This may be done following the LLM’s documentation. For example, OpenAI’s GPT documentation recommends the use of three quotes before and after the given content to separate it from the instruction [35], while Google’s PaLM2 documentation recommends the use of keywords before the content that describe its semantics such as ‘Code,’ ‘Text,’ ‘Question’ [36].

LLM ‘Initialization and Configuration.’ All LLMs require an initial setup; for example, remote models may require an API key (e.g. OpenAI) or to initialize a session with a specific project in the cloud (e.g. Google). Local models require loading the model, tokenizer checkpoints, etc. This input allows user to specify such configuration details.

LLM ‘Chat Structure and Inference Function.’ To generate a response, each chat-based LLM receives three inputs: ‘system,’ ‘few-shot examples,’ and ‘task’ prompts (as shown in Figure 1). In this function, users specify how these three inputs are passed to the configured LLM (via API for remote models or ‘text-generation’ pipeline for local models).

3.2 LLM Parameters

LLM responses are significantly impacted by two parameters which control token sampling: (1) temperature controls the determinism of an LLM’s output. A higher value of temperature generates more ‘creative’/‘random’ text by adding noise to potential token scores (2) top_p controls the nucleus sampling, where the LLM considers the results of the tokens with top_p probability mass. We discuss how we select the values of these parameters in Section 4.

3.3 Prompt Templates

TABLE III: Prompt templates.

ID	Type	Description
S1	ZS-TO	Code snippet is added to the input prompt with a question about a specific Common Weakness Enumeration (CWE) (e.g., out-of-bound write, path traversal).
S2	ZS-RO	Same as S1, but the LLM is assigned the role of a ‘helpful assistant’.
S3	ZS-RO	Similar to S1, with the LLM acting as a ‘security expert’.
S4	ZS-RO	The LLM is defined as a ‘security expert’ who analyzes a specified security vulnerability, without the question being added to the input prompt.
S5	FS-TO	Similar to S1, but includes a vulnerable example, its patch, and standard reasoning from the same CWE.
S6	FS-RO	Like S4, but also includes a vulnerable example, its patch, and standard reasoning from the same CWE.
R1	ZS-TO	Similar to S1, but begins with ”Lets think step by step” [37] to encourage a methodical approach.
R2	ZS-RO	The LLM plays the role of a security expert with a multi-step approach to vulnerability detection, following a chain-of-thought reasoning.
R3	ZS-TO	A multi-round conversation with the LLM, starting with a code snippet and progressively analyzing sub-components for a security vulnerability like human security-experts.
R4	FS-RO	Similar to S6, but the reasoning for answers involves step-by-step analysis developed by the first author.
R5	FS-RO	Like R2, but includes few-shot examples (from the same CWE) with step-by-step reasoning for detecting vulnerabilities.
R6	FS-TO	Similar to R5, but does not assign a specific role to the LLM in the system prompt.
D1	ZS-TO	Adds the definition of a security vulnerability to the input prompt, followed by a related question.
D2	ZS-RO	The LLM is a security expert analyzing code for a specific vulnerability, with the vulnerability’s definition included.
D3	FS-RO	Similar to S6, but includes the definition of the security vulnerability in the system prompt.
D4	FS-RO	Like R4, with the addition of the security vulnerability’s definition in the system prompt.
D5	FS-TO	Similar to D4, but does not assign a specific role to the LLM in the system prompt.

SecLLMHolmes explores four techniques for LLM prompting: (1) zero-shot task-oriented (ZS - TO), (2) zero-shot role-oriented (ZS - RO), (3) few-shot task-oriented (FS - TO), and (4) few-shot role-oriented (FS - RO). Moreover, we divide our set of prompts in the following three categories (all prompts are described in Table III):

Standard (S) prompting asks the model to directly give an answer to the problem.

Step-by-Step Reasoning-based (R) prompting asks the model to solve the problem in a step-by-step manner using chain-of-thought (COT) reasoning [11, 38, 37]. In addition to evaluating the intrinsic step-by-step reasoning process of LLMs, we also create prompts that emulate the multi-step vulnerability detection method followed by human security-experts, as identified in prior qualitative studies [39, 40]. These studies observed that security experts follow a general multi-step vulnerability detection approach, i.e., (1) get an overview of the code, (2) based on the overview, identify the critical sub-components that can lead to a security vulnerability in code, e.g., copying user provided information into a buffer, etc., (3) perform a detailed analysis of these sub-components, e.g., if a user input is being copied into a buffer the security experts will check if in any scenario the user input can overflow the buffer, and then, (4) based on the detailed experiments, provide the final answer on whether the given code contains any instances of the given security vulnerability.

Definition-based (D) prompting provides additional information like the definition of a security vulnerability from MITRE’s official website [41] to the model, while asking the model to detect that vulnerability in the given code.

3.4 Datasets

We design 228 code scenarios (48 hand-crafted, 30 real-world, and 150 with code augmentations) to test various aspects of the capabilities of LLMs to detect software vulnerabilities in code. We use these scenarios to craft prompts by including code, examples, definitions, and step-by-step reasoning as shown in Table III. In the following, we describe how we built our code scenarios in detail.

Hand-Crafted CWE Scenarios. We curate a dataset of 48 hand-crafted code scenarios, containing vulnerable and patched pairs from 8 most critical and diverse Common Weakness Enumerations (CWEs) from the MITRE Top 25 Most Dangerous Software Weaknesses for the year 2023 [41], as shown in Table IV. To investigate the ability of LLMs to analyze multiple programming languages, we include examples from both C and Python.

TABLE IV: Hand-crafted dataset.

CWE ID	Description	MITRE Rank	Lang.
787	Out-of-bounds Write	1	C
79	Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)	2	Py
89	Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)	3	Py
416	Use After Free	4	C
22	Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)	8	C
476	NULL Pointer Dereference	12	C
190	Integer Overflow or Wraparound	14	C
77	Improper Neutralization of Special Elements used in a Command (‘Command Injection’)	16	C

Similar to previous work [8], we create six code scenarios (three pairs of vulnerable and patched scenarios) for each CWE. Moreover, we design our code scenarios with three difficulty levels, (1) easy, (2) medium, (3) hard. Code scenario ‘ $2_{v}$ ’ in a specific CWE represents the vulnerable scenario with ‘medium’ difficulty level and ‘ $2_{p}$ ’ its patch. The difficulty levels assess how LLMs interact with code of increasing complexity. Easy scenarios consist of simple programs containing only one function and less than 30 lines of code. Medium level scenarios increase the complexity by making the program longer, using different library functions, and adding more than one user input. Hard level introduces scenarios with multiple functions in which functions can be safe on an individual level but when they work together they make the program vulnerable.²²2Appendix A shows examples of all difficulty levels for ‘CWE-22’.

Real-World CVE Senarios. We leverage a set of real-world Common Vulnerabilities and Exposures (CVEs) from public open source projects to investigate if LLMs are able to identify vulnerabilities in them. Note that existing benchmarks for vulnerability detection [33, 18, 42, 43], cannot be used for this project, as they were released before the cut-off training date of current LLMs, and it is therefore likely that the models saw that data during training. To avoid this potential confounder, we curate 30 code scenarios containing vulnerable and patched versions of 15 CVEs from four open source projects, all published and fixed in 2023, after current LLMs were trained (see Table V).

TABLE V: Real-world CVEs and their details including Original and Truncated Lines of Code (LoC).

Project	CVE ID	CWE Description (ID)	Orig. LoC	Trun. LoC	Pub. Date (2023)	Fix Date (2023)
gpac	2023-1452	Out-of-Bound Write (787)	4.5k	243	Mar	May
	2023-3012	NULL Pointer Deref (476)	2.5k	398	May	Nov
	2023-23143	Out-of-Bound Write (787)	12.3k	117	Jan	May
	2023-23144	Integer Overflow (190)	439	389	Jan	May
libtiff	2023-2908	NULL Pointer Deref (476)	2.3k	629	Jan	Nov
	2023-3316	NULL Pointer Deref (476)	159	159	Jan	Nov
	2023-26966	Out-of-Bound Write (787)	1.8k	238	Jun	Nov
	2023-40745	Integer Overflow (190)	2.2k	757	Oct	Nov
	2023-41175	Integer Overflow (190)	779	748	Oct	Nov
linux	2023-40283	Use-After-Free (416)	1.9k	515	Aug	Nov
	2023-42753	Integer Overflow (190)	628	623	Sept	Nov
	2023-42754	NULL Pointer Deref (476)	3.7k	177	Oct	Nov
	2023-45863	Out-of-Bound Write (787)	1.1k	565	Oct	Nov
	2023-45871	Out-of-Bound Write (787)	10.1k	386	Oct	Nov
pjsip	2023-27585	Out-of-Bound Write (787)	784	737	Mar	Aug

As the length of the code increases significantly for the real-world code scenarios, it can exceed the maximum number of tokens a model can take as an input. To solve this problem, we shorten code files by removing comments and functions that are neither called by nor call the vulnerable (or patched) function. Also to maintain fairness among the LLMs, we make sure that all CVEs after truncation have a number of tokens less than or equal to 6,144, which is the maximum token limit supported by Palm2, and the lowest among all LLMs in our study.

Code Augmentations. While standard frameworks exist to evaluate the robustness of LLMs for NLP tasks [44, 45], there is no standard framework to evaluate the robustness on code security related tasks. To fill this gap, we design a set of 150 augmented code scenarios, meticulously crafted and reviewed to preserve the ability of human security experts to identify vulnerabilities. These augmentations are organized in two distinct categories:

1.

Trivial Augmentations. This class of code augmentations measure the robustness of LLMs to random noise. We select 12 CWE scenarios from two classes, i.e., CWE-787 (C) (#1 MITRE) and CWE-89 (Py) (#3 MITRE) of our hand-crafted dataset, and apply seven trivial augmentations (Table VI) on them and create a total of 84 different code scenarios (12 per augmentation). We choose these two CWEs as (1) they can lead to the most catastrophic impacts like root privilege escalation, and data loss, etc, (2) they belong to two completely different levels of abstractions i.e., “lower-level” (C) and “higher-level” (Python) languages, and (3) the presence of their instances can be determined directly from code without any need for external information.
2.

Non-Trivial Augmentations. We design non-trivial code augmentations to perform stress-tests on LLMs to measure their robustness and bias towards semantics of function or variable names, specific library functions, or code security practices. We use combinations of all CWEs defined in Table IV and the six non-trivial code augmentations from Table VI and design 66 code scenarios (12 per NT1-NT4 and 9 per NT5 and NT6) ³³3See Appendix LABEL:sec:app-robust for details on creation of non-trivial augmented code scenarios.

TABLE VI: Code Augmentations.

Trivial		Non-Trivial
ID	Description	ID	Description
T1	Rename function parameters randomly	NT1	Change variable names to vulnerability related keywords
T2	Rename function randomly	NT2	Change the name of a safe function to ‘vulnerable’ function
T3	Add random unreachable code	NT3	Change the name of an unsafe function to ‘non_vulnerable’ function
T4	Add random code in comments	NT4	Add a potentially dangerous library function (e.g., ‘strcpy’ or ‘strcat’) but use it in a safe way
T5	Insert whitespaces	NT5	Use sanitizing functions (e.g., ‘realpath’) in vulnerable code but in a way that it does not resolve the vulnerability
T6	Add a useless function	NT6	Add hash-defined expressions for safe functions names (e.g., ‘fgets’) but add vulnerable library functions in its body (e.g., ‘gets’)
T7	Add next-line character

3.5 Ground-Truth Reasoning $G_{r}$

In addition to ground truth labels indicating if a code snippet contains a vulnerability, we also need explanations for these vulnerabilities, as we aim to evaluate the reasoning capabilities of LLMs and assess whether they can justify their decisions. To this end, we randomly sample 48 code scenarios out of the 228 total scenarios, and have three security experts, including the first author of the paper equally divide the sampled code scenarios amongst each other and create a 100-word ground truth reason for each scenario using MITRE’s official CWE documentation [41] as a guide. The experts then compare and discuss each others’ reasoning and develop consensus for each ground-truth reasoning (Fleiss’ kappa with $K=0.93$ , meaning almost perfect agreement [46]). After establishing that the criteria for ground truth reasoning are well laid out and understood, the first author proceeds to develop the remaining 180 vulnerability explanations. This ground truth reasoning $G_{r}$ is then used by the ‘Evaluator’ module in the next step.

3.6 Evaluator

The output of an LLM for a specific test is passed to the Evaluator (see Figure 2). As SecLLMHolmes is fully automated, we leverage GPT-4 to analyze the response. First, the response is passed to GPT-4 ⁴⁴4Our manual evaluation shows that GPT-4 performs the best for this extraction process., with an additional role-based instruction prompt $P_{e}$ (shown in Figure 7 in the Appendix) in the ‘system’ input to extract two pieces of information from the raw response. The first one is the binary answer, which is “yes/no” based on whether the LLM found a vulnerability in the given code or not. We find that in some cases the LLM does not provide a definite answer, therefore we include a third verdict, i.e., “n/a.”⁵⁵5For example, in some cases LLMs provide some variation of the following response: “As an AI model I cannot answer this question.” The second part of the information extracted is the textual reasoning provided by the LLM ( $P_{r}$ ). To extract it, we ask GPT-4 to summarize the reason described by the model output on why a vulnerability is present in the code or not, in 100 words. We extract the summary of root cause of vulnerability from raw responses of LLMs to maintain consistency for further evaluation methods, and to avoid contents like suggestions, fixed code, etc. to be analyzed. In the rest of this section we provide more details on the evaluation metrics used to evaluate the final answer, and its reasoning provided by LLM for the vulnerability detection task.

Accuracy Score. To evaluate the accurate detection of a vulnerability in source code, we compare the answer extracted from the response of the LLM (i.e., “yes/no/n/a”) with the ground truth labels and use the “accuracy” metric, i.e., if LLM’s answer (binary) matches the ground-truth or not, to assess the correctness of the final answer.

Reasoning Score. Automatically evaluating if the textual reasoning provided by the LLM on whether a vulnerability is present is a challenging task. To solve this task we use GPT-4 to summarize the reason ( $P_{r}$ ) provided by an LLM and compare it with the ground-truth reasoning ( $G_{r}$ ) generated by the authors, using the following three metrics:

1.

Rouge [47] score is traditionally used in NLP to measure the similarity between a machine-translated summary and reference summaries using overlapping n-grams. In our case, we use it to measure similarity between the summaries $P_{r}$ and $G_{r}$ . We first sample $50$ pairs of $P_{r}$ and their corresponding $G_{r}$ , as our reasoning score validation set ( $R_{val}$ ), and manually check the consistency and alignment of their reasonings. We find that at the optimal threshold $Rouge_{thres}$ of $0.34$ , $43$ out of $50$ $P_{r}$ are consistent with $G_{r}$ . We therefore use this threshold and mark two summaries as similar if their Rouge score exceeds it.
2.

Cosine Similarity is a metric commonly used in NLP to measure how similar two documents are irrespective of their sizes. We first convert the summaries $P_{r}$ and $G_{r}$ into fixed length vectors using OpenAI’s embedding model ‘text-similarity-davinci-001’ and calculate the cosine similarity between them. Similar to $Rouge_{thres}$ , we find that the optimal threshold $Cos_{thres}$ is $0.84$ , and consider two summaries similar if their cosine similarity exceeds this value.
3.

GPT-4 is prompted to evaluate if the reasoning in $P_{r}$ and $G_{r}$ align. If the reasonings are similar and they align with each other, GPT-4 responds ‘yes’ and we assign a reasoning score of $1$ , otherwise we assign $0$ . We find that GPT-4 successfully classifies $48$ out of $50$ $P_{r}$ correctly to their corresponding $G_{r}$ , when validated on $R_{val}$ .

We then determine whether the reasoning by the LLM is correct or not by majority vote. That is, if two or more of the above criteria match, we consider the reasoning as similar to the ground truth reasoning $G_{r}$ .

4 Experimental Investigation

In this section, we use our framework to investigate all LLMs listed in Table I.⁶⁶6We only report the results for the five best performing LLMs in the main body of the paper. The ones for the remaining three LLMs can be found in the Appendix. We first investigate which values of the LLM parameters are most likely to produce consistent (Section 4.1) and best performing (Section 4.2) output. We then perform the rest of our investigations using the most suitable parameter values.

4.1 Evaluation for Deterministic Responses

To perform a rigorous comparison between LLMs and assess their capabilities, it is of critical importance that their responses are consistent, meaning that running the same test multiple times under identical parameters should provide the same final verdict. Therefore, we first investigate whether this consistency is achievable at all, and what LLM parameters deliver the most consistent results. OpenAI’s documentation [48] recommends a temperature of $0.2$ and ‘top_p’ of $0.1$ to achieve the most deterministic output for code related tasks. Similarly, the recommended ‘temperature’ value for all LLMs in our evaluation is $0.2$ . When experimenting with modifying these values, both the OpenAI documentation [49] and previous research [7] recommend to keep the value of ‘top_p’ constant and modify the value of ‘temperature.’ We therefore fix ‘top_p’ to default value specific to an LLM, and perform experiments using two different ‘temperature’ values: $0.2$ (‘default’) and $0.0$ . We perform this experiment on two vulnerable and two patched medium code difficulty level scenarios ( $2_{v}$ and $2_{p}$ ) from two distinct vulnerabilities, “out-of-bound write” (CWE-787) in C and “SQL injection” (CWE-89) in Python (for the same reasons as discussed in Section V). For the input prompts we select the set of Standard prompts (see Section 3.3). We run each experiment ten times, and record how many times the model provides the same answer. We consider a model to be consistent if it always provides the same binary answer, irregardless of whether it is correct.

TABLE VII: Evaluation Results for LLM Output Consistency at Recommended Temperature. The table shows results for each CWE scenario and every Standard prompt, in the format of # correctly answered / # total answered out of 10.

(a) CWE-787

	S1		S2		S3		S4		S5		S6
Models	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$
chat-bison	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	2/10	8/10	10/10	0/10
codechat-bison	9/10	0/10	10/10	0/10	10/10	0/10	0/10	9/10	0/10	10/10	0/10	10/10
codellama34b	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	5/10	7/10
gpt-3.5	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	10/10	0/10
gpt-4	0/10	10/10	6/10	10/10	1/10	10/10	6/10	10/10	0/10	10/10	7/10	10/10

(b) CWE-89

	S1		S2		S3		S4		S5		S6
Models	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$
chat-bison	10/10	10/10	10/10	8/10	10/10	10/10	10/10	0/10	10/10	10/10	0/10	10/10
codechat-bison	10/10	10/10	10/10	2/10	10/10	7/10	10/10	0/10	10/10	10/10	10/10	10/10
codellama34b	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	10/10	10/10	0/10
gpt-3.5	10/10	10/10	10/10	10/10	10/10	10/10	10/10	1/10	10/10	10/10	10/10	7/10
gpt-4	10/10	10/10	10/10	10/10	10/10	0/10	10/10	10/10	10/10	10/10	10/10	5/10

TABLE VIII: Evaluation Results for LLM Output Consistency at Temperature =

0.0

(a) CWE-787

	S1		S2		S3		S4		S5		S6
Models	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$
chat-bison	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	0/10	10/10	9/10	0/10
codechat-bison	10/10	0/10	10/10	0/10	10/10	0/10	0/10	10/10	0/10	10/10	0/10	10/10
codellama34b	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	0/10	10/10
gpt-3.5	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	10/10	0/10
gpt-4	0/10	10/10	8/10	10/10	2/10	10/10	4/10	9/10	0/10	10/10	4/10	10/10

(b) CWE-89

	S1		S2		S3		S4		S5		S6
Models	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$	$2_{v}$	$2_{p}$
chat-bison	10/10	10/10	10/10	0/10	10/10	10/10	10/10	0/10	10/10	10/10	10/10	10/10
codechat-bison	10/10	10/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	10/10	10/10	0/10
codellama34b	10/10	0/10	10/10	0/10	10/10	0/10	10/10	0/10	10/10	10/10	10/10	0/10
gpt-3.5	10/10	10/10	10/10	10/10	10/10	10/10	10/10	0/10	10/10	10/10	10/10	10/10
gpt-4	10/10	10/10	10/10	10/10	10/10	0/10	10/10	10/10	10/10	10/10	10/10	9/10

Observations. Table VII(b) shows that all LLMs provide inconsistent responses for one or more of the tests at the recommended ‘temperature’ value of $0.2$ . ‘codechat-bison@001’ even provides a wrong answer with the most basic ‘S1’ prompt (as shown in Figure LABEL:fig:codechat-bison-inconsistent). This suggests that the default ‘temperature’ is not a good choice to evaluate LLMs for vulnerability detection. Using $0.0$ as temperature improves consistency, as shown in Table VIII(b): ‘codechat-bison@001,’ ‘codellama34b,’ and ‘gpt-3.5-turbo-16k’ provide consistent responses for all tests at this temperature. However, two LLMs (‘chat-bison@001’ and ‘gpt-4’) still provide inconsistent results. Based on these results, we find that $0.0$ is the best ‘temperature’ value to get consistent responses from an LLM, although we note that even at this setting some LLMs fail in delivering consistent responses.

[Uncaptioned image] — TABLE IX: Evaluation of LLMs Over a Range of Temperature Values (CWE-787). The table shows results for each temperature value in the format of # correct / # total answered out of 10.

		T1		T2		T3		T4		T5		T6		T7
M	PS	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$
c-bison	S1S	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12
	R2ZS	2/12	2/12	1/12	1/12	0/12	0/12	1/12	1/12	2/12	3/12	0/12	0/12	0/12	1/12
	S6FS	2/12	3/12	2/12	2/12	2/12	4/12	1/12	3/12	2/12	2/12	2/12	2/12	2/12	1/12
cc-bison	S1S	0/12	2/12	0/12	3/12	0/12	3/12	0/12	0/12	0/12	2/12	0/12	2/12	0/12	1/12
	R2ZS	0/12	0/12	0/12	1/12	0/12	0/12	3/12	3/12	2/12	2/12	0/12	2/12	1/12	1/12
	R4FS	0/12	0/12	0/12	0/12	0/12	0/12	1/12	1/12	0/12	0/12	1/12	1/12	0/12	0/12
c.lla.34b	S1S	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12
	S1ZS	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12
	S5FS	0/12	0/12	3/12	3/12	3/12	3/12	2/12	3/12	0/12	0/12	1/12	1/12	2/12	2/12
gpt-3.5	S1S	0/12	0/12	0/12	0/12	2/12	3/12	1/12	1/12	0/12	0/12	1/12	2/12	0/12	0/12
	R2ZS	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12
	R4FS	1/12	1/12	1/12	1/12	1/12	1/12	1/12	1/12	2/12	2/12	0/12	0/12	1/12	1/12
gpt-4	S1S	0/12	0/12	0/12	2/12	0/12	2/12	0/12	0/12	0/12	0/12	0/12	1/12	0/12	0/12
	R2ZS	2/12	1/12	1/12	0/12	3/12	2/12	2/12	1/12	1/12	0/12	2/12	1/12	1/12	1/12
	R6FS	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12	0/12

		NT1		NT2		NT3		NT4		NT5		NT6
M	PS	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$	$\Delta_{a}$	$\Delta_{r}$
c-bison	S1S	0/12	0/12	1/12	0/12	8/12	7/12	0/12	1/12	0/9	2/9	0/9	1/9
	R2ZS	3/12	2/12	2/12	2/12	4/12	4/12	0/12	6/12	0/9	2/9	0/9	3/9
	S6FS	0/12	0/12	5/12	5/12	8/12	8/12	0/12	0/12	4/9	4/9	2/9	1/9
cc-bison	S1S	0/12	2/12	3/12	1/12	9/12	9/12	1/12	4/12	1/9	4/9	1/9	0/9
	R2ZS	1/12	2/12	4/12	4/12	7/12	6/12	1/12	1/12	0/9	0/9	0/9	0/9
	R4FS	0/12	0/12	5/12	6/12	1/12	1/12	0/12	0/12	6/9	6/9	0/9	0/9
c.lla.34b	S1S	1/12	0/12	4/12	4/12	9/12	9/12	1/12	3/12	1/9	2/9	1/9	0/9
	S1ZS	1/12	0/12	4/12	4/12	9/12	9/12	1/12	5/12	1/9	1/9	1/9	0/9
	S5FS	0/12	0/12	3/12	3/12	3/12	3/12	1/12	5/12	0/9	0/9	1/9	0/9
gpt-3.5	S1S	1/12	0/12	1/12	2/12	1/12	1/12	2/12	2/12	0/9	0/9	2/9	1/9
	R2ZS	0/12	0/12	2/12	2/12	0/12	0/12	2/12	2/12	3/9	3/9	0/9	0/9
	R4FS	0/12	0/12	3/12	4/12	3/12	3/12	0/12	4/12	3/9	3/9	3/9	1/9
gpt-4	S1S	0/12	2/12	1/12	3/12	0/12	0/12	2/12	7/12	0/9	0/9	2/9	1/9
	R2ZS	0/12	0/12	0/12	0/12	3/12	3/12	0/12	2/12	0/9	0/9	0/9	1/9
	R6FS	0/12	0/12	3/12	3/12	0/12	0/12	1/12	5/12	5/9	5/9	1/9	1/9

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?):
A Comprehensive Evaluation, Framework, and Benchmarks

Abstract

1 Introduction

2 Background and Related Work