
Are We Skillful or Just Lucky? Interpreting the Possible Histories of Vulnerability Disclosures

Published: 07 February 2022

Abstract

Coordinated Vulnerability Disclosure (CVD) stands as a consensus response to the persistent fact of vulnerable software, yet few performance indicators have been proposed to measure its efficacy at the broadest scales. In this article, we seek to fill that gap. We begin by deriving a model of all possible CVD histories from first principles, organizing those histories into a partial ordering based on a set of desired criteria. We then compute a baseline expectation for the frequency of each desired criterion and propose a new set of performance indicators to measure the efficacy of CVD practices based on the differentiation of skill and luck in observation data. As a proof of concept, we apply these indicators to a variety of longitudinal observations of CVD practice and find evidence of significant skill to be prevalent. We conclude with reflections on how this model and its accompanying performance indicators could be used by various stakeholders (vendors, system owners, coordinators, and governments) to interpret the quality of their CVD practices.

1 Introduction

The practice of Coordinated Vulnerability Disclosure (CVD) emerged as part of a growing consensus to develop normative behaviors in response to the persistent fact of vulnerable software. Yet while the basic principles of CVD have been established [13, 22, 23, 30], to date there has been limited work to measure the efficacy of CVD programs, especially at the scale of industry benchmarks.
ISO 29147 [23] sets out the goals of vulnerability disclosure:
(a) ensuring that identified vulnerabilities are addressed;
(b) minimizing the risk from vulnerabilities;
(c) providing users with sufficient information to evaluate risks from vulnerabilities to their systems;
(d) setting expectations to promote positive communication and coordination among involved parties.
Meanwhile, use of third-party libraries and shared code components across vendors and their products creates a need to coordinate across those parties whenever a vulnerability is found in a shared component. Multi-Party Coordinated Vulnerability Disclosure (MPCVD) is a more complex form of CVD, as illustrated by the Senate hearings about the Meltdown and Spectre vulnerabilities [31]. The need for MPCVD arises from the complexities of the software supply chain [18]. Nevertheless, the goals of CVD apply to MPCVD, as the latter is merely a special case of the former.
The difficulty of MPCVD derives from the diversity of its stakeholders: Software vendors have different development budgets, schedules, tempos, and analysis capabilities with which to isolate, understand, or fix vulnerabilities. Additionally, they face diverse customer support expectations and obligations, and an increasing variety of regulatory regimes governing some stakeholders but not others. For these and many other reasons, practitioners of MPCVD highlight fairness as a core difficulty in coordinating disclosures across vendors [22].
So with the goal of minimizing the societal harm that results from the existence of a vulnerability in multiple products spread across multiple vendors, our motivating question is “What does fair mean in MPCVD?”. Optimizing MPCVD directly is not currently possible, as we lack a utility function to map from the events that occur in a given case to the impact that case has on the world. While this article does not fully address the problem, it sets out a number of steps toward a solution. We seek a way to sort MPCVD cases into better or worse outcomes. Ideally, the sorting criteria should be agreed upon based on unambiguous principles that are intelligible to all interested parties. Furthermore, we seek a way to measure relevant features across MPCVD cases. Feature observability is a key factor: Our measurement needs to be simple and repeatable without relying too heavily on proprietary or easily hidden information.
While a definition of fair in MPCVD is a responsibility for the broader community, we focus on evaluating the skill of the coordinator. We expect this contributes to fairness based on the EthicsfIRST principles of ethics for incident response teams promoted by Forum of Incident Response and Security Teams (FIRST) [39].1 To that end, our research questions are:
RQ1: Construct a model of CVD states amenable to analysis and also future generalization to MPCVD.
RQ2: What is a reasonable baseline expectation for ordering of events in the model of CVD?
RQ3: Given this baseline and model, does CVD as observed “in the wild” demonstrate skillful behavior?
This article primarily focuses on the simpler case of CVD. This focus provides an opportunity for incremental analysis of the success of the model; MPCVD modeling can follow in future work.

1.1 Contributions

The contributions of this article are as follows:
We define a simple yet comprehensive model of possible disclosure histories in Section 3 and a set of criteria to order them in Section 4, which will address RQ1.
We explore the implications of our model with respect to expected outcomes in Section 5, which will address RQ2.
We propose a method for measuring the relative contribution of both skill and luck to observations of CVD outcomes over time in Section 6.
We demonstrate the application of these techniques to analyze the efficacy of observed CVD processes in Section 7, which will address RQ3.
A discussion of how the model could be applied to benchmarks and multiparty coordination follows in Section 8. Section 9 describes the limitations of the approach and lays out future work to improve it. Section 10 surveys related work, and Section 11 summarizes and concludes.

2 Events in a Vulnerability Lifecycle

The goal of this section is to establish a model of events that affect the outcomes of vulnerability disclosure.
Our model builds on previous models of the vulnerability lifecycle, specifically those of Arbaugh et al. [1], Frei et al. [19], and Bilge et al. [8]. A more thorough literature review of vulnerability lifecycle models can be found in [27].
Because we are modeling only the disclosure process, we assume the vulnerability both exists and is known to at least someone. Therefore, we ignore the birth (creation, introduced) and discovery states as they are implied at the beginning of all possible vulnerability disclosure histories. We also omit the anti-virus signatures released event from [8] since we are not attempting to model vulnerability management operations in detail.
The first event we are interested in modeling is Vendor Awareness (\(V\)). This event corresponds to Disclosure in [1] and vulnerability discovered by vendor in [8] (this event is not modeled in [19]). We are not concerned with how the vendor came to find out about the vulnerability’s existence, whether it was found via internal testing, reported by a security researcher, or noticed as the result of incident analysis.
The second event we include is Public Awareness (\(P\)) of the vulnerability. This event corresponds to Publication in [1], time of public disclosure in [19], and vulnerability disclosed publicly in [8]. The public might find out about a vulnerability through the vendor’s announcement of a fix, a news report about a security breach, a conference presentation by a researcher, by comparing released software versions as in [40, 41], or any of a variety of other means. As above, we are primarily concerned with the occurrence of the event itself rather than the details of how the \(P\) event arises.
The third event we address is Fix Readiness (\(F\)), by which we refer to the vendor’s creation and possession of a fix that could be deployed to a vulnerable system, if the system owner knew of its existence. Here we differ somewhat from [1, 8, 19] in that their models address the release of the fix rather than its readiness for release.
The reason for this distinction will be made clear, but first we must mention that Fix deployed (\(D\)) is simply that: the fix exists, and it has been deployed.
We chose to include the Fix Ready (\(F\)), Fix Deployed (\(D\)), and Public Awareness (\(P\)) events so that our model could better accommodate two common modes of modern software deployment:
shrinkwrap —The traditional distribution mode in which the vendor and deployer are distinct entities and deployers must be made aware of the fix before it can be deployed. In this case, which corresponds to the previously mentioned fix release event, both fix readiness (\(F\)), and public awareness (\(P\)) are necessary for the fix to be deployed (\(D\)).
SaaS—A more recent delivery mode in which the vendor also plays the role of deployer. In this distribution mode, fix readiness (\(F\)) can lead directly to fix deployed (\(D\)) with no dependency on public awareness (\(P\)).
We note that so-called silent fixes by vendors can sometimes result in a fix being deployed without public awareness even if the vendor is not the deployer. Thus, it is possible, albeit unlikely, for \(D\) to occur before \(P\) even in the shrinkwrap case above. It is also possible, and somewhat more likely, for \(P\) to occur before \(D\) in the SaaS case as well.
We diverge from [1, 8, 19] again in our treatment of exploits and attacks. Because attacks and exploit availability are often discretely observable events, the broader concept of exploit automation in [1] is insufficiently precise for our use. Both [8, 19] focus on the availability of exploits rather than attacks, but the observability of their chosen events is hampered by attackers’ incentives to maintain stealth. Frei et al. [19] use exploit availability, whereas Bilge et al.[8] call it exploit released in wild. Both refer to the state in which an exploit is known to exist, but this can arise for at least two distinct reasons, which we wish to discriminate:
exploit public (\(X\))—when the method of exploitation for a vulnerability has been made public in sufficient detail to be reproduced by others. Proof of concept (POC) code posted to a widely available site or inclusion of the exploit in a commonly available exploit tool meets this criterion, whereas privately held exploits do not.
attacks observed (\(A\))—when the vulnerability has been observed to be exploited in attacks. In this case, one has evidence that the vulnerability has been exploited and can presume the existence of an exploit regardless of its availability to the public. Analysis of malware from an incident might meet \(A\) but not \(X\) depending on how closely held the malware is thought to be to the attacker. Use of an already public exploit in an attack meets both \(X\) and \(A\).
Therefore, while we acknowledge that a hidden exploit exists event is a causal predecessor of both exploit public and attacks, our model asserts no causal relationship between \(X\) and \(A\). We make this choice in the interest of observability. The exploit exists event is difficult to consistently observe independently of the two events we have chosen to use; its occurrence is nearly always inferred from the observation of either exploit public or attacks.
A summary of this model comparison is shown in Table 1. Further discussion of related work can be found in Section 10.
Table 1.
Arbaugh et al. [1] | Frei et al. [19] | Bilge et al. [8] | Our Model
Birth | creation (\(t_{creat}\)) | introduced (\(t_c\)) | (implied)
Discovery | discovery (\(t_{disco}\)) | n/a | (implied)
Disclosure | n/a | discovered by vendor (\(t_d\)) | Vendor Awareness (\(V\))
n/a | patch availability (\(t_{patch}\)) | n/a | Fix Ready (\(F\))
Fix Release | n/a | patch released (\(t_p\)) | \(F \mathrm{~and~} P\)
Publication | public disclosure (\(t_{discl}\)) | disclosed publicly (\(t_0\)) | Public Awareness (\(P\))
n/a | patch installation (\(t_{insta}\)) | patch deployment completed (\(t_a\)) | Fix Deployed (\(D\))
Exploit Automation | exploit availability (\(t_{explo}\)) | exploit released in wild (\(t_e\)) | n/a
Exploit Automation | n/a | n/a | Exploit Public (\(X\))
Exploit Automation | n/a | n/a | Attacks (\(A\))
n/a | n/a | anti-virus signatures released (\(t_s\)) | n/a
Table 1. Vulnerability Lifecycle Events: Comparing Models

2.1 Definitions and Notation

Before we discuss either possible histories (Section 3) or desirable histories (Section 4) in the vulnerability life cycle, we need to formally define our terms. In all these definitions, we assume standard Zermelo–Fraenkel set theory. The concept of sequences extends set theory to include ordered sets. From these, we adopt the following notation:
\(\lbrace \dots \rbrace\) An unordered set, which makes no assertions about sequence.
The normal proper subset (\(\subset\)), equality (\(=\)), and subset (\(\subseteq\)) relations between sets.
\((\dots)\) An ordered set in which the events \(e\) occur in that sequence.
The precedes (\(\prec\)) relation on members of an ordered set: \(e_i \prec e_j \textrm { if and only if } e_i,e_j \in s \textrm { and } i \lt j\) where \(s\) is as defined in (2).
From Table 1, we define the set of events \(E\)
\begin{equation} E \stackrel{\mathsf {def}}{=}\lbrace V,F,D,P,X,A\rbrace . \end{equation}
(1)
A sequence \(s\) is an ordered set of some number of events \(e_i \in E\) for \(1 \le i \le n\) and the length of \(s\) is \(|s| \stackrel{\mathsf {def}}{=}n\).
\begin{equation} s \stackrel{\mathsf {def}}{=}\left(e_1, e_2, \dots e_n \right). \end{equation}
(2)
A valid vulnerability coordination history \(h\) is a sequence \(s\) containing one and only one of each of the event types in \(E\); by definition \(|h| = |E| = 6\). Note this is a slight abuse of notation; \(|\textrm { }|\) represents both sequence length and the cardinality of a set
\begin{equation} h \stackrel{\mathsf {def}}{=}s : \forall e_i, e_j \in s \textrm { it is the case that } e_i \ne e_j \textrm { and } \forall e_k \in E \textrm { it is the case that } \exists e_i \in s \textrm { such that } e_k = e_i, \end{equation}
(3)
where two members of the set \(E\) are equal if they are represented by the same symbol and not equal otherwise. The set of all possible histories, \(S_H\), is a set of all the sequences \(h\) that satisfy this definition.

3 The Possible Histories of CVD

Given that a history \(h\) contains all six events \(E\) in some order, there are at most 720 (\(_{6} \mathrm{P}_{6} = 6! = 720\)) possible histories. That is, \(|S_H| = 720\). However, we can apply causal constraints as follows:
vendor awareness must precede fix ready (\(V \prec F\));
fix ready must precede fix deployed (\(F \prec D\)).
In symbols, this puts the following constraints on the possible set of histories, which we call \(H_0\)
\begin{equation} H_0 \stackrel{\mathsf {def}}{=}\lbrace h \in S_H \textrm { such that } \forall e \in h \textrm { it is the case that } V \prec F \textrm { and } F \prec D\rbrace . \end{equation}
(4)
We further impose two simplifying assumptions. The first is that vendors know at least as much as the public does. In other words, all histories must meet one of two criteria: either Vendor Awareness \(V\) precedes Public Awareness \(P\) or else Vendor Awareness must immediately follow it
\begin{equation} H_1 \stackrel{\mathsf {def}}{=}\lbrace h \in S_H \textrm { such that } \forall e \in h \textrm { it is the case that if } e_i = V \textrm { then either } e_i \prec P \textrm { or } e_{i-1} = P \rbrace . \end{equation}
(5)
The second is that the public can be informed about a vulnerability by a public exploit. Therefore, either public awareness precedes exploit public or must immediately follow it
\begin{equation} H_2 \stackrel{\mathsf {def}}{=}\lbrace h \in S_H \textrm { such that } \forall e \in h \textrm { it is the case that if } e_i = P \textrm { then either } e_i \prec X \textrm { or } e_{i-1} = X \rbrace . \end{equation}
(6)
Combining them, we arrive at our formal definition of valid possible histories as all sequences meeting the three constraining assumptions (4), (5), and (6)
\begin{equation} H \stackrel{\mathsf {def}}{=}H_0 \cap H_1 \cap H_2. \end{equation}
(7)
Once these constraints are applied, only 70 possible histories \(h \in S_H\) remain viable (\(|H| = 70\)). This model is amenable to analysis of CVD, but we need to add a way to express preferences before it is complete. Thus, we are partway through RQ1. Section 8.2 will address how this model can generalize from CVD to MPCVD.
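As a concrete check on this construction, the following minimal Python sketch (ours, for illustration; not part of the original analysis) enumerates the permutations of \(E\) and filters them by the constraints of Equations (4) through (6), confirming that exactly 70 histories remain:

    from itertools import permutations

    EVENTS = "VFDPXA"  # Vendor aware, Fix ready, fix Deployed, Public aware, eXploit public, Attacks

    def is_valid_history(h):
        i = {e: h.index(e) for e in EVENTS}
        causal = i["V"] < i["F"] < i["D"]                 # Equation (4): V before F before D
        vendor = i["V"] < i["P"] or i["V"] == i["P"] + 1  # Equation (5): V before P or immediately after it
        public = i["P"] < i["X"] or i["P"] == i["X"] + 1  # Equation (6): P before X or immediately after it
        return causal and vendor and public

    H = [h for h in permutations(EVENTS) if is_valid_history(h)]
    print(len(H))  # 70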
These histories are listed exhaustively in Table 2. The skill ranking function on the histories will be defined in Section 5.5. The desirability of the history (\(\mathbb {D}^h\)) will be defined in Section 4. The expected frequency of each history \(f_h\) is explained in Section 5.2.
Table 2.
# | \(h\) | rank | \(|\mathbb {D}^h|\) | \(f_h\) | desiderata met, in the order \(D \prec A\), \(D \prec P\), \(D \prec X\), \(F \prec A\), \(F \prec P\), \(F \prec X\), \(P \prec A\), \(P \prec X\), \(V \prec A\), \(V \prec P\), \(V \prec X\), \(X \prec A\) (1 = satisfied, 0 = not)
0 | AXPVFD | 1 | 0 | 0.0833 | 000000000000
1 | APVXFD | 2 | 2 | 0.0417 | 000000010010
2 | AVXPFD | 3 | 2 | 0.0278 | 000000000110
3 | XPVAFD | 4 | 3 | 0.1250 | 000000101001
4 | VAXPFD | 5 | 3 | 0.0208 | 000000001110
5 | PVAXFD | 6 | 4 | 0.0417 | 000000111010
6 | AVPXFD | 7 | 3 | 0.0139 | 000000010110
7 | APVFXD | 7 | 3 | 0.0208 | 000001010010
8 | XPVFAD | 8 | 4 | 0.0625 | 000100101001
9 | VAPXFD | 9 | 4 | 0.0104 | 000000011110
10 | PVXAFD | 10 | 5 | 0.0417 | 000000111011
11 | VPAXFD | 11 | 5 | 0.0104 | 000000111110
12 | PVAFXD | 11 | 5 | 0.0208 | 000001111010
13 | VXPAFD | 11 | 5 | 0.0312 | 000000101111
14 | AVPFXD | 12 | 4 | 0.0069 | 000001010110
15 | APVFDX | 13 | 4 | 0.0208 | 001001010010
16 | VAPFXD | 14 | 5 | 0.0052 | 000001011110
17 | XPVFDA | 15 | 5 | 0.0625 | 100100101001
18 | PVXFAD | 16 | 6 | 0.0208 | 000100111011
19 | AVFXPD | 17 | 4 | 0.0093 | 000011000110
20 | VPXAFD | 18 | 6 | 0.0104 | 000000111111
21 | PVFAXD | 19 | 6 | 0.0139 | 000101111010
22 | VXPFAD | 19 | 6 | 0.0156 | 000100101111
23 | VPAFXD | 20 | 6 | 0.0052 | 000001111110
24 | VAFXPD | 21 | 5 | 0.0069 | 000011001110
25 | PVAFDX | 22 | 6 | 0.0208 | 001001111010
26 | AVPFDX | 23 | 5 | 0.0069 | 001001010110
27 | AVFPXD | 24 | 5 | 0.0046 | 000011010110
28 | PVFXAD | 25 | 7 | 0.0139 | 000101111011
29 | VPXFAD | 25 | 7 | 0.0052 | 000100111111
30 | VAPFDX | 26 | 6 | 0.0052 | 001001011110
31 | VAFPXD | 27 | 6 | 0.0035 | 000011011110
32 | PVXFDA | 28 | 7 | 0.0208 | 100100111011
33 | VPFAXD | 29 | 7 | 0.0035 | 000101111110
34 | VFAXPD | 30 | 6 | 0.0052 | 000111001110
35 | VXPFDA | 31 | 7 | 0.0156 | 100100101111
36 | PVFADX | 32 | 7 | 0.0139 | 001101111010
37 | VPAFDX | 33 | 7 | 0.0052 | 001001111110
38 | VPFXAD | 34 | 8 | 0.0035 | 000101111111
39 | AVFPDX | 35 | 6 | 0.0046 | 001011010110
40 | VFAPXD | 36 | 7 | 0.0026 | 000111011110
41 | VPXFDA | 37 | 8 | 0.0052 | 100100111111
42 | PVFXDA | 37 | 8 | 0.0139 | 100101111011
43 | VAFPDX | 38 | 7 | 0.0035 | 001011011110
44 | VPFADX | 39 | 8 | 0.0035 | 001101111110
45 | VFPAXD | 40 | 8 | 0.0026 | 000111111110
46 | VFXPAD | 41 | 8 | 0.0078 | 000111101111
47 | AVFDXP | 42 | 6 | 0.0046 | 011011000110
48 | PVFDAX | 43 | 8 | 0.0139 | 101101111010
49 | VAFDXP | 44 | 7 | 0.0035 | 011011001110
50 | VPFXDA | 45 | 9 | 0.0035 | 100101111111
51 | VFAPDX | 46 | 8 | 0.0026 | 001111011110
52 | VFPXAD | 46 | 9 | 0.0026 | 000111111111
53 | AVFDPX | 47 | 7 | 0.0046 | 011011010110
54 | PVFDXA | 48 | 9 | 0.0139 | 101101111011
55 | VPFDAX | 49 | 9 | 0.0035 | 101101111110
56 | VFXPDA | 50 | 9 | 0.0078 | 100111101111
57 | VFPADX | 51 | 9 | 0.0026 | 001111111110
58 | VAFDPX | 52 | 8 | 0.0035 | 011011011110
59 | VFADXP | 53 | 8 | 0.0026 | 011111001110
60 | VPFDXA | 54 | 10 | 0.0035 | 101101111111
61 | VFPXDA | 55 | 10 | 0.0026 | 100111111111
62 | VFADPX | 56 | 9 | 0.0026 | 011111011110
63 | VFPDAX | 57 | 10 | 0.0026 | 101111111110
64 | VFDAXP | 58 | 9 | 0.0026 | 111111001110
65 | VFPDXA | 59 | 11 | 0.0026 | 101111111111
66 | VFDAPX | 60 | 10 | 0.0026 | 111111011110
67 | VFDXPA | 61 | 11 | 0.0052 | 111111101111
68 | VFDPAX | 61 | 11 | 0.0026 | 111111111110
69 | VFDPXA | 62 | 12 | 0.0026 | 111111111111
Table 2. Possible Histories \(h \in H\) of CVD

4 On the Desirability of Possible Histories

All histories are not equally preferable. Some are quite bad—for example, those in which attacks precede vendor awareness (\(A \prec V\))—while others are very desirable, for example, those in which fixes are deployed before either an exploit is made public (\(D \prec X\)) or attacks (\(D \prec A\)).
In pursuit of a way to reason about our preferences for some histories over others, we define the following preference criteria: history \(h_a\) is preferred over history \(h_b\) if, all else being equal, a more desirable event \(e_1\) precedes a less desirable event \(e_2\). In notation, \(e_1 \prec e_2\). We define the following ordering preferences:
\(V \prec P\), \(V \prec X\), or \(V \prec A\) – Vendors can take no action to produce a fix if they are unaware of the vulnerability. Public awareness prior to vendor awareness can cause increased support costs for vendors at the same time they are experiencing increased pressure to prepare a fix. If public awareness of the vulnerability prior to vendor awareness is bad, then a public exploit is at least as bad because it encompasses the former and makes it readily evident that adversaries have exploit code available for use. Attacks prior to vendor awareness represent a complete failure of the vulnerability remediation process because they indicate that adversaries are far ahead of defenders.
\(F \prec P\), \(F \prec X\), or \(F \prec A\) – As noted above, the public can take no action until a fix is ready. Because public awareness also implies adversary awareness, the vendor/adversary race becomes even more critical if this condition is unmet. When fixes exist before exploits or attacks, defenders are better able to protect their users.
\(D \prec P\), \(D \prec X\), or \(D \prec A\) – Even better than vendor awareness and fix availability prior to public awareness, exploit publication, or attacks are scenarios in which fixes are deployed prior to one or more of those events.
\(P \prec X\) or \(P \prec A\) – In many cases, \(D\) requires system owners to take action. We therefore prefer histories in which public awareness happens prior to either exploit publication or attacks.
\(X \prec A\) – This criterion is not about whether exploits should be published or not.2 It is about whether we should prefer histories in which exploits are published and then attacks happen over histories in which attacks happen and then an exploit is published. Our position is that attackers have more advantages in the latter case than the former, and therefore we should prefer histories in which \(X \prec A\).
Equation (8) formalizes our definition of desired orderings \(\mathbb {D}\). Table 3 displays all 36 possible orderings of paired events and whether they are considered impossible, required (as defined by Equation (4)), desirable (as defined by Equation (8)), or undesirable (the complement of the set defined in Equation (8)).
Table 3.
  | V | F | D | P | X | A
V | - | r | r | d | d | d
F | - | - | r | d | d | d
D | - | - | - | d | d | d
P | u | u | u | - | d | d
X | u | u | u | u | - | d
A | u | u | u | u | u | -
Table 3. Ordered Pairs of Events
where \({row} \prec {col}\) (Key: - = impossible, r = required, d = desired, and u = undesired).
Before proceeding, we note that our model focuses on the ordering of events, not their timing. We acknowledge that in some situations, the interval between events may be of more interest than merely the order of those events, as a rapid tempo of events can alter the options available to stakeholders in their response. We discuss this limitation further in Section 9; however, the model that follows assumes that events unfold on a human-oriented timescale, measured in minutes to weeks.
\begin{equation} \begin{split} \mathbb {D} \stackrel{\mathsf {def}}{=}\lbrace V \prec P, V \prec X, V \prec A,\\ F \prec P, F \prec X, F \prec A,\\ D \prec P, D \prec X, D \prec A,\\ P \prec X, P \prec A, X \prec A \rbrace . \end{split} \end{equation}
(8)
An element \(d \in \mathbb {D}\) is of the form \(e_i \prec e_j\). More formally, \(d\) is a relation of the form \(d\left(e_1, e_2, \prec \right)\). \(\mathbb {D}\) is a set of such relations.
Given the desired preferences over orderings of events (\(\mathbb {D}\) in Equation (8)), we can construct a partial ordering over all possible histories \(H\), as defined in Equation (10). This partial order requires a formal definition of which desiderata are met by a given history, provided by (9).
\begin{equation} \begin{split} \mathbb {D}^{h} \stackrel{\mathsf {def}}{=}\lbrace d \in \mathbb {D} \textrm { such that } d \textrm { is true for } h \rbrace \textrm {, for } h \in H \\ \textrm {where } d\left(e_1,e_2,\prec \right) \textrm { is true for } h \textrm { if and only if: } \\ \exists e_i, e_j \in h \textrm { such that } e_i = e_1 \textrm { and } e_j = e_2 \textrm { and } h \textrm { satisfies the relation } d\left(e_i,e_j,\prec \right). \end{split} \end{equation}
(9)
\begin{equation} (H,\le _{H}) \stackrel{\mathsf {def}}{=}\forall h_a, h_b \in H \textrm { it is the case that } h_b \le _{H} h_a \textrm { if and only if } \mathbb {D}^{h_b} \subseteq \mathbb {D}^{h_a}. \end{equation}
(10)
A visualization of the resulting partially ordered set, or poset, \((H,\le _{H})\) is shown as a Hasse Diagram in Figure 1. Hasse Diagrams represent the transitive reduction of a poset. Each node in the diagram represents an individual history \(h_a\) from Table 2; labels correspond to the index of the table. Figure 1 follows Equation (10), in that \(h_a\) is higher in the order than \(h_b\) when \(h_a\) contains all the desiderata from \(h_b\) and at least one more. Histories that do not share a path are incomparable (formally, two histories are incomparable if both \(\mathbb {D}^{h_a} \not\supset \mathbb {D}^{h_b}\) and \(\mathbb {D}^{h_a} \not\subset \mathbb {D}^{h_b}\)). The diagram flows from least desirable histories at the bottom to most desirable at the top. This model satisfies RQ1; Sections 5 and 6 will demonstrate that the model is amenable to analysis and Section 8.2 will lay out the criteria for extending it to cover MPCVD.
Fig. 1. The Lattice of Possible CVD Histories: A Hasse Diagram of the partial ordering \((H, \le _{H})\) of \(h_a \in H\) given \(\mathbb {D}\) as defined in Equation (10). The diagram flows from least desirable histories at the bottom to most desirable at the top. Histories that do not share a path are incomparable. Labels indicate the index (row number) \(a\) of \(h_a\) in Table 2.
The poset \((H,\le _{H})\) has as its upper bound \(h_{69} = (V, F, D, P, X, A)\), while its lower bound is \(h_{0} = (A, X, P, V, F, D)\).
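A short sketch of Equations (9) and (10) applied to these two bounding histories may help make the ordering concrete (again, this is our illustration rather than code from the original analysis):

    DESIDERATA = [("V", "P"), ("V", "X"), ("V", "A"), ("F", "P"), ("F", "X"), ("F", "A"),
                  ("D", "P"), ("D", "X"), ("D", "A"), ("P", "X"), ("P", "A"), ("X", "A")]  # Equation (8)

    def desiderata_met(h):
        # Equation (9): the subset of desiderata satisfied by history h
        return {(a, b) for (a, b) in DESIDERATA if h.index(a) < h.index(b)}

    h_69, h_0 = "VFDPXA", "AXPVFD"
    print(len(desiderata_met(h_69)), len(desiderata_met(h_0)))  # 12 0
    print(desiderata_met(h_0) <= desiderata_met(h_69))          # True: h_0 is below h_69 per Equation (10)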
Thus far, we have made no assertions about the relative desirability of any two desiderata (that is, \(d_i,d_j \in \mathbb {D}\) where \(i \ne j\)). In the next section, we will expand the model to include a partial order over our desiderata, but for now it is sufficient to note that any simple ordering over \(\mathbb {D}\) would remain compatible with the partial order given in Equation (10). In fact, a total order on \(\mathbb {D}\) would create a linear extension of the poset defined here, whereas a partial order on \(\mathbb {D}\) would result in a more constrained poset of which this poset would be a subset.

5 Reasoning Over Possible Histories

Our goal in this section is to formulate a way to rank our undifferentiated desiderata \(\mathbb {D}\) from Section 4 in order to develop the concept of CVD skill and its measurement in Section 6. This will provide a baseline expectation about events (RQ2).
In order to begin to differentiate skill from chance in Section 6, we need a model of what the CVD world would look like without any skill. We cannot derive this model by observation. Even when CVD was first practiced in the 1980s, some people may have had social, technical, or organizational skills that transferred to better CVD. We follow the principle of indifference, as stated in [17]:
Principle of Indifference: Let \(X = \lbrace x_1,x_2,\ldots ,x_n\rbrace\) be a partition of the set \(W\) of possible worlds into \(n\) mutually exclusive and jointly exhaustive possibilities. In the absence of any relevant evidence pertaining to which cell of the partition is the true one, a rational agent should assign an equal initial credence of \(1/n\) to each cell.
While the principle of indifference is rather strong, it is inherently a bit difficult to reason about absolutely skill-less CVD when the work of CVD is, by its nature, a skilled job. We will use the principle of indifference to define a baseline against which measurement can be meaningful. For additional analysis of the application of the principle of indifference to this problem, see [21, Section 3.3].

5.1 Event Frequency Analysis

We model event frequency with a simple state-based model of the possible histories \(h \in H\) in which each state is a binary vector indicating which events \(e \in E\) have occurred prior to reaching that state. The events \(e \in E\) therefore represent state transitions, and the histories \(h \in H\) are paths (traces) through the states. This meets the definition above because each \(e \in E\) is unique (mutually exclusive) and the set of available \(e\) at each step of the way is exhaustive. Let \(E^h_{i+1}\) be the set of possible next events following the \(i\)th event in history \(h\), which is a subset of all possible events: \(E^h_{i+1} \subseteq E\). The fragment of a history \(h\) up to its \(i\)th element is a sequence, which contains the first \(i\) events of \(h\), denoted as \(h_i\). The initial case \(h_0 \stackrel{\mathsf {def}}{=}\emptyset\). The probability of transition from \(e_i\) to any of the possible next events \(e_{i+1}\), where \(e_{i+1}\in E^h_{i+1}\), is defined, based on the principle of indifference, as the inverse of the cardinality of the set of possible next events, namely:
\begin{equation} p({e_{i+1}|h_i}) = 1/|E^h_{i+1}|. \end{equation}
(11)
For example, because Equation (4) requires \(V \prec F\) and \(F \prec D\), only four of the six events in \(E\) are possible at the beginning of a history: \(\lbrace V,P,X,A\rbrace\). Therefore, \(p(F|\emptyset) = p(D|\emptyset) = 0\). Since the principle of indifference assigns each possible transition event as equally probable in this model of unskilled CVD, we assign an initial probability of 0.25 to each possible event (\(p(V|\emptyset) = p(P|\emptyset) = p(X|\emptyset) = p(A|\emptyset) = 0.25\)). From there, we see that the other rules dictate possible transitions from each subsequent state. For example, Equation (5) says that any \(h\) starting with \(P\) must start with \(PV\). And Equation (6) requires any \(h\) starting with \(X\) must proceed through \(XP\) and again Equation (5) gets us to \(XPV\). Therefore, we expect histories starting with \(PV\) or \(XPV\) to occur with frequency 0.25 as well.

5.2 History Frequency Analysis

We apply the principle of indifference to the available events (\(E^h_{i+1} \subseteq E\)) at each state \(i\) for each of the possible histories to compute the expected frequency of each history, which we denote as \(f_h\). The frequency of a history \(f_h\) is the cumulative product of the probability \(p\) of each event \(e\) in the history \(h\). We are only concerned with histories that meet our sequence constraints, namely, \(h \in H\)
\begin{equation} f_h = \prod _{i=0}^{5} p(e_{i+1}|h_i). \end{equation}
(12)
Table 2 displays the value of \(f_h\) for each history. Having an expected frequency (\(f_h\)) for each history \(h\) will allow us to examine how often we might expect our desiderata \(d \in \mathbb {D}\) to occur across \(H\).
Choosing uniformly over event transitions is more useful than treating the six-element histories as uniformly distributed. For example, \(P \prec A\) holds in 59% of valid histories, but when histories are weighted by the assumption of uniform state transitions, \(P \prec A\) is expected to occur 67% of the time. These differences arise due to the dependencies between some states. Since CVD practice comprises a sequence of events, each informed by the last, a uniform distribution over events is a more useful baseline than a uniform distribution over histories.

5.3 Event Order Frequency Analysis

Each of the event pair orderings in Table 3 can be treated as a Boolean condition that either holds or does not hold in any given history.
In Section 5.2, we described how to compute the expected frequency of each history (\(f_h\)) given the presumption of indifference to possible events at each step. We can use \(f_h\) as a weighting factor to compute the expected frequency of event orderings (\(e_i \prec e_j\)) across all possible histories \(H\). Equations (13) and (14) define the frequency of an ordering \(f_{e_i \prec e_j}\) as the sum over all histories in which the ordering occurs (\(H^{e_i \prec e_j}\)) of the frequency of each such history (\(f_h\)) as shown in Table 2.
\begin{equation} H^{e_i \prec e_j} \stackrel{\mathsf {def}}{=}\lbrace h \in H \textrm { where } e_i \prec e_j \textrm { is true for } h \textrm { and } i \ne j\rbrace , \end{equation}
(13)
\begin{equation} f_{e_i \prec e_j} \stackrel{\mathsf {def}}{=}\sum _{h \in H^{e_i \prec e_j}} {f_h.} \end{equation}
(14)
Table 4 displays the results of this calculation. Required event orderings have an expected frequency of 1, while impossible orderings have an expected frequency of 0. As defined in Section 4, each desideratum \(d \in \mathbb {D}\) is specified as an event ordering of the form \(e_i \prec e_j\). We use \(f_d\) to denote the expected frequency of a given desideratum \(d \in \mathbb {D}\). The values for the relevant \(f_d\) appear in the upper right of Table 4. Some event orderings have higher expected frequencies than others. For example, vendor awareness precedes attacks in 3 out of 4 histories in a uniform distribution of event transitions (\(f_{V \prec A} = 0.75\)), whereas fix deployed prior to public awareness holds in less than 1 out of 25 (\(f_{D \prec P} = 0.037\)) histories generated by a uniform distribution over event transitions.
Table 4.
  | V | F | D | P | X | A
V | 0 | 1 | 1 | 0.333 | 0.667 | 0.75
F | 0 | 0 | 1 | 0.111 | 0.333 | 0.375
D | 0 | 0 | 0 | 0.037 | 0.167 | 0.187
P | 0.667 | 0.889 | 0.963 | 0 | 0.5 | 0.667
X | 0.333 | 0.667 | 0.833 | 0.5 | 0 | 0.5
A | 0.25 | 0.625 | 0.812 | 0.333 | 0.5 | 0
Table 4. Expected Frequency of \({row} \prec {col}\) When Events are Chosen Uniformly from Possible Events at Each Point
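The entries in Table 4 can be recomputed directly from Equations (11) through (14). The following Python sketch (ours, for illustration) derives \(f_h\) for each valid history and sums the appropriate histories to obtain the expected ordering frequencies; for example, it recovers \(f_{V \prec A} = 0.75\) and \(f_{D \prec P} = 0.037\):

    from itertools import permutations
    from math import prod

    EVENTS = "VFDPXA"

    def is_valid_history(h):
        i = {e: h.index(e) for e in EVENTS}
        return (i["V"] < i["F"] < i["D"]
                and (i["V"] < i["P"] or i["V"] == i["P"] + 1)
                and (i["P"] < i["X"] or i["P"] == i["X"] + 1))

    H = [h for h in permutations(EVENTS) if is_valid_history(h)]

    def next_events(prefix):
        # events that can occur next, i.e., that extend prefix toward at least one valid history
        return {h[len(prefix)] for h in H if h[:len(prefix)] == prefix}

    def f_h(h):
        # Equation (12): product of uniform transition probabilities along the history
        return prod(1 / len(next_events(h[:i])) for i in range(len(h)))

    def f_order(ei, ej):
        # Equation (14): total weight of the histories in which ei precedes ej
        return sum(f_h(h) for h in H if h.index(ei) < h.index(ej))

    print(round(f_h(tuple("AXPVFD")), 4))  # 0.0833 (row 0 of Table 2)
    print(round(f_order("V", "A"), 3))     # 0.75
    print(round(f_order("D", "P"), 3))     # 0.037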

5.4 A Partial Order on Desiderata

In any observation of phenomena in which we measure the performance of human actors, some portion of the outcome can be attributed to skill and some portion to chance [15, 26]. It is reasonable to wonder whether good outcomes in CVD are the result of luck or skill. How can we tell the difference?
We begin with a simple model in which outcomes \(o\) are a combination of luck and skill,
\begin{equation} o_{observed} = o_{luck} + o_{skill}. \end{equation}
(15)
In other words, outcomes due to skill are what remain when you subtract the outcomes due to luck from the outcomes you observe. In this model, we treat luck as a random component: the contribution of chance. In a world where neither attackers nor defenders held any advantage and events were chosen uniformly from \(E\) whenever they were possible, we would expect to see the preferred orderings occur with probability equivalent to their frequency \(f_d\) as shown in Table 4.
Skill, on the other hand, accounts for the outcomes once luck has been accounted for. So the more likely an outcome is due to luck, the less skill we can infer when it is observed. As an example, from Table 4 we see that fix deployed before the vulnerability is public is the rarest of our desiderata with \(f_{D \prec P} = 0.037\), and thus exhibits the most skill when observed. On the other hand, vendor awareness before attacks is expected to be a common occurrence with \(f_{V \prec A} = 0.75\).
We can therefore use the set of \(f_d\) to construct a partial order over \(\mathbb {D}\) in which we prefer desiderata \(d\) that are more rare (and therefore imply more skill when observed) over those that are more common. We create the partial order on \(\mathbb {D}\) as follows: for any pair \(d_1,d_2 \in \mathbb {D}\), we say that \(d_2\) exhibits less skill than \(d_1\) if \(d_2\) occurs more frequently in \(H\) than \(d_1\)
\begin{equation} (\mathbb {D},\le _{\mathbb {D}}) \stackrel{\mathsf {def}}{=}d_2 \le _{\mathbb {D}} d_1 \iff {f_{d_2}} \stackrel{\mathbb {R}}{\ge } {f_{d_1}}. \end{equation}
(16)
Note that the inequalities on the left and right sides of Equation (16) are flipped because skill is inversely proportional to luck. Also, while \(\le _{\mathbb {D}}\) on the left side of Equation (16) defines a preorder over \(\mathbb {D}\), the \(\stackrel{\mathbb {R}}{\ge }\) on the right side is the usual ordering over the set of real numbers. The result is a partial order \((\mathbb {D},\le _{\mathbb {D}})\) because a few \(d\) have the same \(f_d\) (\(f_{F \prec X} = f_{V \prec P} = 0.333\), for example). The full Hasse Diagram for the partial order \((\mathbb {D},\le _{\mathbb {D}})\) is shown in Figure 2.
Fig. 2.
Fig. 2. Hasse Diagram of the partial order \((\mathbb {D},\le _{\mathbb {D}})\) defined in Equation (16) where the rarity of each \(d\) as shown in Table 4 is taken to reflect skill. Nodes at the top of the diagram reflect the most skill.

5.5 Ordering Possible Histories by Skill

Next we develop a new partial order on \(H\) given the partial order \((\mathbb {D},\le _{\mathbb {D}})\) just described. We observe that \(\mathbb {D}^{h}\) acts as a Boolean vector of desiderata met by a given \(h\). Since \(0 \le f_d \le 1\), simply taking its inverse could in the general case lead to some large values for rare events, so for convenience we use \(\log (1/f_d)\) as our proxy for skill. Taking the dot product of \(\mathbb {D}^h\) with the set of \(\log (1/f_d)\) represented as a vector, we arrive at a single value representing the skill exhibited for each history \(h\). Careful readers may note that this value is equivalent to the Term Frequency–Inverse Document Frequency (TF-IDF) score for a search for the “skill terms” represented by \(\mathbb {D}\) across the corpus of possible histories \(H\).
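As an illustration of this calculation (our sketch, with the \(f_d\) values transcribed from Table 4), the score for a history is the sum of \(\log (1/f_d)\) over the desiderata it satisfies:

    import math

    F_D = {("V", "P"): 0.333, ("V", "X"): 0.667, ("V", "A"): 0.75,
           ("F", "P"): 0.111, ("F", "X"): 0.333, ("F", "A"): 0.375,
           ("D", "P"): 0.037, ("D", "X"): 0.167, ("D", "A"): 0.187,
           ("P", "X"): 0.5,   ("P", "A"): 0.667, ("X", "A"): 0.5}  # expected frequencies from Table 4

    def skill_score(h):
        # dot product of the desiderata indicator vector with log(1/f_d)
        return sum(math.log(1 / f) for (a, b), f in F_D.items() if h.index(a) < h.index(b))

    print(round(skill_score("VFDPXA"), 2))  # most skillful history (rank 62 in Table 2)
    print(round(skill_score("AXPVFD"), 2))  # least skillful history (rank 1 in Table 2): 0.0

Sorting all 70 histories by this score should recover the rank column of Table 2, up to ties among histories with identical scores.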
We have now computed a skill value for every \(h \in H\), which allows us to sort \(H\) and assign a rank to each history \(h\) contained therein. The rank is shown in Table 2. Rank values start at 1 for least skill up to a maximum of 62 for most skill. Owing to the partial order \((\mathbb {D},\le _{\mathbb {D}})\), some \(h\) have the same computed skill values, and these are given the same rank.
The ranks for \(h \in H\) lead directly to a new poset \((H,\le _{\mathbb {D}})\), which is an extension of and fully compatible with \((H,\le _{H})\) as developed in Section 4. The resulting Hasse Diagram would be too large to reproduce here. Instead, we include the resulting rank for each \(h\) as a column in Table 2. In the table, rank is ordered from least desirable and skillful histories to most. Histories having identical rank are incomparable to each other within the poset. The refined poset \((H,\le _{\mathbb {D}})\) is much closer to a total order on \(H\), as indicated by the relatively few histories having duplicate ranks.
The remaining incomparable histories are the direct result of the incomparable \(d\) in \((\mathbb {D},\le _{\mathbb {D}})\), corresponding to the branches in Figure 2. Achieving a total order on \(\mathbb {D}\) would require answering the following: Assuming you could achieve only one, and without regard to any other \(d \in \mathbb {D}\), would you prefer
that fix ready precede exploit publication (\(F \prec X\)) or that vendor awareness precede public awareness (\(V \prec P\))?
that public awareness precede exploit publication (\(P \prec X\)) or that exploit publication precede attacks (\(X \prec A\))?
that public awareness precede attacks (\(P \prec A\)) or vendor awareness precede exploit publication (\(V \prec X\))?
Recognizing that readers may have diverse opinions on all three questions, we leave further analysis of the answers to these as future work.
This is just one example of how poset refinements might be used to order \(H\). Different posets on \(\mathbb {D}\) would lead to different posets on \(H\). For example, one might construct a different poset if certain \(d\) were considered to have much higher financial value when achieved than others.

6 Discriminating Skill and Luck in Observations

This section defines a method for measuring skillful behavior in CVD, which we will need to answer RQ3 about measuring and evaluating CVD “in the wild.” The measurement method makes use of all the modeling tools and baselines established thus far: a comprehensive set of possible histories \(H\), a partial order over them in terms of the presence of desired event precedence \(\mathbb {D}\), and the a priori expected frequency of each desideratum \(d \in \mathbb {D}\).
If we expected to be able to observe all events in all CVD cases, we could be assured of having complete histories and could be done here. But the real world is messy. Not all events \(e \in E\) are always observable. We need to develop a way to make sense of what we can observe, regardless of whether we are ever able to capture complete histories. Continuing towards our goal of measuring efficacy, we return to considering the balance between skill and luck in determining our observed outcomes.
Of course, there are any number of conceivable reasons why we should expect our observations to differ from the expected frequencies we established in Section 5. Adversaries might be rare, or conversely very well equipped. Vendors might be very good at releasing fixes faster than adversaries can discover vulnerabilities and develop exploits for them. System owners might be diligent at applying patches. We did say might, did we not? Regardless, for now we will lump all of those possible explanations into a single attribute we will call “skill.”
In a world of pure skill, one would expect that a player could achieve all 12 desiderata \(d \in \mathbb {D}\) consistently. That is, a maximally skillful player could consistently achieve the specific ordering \(h=(V,F,D,P,X,A)\) with probability \(p_s = 1\).
Thus, we construct the following model: For each of our preferred orderings \(d \in \mathbb {D}\), we model their occurrence due to luck using the binomial distribution with parameter \(p_l = f_d\) taken from Table 4.
Recall that the mean of a binomial distribution is simply the probability of success \(p\), and that the mean of a weighted mixture of two binomial distributions is simply the weighted mixture of the individual means. Therefore, our model adds a parameter \(\alpha _d\) to represent the weighting between our success rates arising from skill \(p_s\) and luck \(p_l\). Because there are 12 desiderata \(d \in \mathbb {D}\), each \(d\) will have its own observations and corresponding value for \(\alpha _d\) for each history \(h_a\),
\begin{equation} f_d^{obs} = \alpha _d p_s + (1 - \alpha _d) p_l. \end{equation}
(17)
Here, \(f_d^{obs}\) is the observed frequency of successes for desideratum \(d\). Because \(p_s = 1\), one of those binomial distributions is degenerate. Substituting \(p_s = 1\), \(p_l = f_d\) and solving Equation (17) for \(\alpha\), we get
\begin{equation} \alpha _d = \frac{f_d^{obs} - f_d}{1 - f_d}. \end{equation}
(18)
The value of \(\alpha _d\) therefore gives us a measure of the observed skill normalized against the background success rate provided by luck \(f_d\).
We denote the set of \(\alpha _d\) values for a given history as \(\alpha _\mathbb {D}\). When we refer to the \(\alpha _d\) coefficient for a specific \(d\) we will use the specific ordering as the subscript, for example: \(\alpha _{F \prec P}\)
\begin{equation} \alpha _\mathbb {D} = \lbrace \alpha _d : d \in \mathbb {D} \rbrace . \end{equation}
(19)
The concept embodied by \(f_d\) is founded on the idea that if attackers and defenders are in a state of equilibrium, the frequency of observed outcomes (i.e., how often each desiderata \(d\) and history \(h\) actually occurs) will appear consistent with those predicted by chance. So another way of interpreting \(\alpha _d\) is as a measure of the degree to which a set of observed histories is out of equilibrium.
The following are a few comments on how \(\alpha _d\) behaves. Note that \(\alpha _d \lt 0\) when \(0 \le f_d^{obs} \lt f_d\) and \(0 \le \alpha _d \le 1\) when \(f_d \le f_d^{obs} \le 1\). The implication is that a negative value for \(\alpha _d\) indicates that our observed outcomes are actually worse than those predicted by pure luck. In other words, we can only infer positive skill when the observations are higher (\(f_d^{obs} \gt f_d\)). That makes intuitive sense: If you are likely to win purely by chance, then you have to attribute most of your wins to luck rather than skill. The highest value for any \(\mathbb {D}\) in Table 4 is \(f_{V \prec A}=0.75\), implying that even if a vendor only knows about 7 out of 10 vulnerabilities before attacks occur (\(f_{V \prec A}^{obs} = 0.7\)), they are still not doing better than random.
On the other hand, when \(f_d\) is small it is easier to infer skill should we observe anything better than \(f_d\). However, it takes larger increments of observations \(f_d^{obs}\) to infer growth in skill when \(f_d\) is small than when it is large. The smallest \(f_d\) we see in Table 4 is \(f_{D \prec P} = 0.037\).
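To illustrate Equation (18) with two hypothetical observation levels (the numbers here are ours, chosen only for the example):

    def alpha(f_obs, f_d):
        # Equation (18): skill normalized against the background success rate f_d
        return (f_obs - f_d) / (1 - f_d)

    print(round(alpha(0.70, 0.75), 2))   # -0.2: a 70% success rate on V ≺ A is worse than luck
    print(round(alpha(0.50, 0.037), 2))  # 0.48: a 50% success rate on D ≺ P is well above luck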
Inherent to the binomial distribution is the expectation that the variance of results is lower for both extremes (as \(p\) approaches 0 or 1) and highest at \(p=0.5\). Therefore, we should generally be less certain of our observations when they fall in the middle of the distribution. We address uncertainty further in Section 7.2.

7 Observing CVD in the Wild

As a proof of concept, in this section, we will demonstrate the application of the model developed in Section 6 to observational data. To begin, Section 7.1 refines the model to accommodate observational data. Next we address uncertainty in observations in Section 7.2. Finally, Sections 7.3 and 7.4 apply the model to two data sets: Microsoft’s security updates from 2017 through early 2020, and commodity public exploits from 2015–2019.

7.1 Computing \(\alpha _d\) from Observations

Although Equation (18) develops a skill metric from observed frequencies, our observations will in fact be based on counts. Observations consist of some number of successes \(S_d^{o}\) out of some number of trials \(T\), i.e.,
\begin{equation} f_d^{obs} = \frac{S_d^{o}}{T}. \end{equation}
(20)
We likewise revisit our interpretation of \(f_d\)
\begin{equation} f_d = \frac{S_d^l}{T} , \end{equation}
(21)
where \(S_d^l\) is the number of successes at \(d\) we would expect due to luck in \(T\) trials.
Substituting Equations (20) and (21) into Equation (18), and recalling that \(p_s = 1\) because a maximally skillful player succeeds in \(T\) out of \(T\) trials, we get
\begin{equation} \alpha = \frac{\frac{S_d^o}{T}-\frac{S_d^l}{T}}{\frac{T}{T}-\frac{S_d^l}{T}}. \end{equation}
(22)
Rearranging Equation (21) to \(S_d^l = {f_d}T\), substituting into Equation (22), and simplifying, we get
\begin{equation} \alpha = \frac{{S_d^o}-{f_d}T}{(1-{f_d})T}. \end{equation}
(23)
Hence, for any of our desiderata \(\mathbb {D}\) we can compute \(\alpha _d\) given \(S_d^o\) observed successes out of \(T\) trials in light of \(f_d\) taken from Table 4.
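A direct implementation of Equation (23) is a one-liner; as an illustration (ours), the example below plugs in the Microsoft \(F \prec P\) counts reported later in Section 7.3 together with \(f_{F \prec P} = 0.111\) from Table 4:

    def alpha_d(successes, trials, f_d):
        # Equation (23): alpha_d from S_d^o observed successes in T trials, given the luck baseline f_d
        return (successes - f_d * trials) / ((1 - f_d) * trials)

    print(round(alpha_d(2610, 2694, 0.111), 2))  # roughly 0.96 for the Microsoft F ≺ P observations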
Before we address the data analysis we take a moment to discuss uncertainty.

7.2 Calculating Measurement Error

We have already described the basis of our \(f_d^{obs}\) model in the binomial distribution. While we could just estimate the error in our observations using the binomial’s variance \(np(1-p)\), because of boundary conditions at 0 and 1 we should not assume symmetric error. An extensive discussion of uncertainty in the binomial distribution is given in [9].
However, for our purpose the Beta distribution lends itself to this problem nicely. The Beta distribution is specified by two parameters \((a,b)\). It is common to interpret \(a\) as the number of successes and \(b\) as the number of failures in a set of observations of Bernoulli trials to estimate the mean of the binomial distribution from which the observations are drawn. For any given mean, the width of the Beta distribution narrows as the total number of trials increases.
We use this interpretation to estimate a 95% credible interval for \(f_d^{obs}\) using a Beta distribution with parameters \(a = S_d^o\) (the number of successes) and \(b = T - S_d^o\) (the number of failures), via the scipy.stats.beta.interval function in Python. This gives us upper and lower estimates for \(f_d^{obs}\), which we multiply by \(T\) to get upper and lower estimates of \(S_d^o\) as in Equation (20).
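A minimal sketch of that computation (ours; the counts are again the Microsoft \(F \prec P\) numbers from Section 7.3) looks like the following:

    from scipy.stats import beta

    S, T = 2610, 2694                           # example: Microsoft F ≺ P counts (Section 7.3)
    lo, hi = beta.interval(0.95, a=S, b=T - S)  # 95% credible interval for f_d^obs
    print(round(lo, 3), round(hi, 3))           # approximately 0.962 0.975
    print(round(lo * T), round(hi * T))         # corresponding lower and upper estimates of S_d^o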

7.3 Microsoft 2017–2020

We are now ready to proceed with our data analysis. First, we examine Microsoft’s monthly security updates for the period between March 2017 and May 2020, as curated by the Zero Day Initiative (ZDI) blog.3 Figure 3(a) shows monthly totals for all vulnerabilities, while Figure 3(b) has monthly observations of \(P \prec F\) and \(A \prec F\). This dataset allowed us to compute the monthly counts for \(F \prec P\) and \(F \prec A\).
Fig. 3. Publicly Disclosed Microsoft Vulnerabilities 2017–2020.
One benefit of the definition of possible CVD histories in Section 3 is the opportunity it provides to clarify definitions of related terms. While not a focus of the analysis in this article, a brief comment on different interpretations of the term “zero day vulnerability” is helpful at this point. For example, one reviewer stated that they prefer to define “zero day vulnerability” as \(X \prec V\) rather than \((P \prec F\) or \(A \prec F)\). Precise definitions matter because sometimes both \(X \prec V\) and \(P \prec F\) hold, in which case two people might agree that an instance is a zero day without realizing that they disagree on the definition. We do not disagree with the reviewer that \(X \prec V\) is a definition of zero day; it is simply not the one we use for this analysis, since \((P \prec F\) or \(A \prec F)\) is the definition ZDI used. For example, perhaps \(X \prec V\) is a zero day from the perspective of a software vendor, whereas \((P \prec F\) or \(A \prec F)\) is a zero day from the perspective of a system owner. An extended version of this article brings additional formalism to the various definitions of “zero day” [21]. However, for present purposes the ability to state our definition accurately and cleanly is enough.
Observations of \(F \prec P\): In total, Microsoft issued patches for 2,694 vulnerabilities; 2,610 (0.97) of them met the fix-ready-before-public-awareness (\(F \prec P\)) objective. The mean monthly \(\alpha _{F \prec P} = 0.967\), with a range of [0.878, 1.0]. We can also use the cumulative data to estimate an overall skill level for the observation period, which gives us a bit more precision on \(\alpha _{F \prec P} = 0.969\) with a 95% interval of [0.962, 0.975]. Figure 4(a) shows the trend for both the monthly observations and the cumulative estimate of \(\alpha _{F \prec P}\).
Fig. 4. Publicly Disclosed Microsoft Vulnerabilities 2017–2020.
Observations of \(F \prec A\): Meanwhile, 2,655 (0.99) vulnerabilities met the fix-ready-before-attacks-observed (\(F \prec A\)) criteria. Thus, we compute a mean monthly \(\alpha _{F \prec A} = 0.976\) with range [0.893, 1.0]. The cumulative estimate yields \(\alpha _{F \prec A} = 0.986\) with an interval of [0.980, 0.989]. The trend for both is shown in Figure 4(b).
Inferring Histories from Observations: Another possible application of our model is to estimate unobserved \(\alpha _d\) based on the cumulative observations of both \(f_{F \prec P}^{obs}\) and \(f_{F \prec A}^{obs}\) above. Here, we estimate the frequency \(f_d\) of the other \(d \in \mathbb {D}\) for this period. Our procedure is as follows (a simplified code sketch appears after the numbered steps):
(1)
For 10,000 rounds, draw an \(f_d^{est}\) for both \(F \prec P\) and \(F \prec A\) from the Beta distribution with parameters \(a=S_d^o\) and \(b=T-S_d^o\) where \(S_d^o\) is 2,610 or 2,655, respectively, and \(T\) is 2,694.
(2)
Assign each \(h \in H\) a weight according to standard joint probability based on whether it meets both, either, or neither of \(A = F \prec P\) and \(B = F \prec A\), respectively.
\begin{equation} w_h = {\left\lbrace \begin{array}{ll}p_{AB} = f_A * f_B \textrm { if } A \textrm { and } B\\ p_{Ab} = f_A * f_b \textrm { if } A \textrm { and } \lnot B\\ p_{aB} = f_a * f_B \textrm { if } \lnot A \textrm { and } B\\ p_{ab} = f_a * f_b \textrm { if } \lnot A \textrm { and } \lnot B \end{array}\right.}, \end{equation}
(24)
where \(f_a = 1 - f_A\) and \(f_b = 1 - f_B\).
(3)
Draw a weighted sample (with replacement) of size \(N = 2,694\) from \(H\) according to these weights.
(4)
Compute the sample frequency \(f_{d}^{sample} = S_d^{sample} / N\) for each \(d \in \mathbb {D}\), and record the median rank of all histories \(h\) in the sample.
(5)
Compute the estimated frequency as the mean of the sample frequencies \(f_{d}^{est} = \overline{f_{d}^{sample}}\) for each \(d \in \mathbb {D}\).
(6)
Compute \(\alpha _d\) from \(f_{d}^{est}\) for each \(d \in \mathbb {D}\) .
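The following compact Python sketch (ours; simplified to estimate a single desideratum, \(D \prec P\), and to use fewer rounds than the 10,000 described above) illustrates the procedure:

    import random
    from itertools import permutations
    from statistics import mean
    from scipy.stats import beta

    EVENTS = "VFDPXA"

    def is_valid_history(h):
        i = {e: h.index(e) for e in EVENTS}
        return (i["V"] < i["F"] < i["D"]
                and (i["V"] < i["P"] or i["V"] == i["P"] + 1)
                and (i["P"] < i["X"] or i["P"] == i["X"] + 1))

    H = [h for h in permutations(EVENTS) if is_valid_history(h)]

    def precedes(h, a, b):
        return h.index(a) < h.index(b)

    T, S_FP, S_FA = 2694, 2610, 2655   # Microsoft counts from the text above
    F_DP = 0.037                       # f_{D ≺ P} from Table 4
    rounds, estimates = 1000, []

    for _ in range(rounds):
        f_fp = beta.rvs(S_FP, T - S_FP)   # step 1: draw f_d^est for F ≺ P
        f_fa = beta.rvs(S_FA, T - S_FA)   #         and for F ≺ A
        weights = [(f_fp if precedes(h, "F", "P") else 1 - f_fp)
                   * (f_fa if precedes(h, "F", "A") else 1 - f_fa) for h in H]  # step 2
        sample = random.choices(H, weights=weights, k=T)                        # step 3
        estimates.append(mean(precedes(h, "D", "P") for h in sample))           # step 4 (D ≺ P only)

    f_dp_est = mean(estimates)                       # step 5
    print(round((f_dp_est - F_DP) / (1 - F_DP), 2))  # step 6: estimated alpha for D ≺ P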
As one might expect given the causal requirement that vendor awareness precedes fix availability, the estimated values of \(\alpha _d\) are quite high (\(0.96-0.99\)) for our desiderata involving either \(V\) or \(F\). We also estimate that \(\alpha _d\) is positive—indicating that we are observing skill over and above mere luck—for all \(d\) except \(P \prec A\) and \(X \prec A\), which are slightly negative. The results are shown in Figure 5. The most common sample median history rank across all runs is 53, with all sample median history ranks falling between 51 and 55. The median rank of possible histories weighted according to the assumption of equiprobable transitions is 11. We take this as evidence that the observations are indicative of skill.
Fig. 5. Simulated skill \(\alpha _d\) for Microsoft 2017–2020 based on observations of \(F \prec P\) and \(F \prec A\) over the period.

7.4 Commodity Exploits 2015–2019

Next, we examine the overall trend in \(P \prec X\) for commodity exploits between 2015 and 2019. The dataset is based on the National Vulnerability Database [32], in conjunction with the CERT Vulnerability Data Archive [11]. Between these two databases, a number of candidate dates are available to represent the date a vulnerability was made public. We use the minimum of these as the date for \(P\).
To estimate the exploit availability (\(X\)) date, we extracted the date a CVE-ID appeared in the git logs for Metasploit [36] or Exploitdb [33]. When multiple dates were available for a CVE-ID, we kept the earliest. Note that commodity exploit tools such as Metasploit and Exploitdb represent a non-random sample of the exploits available to adversaries. These observations should be taken as a lower bound on exploit availability, and therefore as an upper bound on the observed desiderata \(d\) and skill \(\alpha _d\).
During the time period from 2013–2019, the dataset contains 73,474 vulnerabilities. Of these, 1,186 were observed to have public exploits (\(X\)) prior to the earliest observed vulnerability disclosure date (\(P\)), giving an overall success rate for \(P \prec X\) of 0.984. The mean monthly \(\alpha _{P \prec X}\) is 0.966 with a range of [0.873, 1.0]. The volatility of this measurement appears to be higher than that of the Microsoft data. The cumulative \(\alpha _{P \prec X}\) comes in at 0.968 with an interval spanning [0.966, 0.970]. A chart of the trend is shown in Figure 6.
Fig. 6. \(\alpha _{P \prec X}\) for all NVD vulnerabilities 2013-2019 (\(X\) observations based on Metasploit and ExploitDb).
To estimate unobserved \(\alpha _d\) from the commodity exploit observations, we repeat the procedure outlined in Section 7.3. This time, we use \(N=73,474\) and estimate \(f^{est}_{d}\) for \(P \prec X\) with Beta parameters \(a=72,288\) and \(b=1,186\). As above, we find evidence of skill in positive estimates of \(\alpha _d\) for all desiderata except \(P \prec A\) and \(X \prec A\), which are negative. The most common sample median history rank in this estimate is 33 with a range of [32, 33], which is lower than the median rank of 53 in the Microsoft estimate from Section 7.3 but still beats the median rank of 11 assuming uniform event probabilities. The results are shown in Figure 7.
Fig. 7. Simulated skill \(\alpha _d\) for all NVD vulnerabilities 2013–2019 based on observations of \(P \prec X\) over the period.

8 Discussion

The observational analysis in Section 7 supports an affirmative response to RQ3: vulnerability disclosure as currently practiced demonstrates skill. In both datasets examined, our estimated \(\alpha _d\) is positive for most \(d \in \mathbb {D}\). However, there is uncertainty in our estimates due to the application of the principle of indifference to unobserved data. This principle assumes a uniform distribution across event transitions in the absence of CVD, which is an assumption we cannot readily test. The spread of the estimates in Figures 5 and 7 represents the variance in our samples, not this assumption-based uncertainty. Our interpretation of \(\alpha _d\) values near 0 is therefore that they reflect an absence of evidence rather than evidence that skill is absent. While we cannot definitively rule out luck or low skill, values of \(\alpha _d \gt 0.9\) should reliably indicate skillful defenders.
If, as seems plausible from the evidence, further observations of \(h\) turn out to be significantly skewed toward the higher end of the poset \((H,\le _{\mathbb {D}})\), then it may be useful to empirically calibrate our metrics rather than using the a priori frequencies in Table 4 as our baseline. Such an empirically calibrated baseline would indicate “more skillful than the average of some set of teams” rather than merely “more skillful than blind luck.” Section 8.1 discusses this topic; it can be read as an examination of what “reasonable” should mean in RQ2's “reasonable baseline expectation.”
Section 8.2 suggests how the model might be applied to establish benchmarks for CVD processes involving any number of participants, which closes the analysis of RQ1 in relation to MPCVD. Section 8.3 surveys the stakeholders in CVD (vendors, system owners, the security research community, coordinators, and governments) and how they might use our model. In particular, we focus on how these stakeholders might respond to the affirmative answer to RQ3, and on how skill might be measured in a way tailored to each stakeholder group.

8.1 Benchmarks

As described above, in an ideal CVD situation, each observed history would achieve all of the desiderata \(\mathbb {D}\). Realistically, this is unlikely to happen. We can at least state that we would prefer that most cases reach fix ready before attacks (\(F \prec A\)). Per Table 4, even in a world without skill we would expect \(F \prec A\) to hold in 73% of cases. This means that \(\alpha _{F \prec A} \lt 0\) for anything less than a 0.73 success rate. Doing just barely better than random (\(\epsilon \gt \alpha _d \gt 0\) for \(\epsilon \approx 0\)) is not terribly satisfying, so we would like to seek outcomes in which \(\alpha _{F \prec A} \ge c_{F \prec A} \ge 0\) for some benchmark constant \(c_{F \prec A}\). In fact, we propose to generalize this for any \(d \in \mathbb {D}\), such that \(\alpha _d\) should be greater than some benchmark constant \(c_d\):
\begin{equation} \alpha _d \ge c_d \ge 0, \tag{25} \end{equation}
where \(c_d\) is based on observations of \(\alpha _d\) collected across some set of CVD cases.
We propose as a starting point a naïve benchmark of \(c_d = 0\). This is a low bar: it only requires that CVD do better than a process in which the possible events are independent and identically distributed (i.i.d.) within each case. For example, given a history in which \((V, F, P)\) have already happened, \(D\), \(X\), or \(A\) would be equally likely to occur next.
The i.i.d. assumption may not be warranted. We anticipate that event ordering probabilities might be conditional on history: for example, \(p(X|P) \gt p(X|\lnot P)\) or \(p(A|X) \gt p(A|\lnot X)\). If the i.i.d. assumption fails to hold for \(e \in E\), observed frequencies of \(h \in H\) could differ significantly from the rates predicted by the uniform probability assumption behind Table 4.
Some example suggestive observations are:
There is reason to suspect that only a fraction of vulnerabilities ever reach the exploit public event \(X\), and fewer still reach the attack event \(A\). Recent work by the Cyentia Institute found that “5% of all CVEs are both observed within organizations AND known to be exploited”[14], which suggests that \(f_{D \prec A} \approx 0.95\).
Likewise, \(D \prec X\) holds in 28 of 70 (0.4) \(h\). However, Cyentia found that “15.6% of all open vulnerabilities observed across organizational assets in our sample have known exploits”[14], which suggests that \(f_{D \prec X} \approx 0.844\).
On their own, these observations are equally consistent with the idea that we are broadly observing skill in vulnerability response and with the world being skewed by some other cause. However, we could choose a slightly different goal than differentiating skill from the “blind luck” represented by the i.i.d. assumption. One could instead aim at measuring “more skillful than the average of some set of teams” rather than more skillful than blind luck.
If this were the “reasonable” baseline expectation (RQ2), the primary limitation would be the availability of observations. This model helps overcome that limitation because it provides a clear path toward collecting relevant observations. For example, by collecting dates for the six events \(e \in E\) across a large sample of vulnerabilities, we can get better estimates of the relative frequency of each history \(h\) in the real world. It seems likely that better data would serve more to improve benchmarks than to change expectations about the role of chance.
As an applied example, if we take the first item in the list above as a broad observation of \(f_{D \prec A} = 0.95\), we can plug it into Equation (18) to get a potential benchmark of \(\alpha _{D \prec A} = 0.94\), which is considerably higher than the naïve generic benchmark \(\alpha _d = 0\). It also implies that we should expect actual observations to fall within the 19 \(h\) in which \(D \prec A\) around 19 times as often as within the 51 \(h\) in which \(D \not\prec A\). Similarly, if we interpret the second item as a broad observation of \(f_{D \prec X} = 0.844\), we can compute a benchmark \(\alpha _{D \prec X} = 0.81\), which is again a significant improvement over the naïve \(\alpha _d = 0\) benchmark.
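The following minimal sketch illustrates the benchmark test of Equation (25) using the 0.73 chance baseline for \(F \prec A\) quoted earlier in this section; the observed rate and the benchmark constant \(c_d\) are hypothetical, and the skill computation follows the worked values above.

# Hedged sketch of the benchmark check in Equation (25). The 0.73 chance
# baseline for F < A is quoted from this section; the observed rate and the
# benchmark constant c_d are hypothetical illustrations, and the alpha
# formula follows the article's worked examples.
def alpha(f_obs: float, f_base: float) -> float:
    """Skill relative to a baseline expectation."""
    return (f_obs - f_base) / (1 - f_base)

f_base_FA = 0.73   # chance expectation that fix ready precedes attacks (Table 4)
f_obs_FA = 0.90    # hypothetical observed success rate for F < A
c_FA = 0.25        # hypothetical benchmark constant for F < A

alpha_FA = alpha(f_obs_FA, f_base_FA)   # ~0.63
print(f"alpha_F<A = {alpha_FA:.2f}; meets benchmark c_d={c_FA}: {alpha_FA >= c_FA}")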

8.2 Applicability to MPCVD

MPCVD occurs when multiple vendors or stakeholders are involved in the disclosure process. The need for MPCVD arises due to the inherent nature of the software supply chain [22]. A vulnerability that affects a low-level component (such as a library or operating system API) can require fixes from both the originating vendor and any vendor whose products incorporate the affected component. Alternatively, vulnerabilities are sometimes found in protocol specifications or other design-time issues where multiple vendors may have each implemented their own components based on a vulnerable design.
A common problem in MPCVD is that of fairness: Coordinators are often motivated to optimize the CVD process to maximize the deployment of fixes to as many end users as possible while minimizing the exposure of users of other affected products to unnecessary risks.
The model presented in this article provides a way for coordinators to assess the effectiveness of their MPCVD cases. In an MPCVD case, each vendor/product pair effectively has its own 6-event history \(h_a\). We can therefore recast MPCVD as a set of histories \(M\) drawn from the possible histories \(H\):
\begin{equation} M = \lbrace h_1,h_2,\ldots ,h_m \textrm { where each } h_a \in H \rbrace , \tag{26} \end{equation}
where \(m = |M| \ge 1\). The edge case \(|M| = 1\) is simply the regular (non-multiparty) CVD case.
We can then set desired criteria for the set \(M\), as in the benchmarks described in Section 8.1. In the MPCVD case, we propose to generalize the benchmark concept such that the median \(\tilde{\alpha _d}\) should be greater than some benchmark constant \(c_d\)
\begin{equation} \tilde{\alpha _d} \ge c_d \ge 0. \tag{27} \end{equation}
Because some outcomes across the different vendor/product pairs in a real-world case will necessarily be worse than others, we can also add the criterion that the variance of each \(\alpha _d\) should be low. An MPCVD case having a high median \(\alpha _d\) with low variance across the vendors and products involved will mean that most vendors achieved acceptable outcomes.
To summarize:
The median \(\alpha _d\) for all histories \(h \in M\) should be positive and preferably above some benchmark constant \(c_d\), which may be different for each \(d \in \mathbb {D}\).
\begin{equation} Median(\lbrace \alpha _d(h) : h \in M \rbrace) \ge c_d \gt 0. \end{equation}
The variance of each \(\alpha _d\) for all histories \(h \in M\) should be low. The constant \(\varepsilon\) is presumed to be small.
\begin{equation} \sigma ^2(\lbrace \alpha _d(h) : h \in M \rbrace) \le \varepsilon . \end{equation}
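A minimal sketch of these two criteria follows; the per-vendor \(\alpha _{F \prec A}\) values, the benchmark constant, and the variance bound \(\varepsilon\) are hypothetical, and the use of population variance is our own choice.

# Hedged sketch of the two MPCVD criteria above. The per-vendor alpha values,
# the benchmark constant c_d, and the variance bound epsilon are hypothetical;
# the choice of population variance is our own.
from statistics import median, pvariance

alphas_FA = [0.82, 0.75, 0.91, 0.40, 0.88]  # hypothetical alpha_{F<A}, one per vendor/product
c_FA = 0.25                                 # hypothetical benchmark constant
epsilon = 0.05                              # hypothetical variance bound

meets_median = median(alphas_FA) >= c_FA          # 0.82 >= 0.25 -> True
meets_variance = pvariance(alphas_FA) <= epsilon  # ~0.034 <= 0.05 -> True
print(f"median criterion: {meets_median}, variance criterion: {meets_variance}")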

8.3 Reflections on the Influence of CVD Roles

CVD stakeholders include vendors, system owners, the security research community, coordinators, and governments [22]. Different stakeholders might want different things, although most benevolent parties will likely seek some subset of \(\mathbb {D}\). Because \(H\) is the same for all stakeholders, the expected frequencies shown in Table 4 will be consistent across any such variations in desiderata. A discussion of some stakeholder preferences is given below, while a summary can be found in Table 5. We notate these variations of the set of desiderata \(\mathbb {D}\) with subscripts: \(\mathbb {D}_v\) for vendors, \(\mathbb {D}_s\) for system owners, \(\mathbb {D}_c\) for coordinators, and \(\mathbb {D}_g\) for governments. In Table 3, we defined a preference ordering between every possible pairing of events; therefore, \(\mathbb {D}\) is the largest possible set of desiderata. We thus expect the desiderata of benevolent stakeholders to be a subset of \(\mathbb {D}\) in most cases. That said, we note a few exceptions in the text that follows.
Table 5. Ordering Preferences for Selected Stakeholders

\(d \in \mathbb {D}\) | Vendor (\(\mathbb {D}_v\)) | SysOwner (\(\mathbb {D}_s\)) | Coordinator (\(\mathbb {D}_c\))
\(V \prec P\) | yes | maybe\(^4\) | yes
\(V \prec X\) | yes | maybe\(^4\) | yes
\(V \prec A\) | yes | maybe\(^4\) | yes
\(F \prec P\) | yes | maybe\(^5\) | yes
\(F \prec X\) | yes | yes | yes
\(F \prec A\) | yes | yes | yes
\(D \prec P\) | maybe\(^1\) | maybe\(^1\) | yes
\(D \prec X\) | maybe\(^2\) | maybe\(^5\) | yes
\(D \prec A\) | maybe\(^2\) | yes | yes
\(P \prec X\) | yes | yes | yes
\(P \prec A\) | yes | yes | yes
\(X \prec A\) | maybe\(^3\) | maybe\(^3\) | maybe\(^3\)
\(^1\)When vendors control deployment, both vendors and system owners likely prefer \(D \prec P\). When system owners control deployment, \(D \prec P\) is impossible.
\(^2\)Vendors should care about orderings involving \(D\) when they control deployment, but might be less concerned if deployment responsibility is left to system owners.
\(^3\)Exploit publication can be controversial. To some, it enables defenders to test deployed systems for vulnerabilities or detect attempted exploitation. To others, it provides unnecessary adversary advantage.
\(^4\)System owners may only be concerned with orderings involving \(V\) insofar as it is a prerequisite for \(F\).
\(^5\)System owners might be indifferent to \(F \prec P\) and \(D \prec X\) depending on their risk tolerance.

8.3.1 Vendors.

As shown in Table 5, we expect vendors’ desiderata \(\mathbb {D}_v\) to be a subset of \(\mathbb {D}\). It seems reasonable to expect vendors to prefer that a fix is ready before either exploit publication or attacks (\(F \prec X\) and \(F \prec A\), respectively). Fix availability implies vendor awareness (\(V \prec F\)), so we would expect vendors’ desiderata to include those orderings as well (\(V \prec X\) and \(V \prec A\), respectively).
Vendors typically want to have a fix ready before the public finds out about a vulnerability (\(F \prec P\)). We surmise that a vendor’s preference for this item could be driven by at least two factors: the vendor’s tolerance for potentially increased support costs (e.g., fielding customer support calls while the fix is being prepared), and the perception that public awareness without an available fix leads to a higher risk of attacks.
When a vendor has control over fix deployment (\(D\)), it will likely prefer that deployment precede public awareness, exploit publication, and attacks (\(D \prec P\), \(D \prec X\), and \(D \prec A\), respectively).4 However, when fix deployment depends on system owners to take action, the feasibility of \(D \prec P\) is limited.5 Regardless of the vendor’s ability to deploy fixes or influence their deployment, it would not be unreasonable for them to prefer that public awareness precedes both public exploits and attacks (\(P \prec X\) and \(P \prec A\), respectively).
Ensuring the ease of patch deployment by system owners remains a likely concern for vendors. Conscientious vendors might still prefer \(D \prec X\) and \(D \prec A\) even if they have no direct control over those factors. However, vendors may be indifferent to \(X \prec A\).
Although our model only addresses event ordering, not timing, a few comments about the timing of events are relevant here since they reflect the underlying process from which \(H\) arises. Vendors have significant influence over the speed of \(V\) to \(F\) based on their vulnerability handling, remediation, and development processes [24]. They can also influence how early \(V\) happens by promoting a cooperative atmosphere with the security researcher community [23]. Vendor architecture and business decisions affect the speed of \(F\) to \(D\). Cloud-based services and automated patch delivery can shorten the lag between \(F\) and \(D\). Vendors that leave deployment contingent on system owner action can be expected to have longer lags, making it harder to achieve the \(D \prec P\), \(D \prec X\), and \(D \prec A\) objectives.

8.3.2 System Owners.

System owners ultimately determine the lag from \(F\) to \(D\) based on their processes for system inventory, scanning, prioritization, patch testing, and deployment: in other words, their vulnerability management (VM) practices. In cases where the vendor and system owner are distinct entities, system owners should optimize to minimize the lag between \(F\) and \(D\) in order to improve the chances of meeting the \(D \prec X\) and \(D \prec A\) objectives.
System owners might select a different subset of desiderata than vendors (\(\mathbb {D}_s \subseteq \mathbb {D}\), \(\mathbb {D}_s \ne \mathbb {D}_v\)). In general, system owners are primarily concerned with the \(F\) and \(D\) events relative to \(X\) and \(A\). Therefore, we expect system owners to be concerned about \(F \prec X\), \(F \prec A\), \(D \prec X\), and \(D \prec A\). As discussed above, \(D \prec P\) is only possible when the vendor controls \(D\). Depending on the system owner's risk tolerance, \(F \prec P\) and \(D \prec X\) may or may not be preferred. Some system owners may find \(X \prec A\) useful for testing their infrastructure, while others might prefer that no public exploits be available.

8.3.3 Security Researchers.

The “friendly” offensive security community (i.e., those who research vulnerabilities, report them to vendors, and sometimes release proof-of-concept exploits for system security evaluation purposes) can do their part to ensure that vendors are aware of vulnerabilities as early as possible prior to public disclosure (\(V \prec P\)). They can also delay the publication of exploits until after fixes exist (\(F \prec X\)) and possibly even until most system owners have deployed the fix (\(D \prec X\)). This does not preclude adversaries from doing their own exploit development on the way to \(A\), but it avoids providing them with unnecessary assistance.

8.3.4 Coordinators.

Coordinators have been characterized as seeking to balance the social good across both vendors and system owners [7]. This implies that they are likely interested in the union of the vendors’ and system owners’ preferences. In other words, coordinators want the full set of desiderata (\(\mathbb {D}_c = \mathbb {D}\)).
We pause for a brief aside about the design of the model with respect to the coordination role. We considered adding a Coordinator Awareness (\(C\)) event, but this would expand \(|H|\) from 70 to 452 because \(C\) could occur at any point in any \(h\). There is not much for a coordinator to do once the fix is deployed, however, so we could potentially reduce the expanded set to 329 histories by only including positions for \(C\) that precede the \(D\) event. This is still too large and unwieldy for meaningful analysis within our scope; instead, we simply offer the following comment.
The goal of coordination is this: regardless of the stage at which a coordinator becomes involved in a case, the objective is to choose actions that make preferred histories more likely and non-preferred histories less likely. A careful reading of the Hasse diagram in Figure 1 or the ranking in Table 2 can suggest available actions to improve outcomes. In practice, this means focusing coordination efforts as needed on vendor awareness, fix availability, fix deployment, and the appropriately timed public awareness of vulnerabilities and their exploits.

8.3.5 Governments.

In their defensive roles, governments are essentially acting as a combination of system owners, vendors, and—increasingly—coordinators. Therefore, we might anticipate \(\mathbb {D}_g = \mathbb {D}_c = \mathbb {D}\).
However, governments sometimes also have an adversarial role to play for national security, law enforcement, or other reasons. The model presented in this article could be adapted to that role as well by drawing some desiderata from the lower left triangle of Table 3. While defining such adversarial desiderata (\(\mathbb {D}_a\)) is out of scope for this article, we leave the topic with our expectation that \(\mathbb {D}_a \not\subseteq \mathbb {D}\).

9 Limitations and Future Work

This section highlights some limitations of the current work and lays out a path for improving on those limitations in future work. Broadly, the opportunities for expanding the model include modeling multiple agents, gathering more data about CVD in the world, accounting for fairness in MPCVD, addressing the importance of duration between events, modeling attacker behavior, and managing the impact of partial information.
Modeling Multiple Agents. We agree with the reviewer who suggested that an agent-based model could allow deeper examination of the interactions between stakeholders in MPCVD. Many of the mechanisms and proximate causes underlying the events this model describes are hidden from the model, and would be difficult to observe or measure even if they were included. Nevertheless, in order to reason about different stakeholders’ strategies and approaches to MPCVD, we need a way to measure and compare outcomes. The model we present here gives us such a framework, but it does so by making a tradeoff in favor of generality over causal specificity. We anticipate that future agent-based models of MPCVD will be better positioned to address process mechanisms, whereas this model will be useful to assess outcomes independently of the mechanisms by which they arise.
Gather Data About CVD. Section 8.1 discusses how different benchmarks and “reasonable baseline expectations” might change the results of a skill assessment. It also proposes how observations of the actions a certain team or set of teams performs could be used to create a baseline against which the skill of other CVD practitioners can be compared. Such data could also inform causal reasoning about certain event orderings and help identify effective interventions. For example, might causing \(X \prec F\) be an effective method to improve the chances of \(D \prec A\) in cases where the vendor is slow to produce a fix? Whether it is better to compare the skill of a team to blind luck via the i.i.d. assumption or to other teams via measurement remains an open question.
To address this question, a future research effort must collect and collate a large amount of data about the timing and sequence of events in the model for a variety of stakeholder groups and a variety of vulnerabilities. Deeper analysis using joint probabilities could then proceed if the modeling choice is to base the skill measure on past observations.
While there is a modeling choice between using the uniformity assumption and using observations from past CVD (see Section 8.1), the model does not depend on whether the uniformity assumption actually holds. We have provided a means to calculate, from observations, a deviation from the desired “reasonable baseline,” whether or not that baseline is based on the i.i.d. assumption. Although our research questions have yielded a method for evaluating skill in CVD, evaluating the overarching question of fairness in MPCVD requires a much broader view of CVD practice.
MPCVD Criteria Do Not Account for Equitable Resilience. The proposed criteria for MPCVD in Section 8.2 fail to account for either user populations or their relative importance. For example, suppose an MPCVD case had a total of 15 vendors, with 5 vendors representing 95% of the total userbase achieving highly preferred outcomes and 10 vendors with poor outcomes representing the remaining 5% of the userbase. The desired criteria (high median \(\alpha\) score with low variance) would likely be unmet even though most users were protected.
Similarly, it is possible that a smaller set of vendor/product pairs represents a disproportionate concentration of the total risk posed by a vulnerability,6 and again aggregation across all vendor/product pairs may be misleading. In fact, risk concentration within a particular user population may lead to a need for strategies that appear inequitable at the vendor level while achieving greater outcome equity at a larger scale.
The core issue is that we lack a utility function to map from observed case histories to harm reduction. Potential features of such a function include aggregation across vendors and/or users. Alternatively, it may be possible to devise a method for weighting the achieved histories in an MPCVD case by some proxy for total user risk. Other approaches remain possible as well, such as a heuristic that first avoids catastrophic outcomes for any party and then applies a weighted sum over the impact to the remaining users. Future work might also consider whether criteria other than high median and low variance could be applied. Regardless, achieving accurate estimates of such parameters is likely to remain challenging.
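The sketch below makes the 15-vendor example concrete with hypothetical numbers: the median criterion fails even though a user-weighted aggregate (one possible proxy, not a proposal of this article) suggests that most users were protected.

# Hedged sketch of the equity gap described above: 5 vendors covering 95% of
# users with good outcomes, 10 vendors covering 5% with poor outcomes. All
# numbers are hypothetical, and the user-weighted aggregate is just one
# possible proxy, not a proposal from this article.
from statistics import median

vendors = ([{"alpha": 0.9, "users": 0.19}] * 5 +     # 5 vendors, 95% of the userbase
           [{"alpha": -0.2, "users": 0.005}] * 10)   # 10 vendors, 5% of the userbase

case_median = median(v["alpha"] for v in vendors)               # -0.2: median criterion unmet
user_weighted = sum(v["alpha"] * v["users"] for v in vendors)   # ~0.85: most users still protected
print(f"median alpha = {case_median}, user-weighted alpha = {user_weighted:.2f}")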
The Model Has No Sense of Timing. There is no concept of time in this model, but delays between events can make a big difference in outcomes. Two cases in which \(F \prec A\) holds would be quite different if the time gap between those two events were 1 week versus 3 months, as this gap directly bears on the need for speed in deploying fixes. Organizations may wish to extend this model by setting timing expectations in addition to simple precedence preferences. For example, organizations may wish to specify service level objectives for \(V \prec F\), \(F \prec D\), \(F \prec A\), and so forth.
Furthermore, in the long run the elapsed time for \(F \prec A\) essentially dictates the response time requirements for vulnerability management (VM) processes for system owners. Neither system owners nor vendors get to choose when attacks happen, so we should expect stochasticity to play a significant role in this timing. However, if an organization cannot consistently achieve a shorter lag between \(F\) and \(D\) than between \(F\) and \(A\) (i.e., achieving \(D \prec A\)) for a sizable fraction of the vulnerability cases they encounter, it is difficult to imagine that organization being satisfied with the effectiveness of their VM program.
Similarly, the model casts each event \(e \in E\) as a singular point event, even though some—such as fix deployed \(D\)—would be more accurately described as diffusion or multi-agent (as above) processes. To apply this model to real world observations, it may be pragmatic to adapt the event definition to include some defined threshold criteria. A fixed quantile appears to be a reasonable approach. For example, a stakeholder might decide that their goal is for 80% of known systems to be patched. They then could observe the deployed fix ratio for their constituency and mark the event \(D\) as having occurred when that threshold is reached.
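A minimal sketch of that adaptation follows; the observation series and the 0.8 threshold are hypothetical.

# Hedged sketch of the fixed-quantile adaptation described above: mark the
# fix deployed event D at the first observation where the deployed ratio
# crosses a stakeholder-chosen threshold. The observations and the 0.8
# threshold are hypothetical.
from datetime import date

deployment_obs = [               # hypothetical (date, fraction of known systems patched)
    (date(2021, 3, 1), 0.20),
    (date(2021, 3, 8), 0.55),
    (date(2021, 3, 15), 0.83),
    (date(2021, 3, 22), 0.97),
]
THRESHOLD = 0.80                 # stakeholder-defined quantile for "deployed"

d_event = next((day for day, ratio in deployment_obs if ratio >= THRESHOLD), None)
print(f"D marked on {d_event}")  # 2021-03-15 in this hypothetical series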
Attacks as Random Events. In the model presented here, attacks are modeled as random events, but real attacks are not random. At an individual or organizational level, attackers are intelligent adversaries and can be expected to follow their own objectives and processes to achieve their ends. However, modeling the details of various attackers is beyond the scope of this model, and we believe that a stochastic approach to adversarial actions is reasonable from the perspective of a vendor or system owner. Furthermore, if attacks were easily predicted, we would be having a very different conversation.
Observation May Be Limited. Not all events \(e \in E\), and therefore not all desiderata \(d \in \mathbb {D}\), will be observable by all interested parties. In many cases at least some are, which can still help to infer reasonable limits on the others, as shown in Section 7.3. Vendors are well placed to observe most of the events in each case, even more so if they have good sources of threat information to bolster their awareness of the \(X\) and \(A\) events. A vigilant public can also be expected to eventually observe most of the events, although \(V\) might not be observable unless vendors, researchers, and/or coordinators are forthcoming with their notification timelines (as many increasingly are). \(D\) is probably the hardest event to observe for all parties, for the reasons described in the timing discussion above.

10 Related Work

Numerous models of the vulnerability life cycle and CVD have been proposed. Arbaugh, Fithen, and McHugh provide a descriptive model of the life cycle of vulnerabilities from inception to attacks and remediation [1], which we refined with those of Frei et al. [19] and Bilge and Dumitraş [8] to form the basis of this model, as described in Section 2. We also found Lewis' literature review of vulnerability lifecycle models to be useful [27].
Prescriptive models have also been proposed. Christey and Wysopal's 2002 IETF draft laid out a responsible disclosure process that prescribes roles and responsibilities for researchers, vendors, customers, and the security community [13]. The NIAC Vulnerability Disclosure Framework also laid out a prescriptive process for coordinating the disclosure and remediation of vulnerabilities [12]. The CERT Guide to Coordinated Vulnerability Disclosure provides a practical overview of the CVD process [22]. ISO/IEC 29147 describes standard externally facing processes for vulnerability disclosure from the perspective of a vendor receiving vulnerability reports, while ISO/IEC 30111 describes internal vulnerability handling processes within a vendor [23, 24]. The FIRST Product Security Incident Response Team (PSIRT) Services Framework provides a practical description of the capabilities common to vulnerability response within vendor organizations, and FIRST also provides a number of MPCVD scenarios [18]. Many of these scenarios can be mapped directly to the histories \(h \in H\) described here.
Benchmarking CVD capability is the topic of the Vulnerability Coordination Maturity Model (VCMM) from Luta Security [28]. The VCMM addresses five capability areas: organizational, engineering, communications, analytics, and incentives. Of these, our model is perhaps most relevant to the analytics capability, and it could be used to inform an organization's assessment of their progress in this dimension.
Economic analysis of CVD has also been done. Arora et al. explored the CVD process from an economic and social welfare perspective [2, 3, 4, 5, 6, 7], as did Silfversten et al. more recently [37]. Cavusoglu and Cavusoglu model the mechanisms involved in motivating vendors to produce and release patches [10]. Ellis et al. examine the dynamics of the labor market for bug bounties both within and across CVD programs [16]. Lewis highlights systemic themes within the vulnerability discovery and disclosure system [27]. Pupillo et al. explore the policy implications of CVD in Europe [35]. Moore and Householder modeled the interactions of VM and MPCVD processes [29]. A model for prioritizing vulnerability response that considers \(X\) and \(A\), among other impact factors, is found in Spring et al. [38].
Other work has examined the timing of events in the lifecycle, sometimes with implications for forecasting. Ozment and Schechter examine the rate of vulnerability reports as software ages [34]. Bilge and Dumitraş study 18 vulnerabilities in which \(A \prec P\), finding a lag of over 300 days [8]. Jacobs et al. propose an Exploit Prediction Scoring System [25], which could provide insight into \(V \prec A\), \(F \prec A\), and other desiderata \(d \in \mathbb {D}\). Householder et al. find that while only about 5% of vulnerabilities have public exploits available via commodity tools, for those that do the median lag between \(P\) and \(X\) was 2 days [20].
Frei et al. describe the timing of many of the events here, including \(F\), \(D\), \(X\), and \(P\), and the \({\Delta }t\) between them for the period 2000–2007 across a wide swath of industry [19]. In their analysis, they note that \(X \prec P\) in 15% of the vulnerabilities they analyzed. That means \(f^{obs}_{P \prec X}=0.85\), whereas from Table 4 we expect \(f_{P \prec X}=0.5\) from chance alone; we therefore compute \(\alpha _{P \prec X}=0.7\). Similarly, they report that a patch is available on or before the date of public awareness in 43% of vulnerabilities (our interpretation of their “zero day patch,” where \(t_{P}-t_{F}=0\), is that for \(P\) and \(F\) to happen on the same day the fix must have existed before the \(P\) event). In other words, \(f^{obs}_{F \prec P}=0.43\), giving us \(\alpha _{F \prec P}=0.36\) once we factor in \(f_{F \prec P}=0.111\).

11 Conclusion

We have described a model of all possible CVD histories \(H\) and defined a set of desired criteria \(\mathbb {D}\) that are preferable in each history. This allowed us to create a partially ordered set over all histories and to compute a baseline expected frequency for each desired criterion. We also proposed a new performance indicator for comparing actual CVD experiences against a benchmark, along with an initial benchmark based on the expected frequency of each desired criterion. We demonstrated this performance indicator on a few example datasets, indicating that at least some CVD practices appear to be doing considerably better than random. Finally, we posited a way to apply these metrics to measure the efficacy of MPCVD processes.

Acknowledgments

We would like to thank the anonymous reviewers of DTRAP for their helpful and constructive comments.

Footnotes

1
Specifically, skill in our model will align with fulfilling the duty of coordinated vulnerability disclosure, duty of confidentiality, duty to inform, duty to team ability, and duty of evidence-based reasoning.
2
Although we do believe there is some value in exploit publication because it allows defenders to develop detection controls (e.g., in the form of behavioral patterns or signatures). Even if those detection mechanisms are imperfect, it seems better that they be in place prior to adversaries using them than the opposite.
3
https://www.zerodayinitiative.com/blog. The Zero Day Initiative (ZDI) blog posts were more directly useful than the monthly Microsoft security updates because ZDI had already condensed the counts of how many vulnerabilities were known (\(P\)) or exploited (\(A\)) prior to their fix availability \(F\). Retrieving this data from Microsoft’s published vulnerability information requires collecting it from all the individual vulnerabilities patched each month. We are grateful to ZDI for providing this summary and saving us the effort.
4
On the other hand, it is possible that some vendors might actually prefer public awareness before fix deployment even if they have the ability to deploy fixes, for example, in support of transparency or trust building. In that case, \(\mathbb {D_V} \not\subseteq \mathbb {D}\), and some portions of the analysis presented here may not apply.
5
“Silent patches” can obviously occur when vendors fix a vulnerability but do not make that fact known. In principle, it is possible that silent patches could achieve \(D \prec P\) even in traditional COTS or OSS distribution models. However, in practice silent patches result in poor deployment rates precisely because they lack an explicit imperative to deploy the fix.
6
User concentration is one way to think about risk, but it is not the only way. Value density, as defined in [38] is another.

References

[1]
William A. Arbaugh, William L. Fithen, and John McHugh. 2000. Windows of vulnerability: A case study analysis. Computer 33, 12 (2000), 52–59.
[2]
Ashish Arora, Jonathan P. Caulkins, and Rahul Telang. 2006. Research note—Sell first, fix later: Impact of patching on software quality. Management Science 52, 3 (2006), 465–471.
[3]
Ashish Arora, Chris Forman, Anand Nandkumar, and Rahul Telang. 2010. Competition and patching of security vulnerabilities: An empirical analysis. Information Economics and Policy 22, 2 (2010), 164–177.
[4]
Ashish Arora, Ramayya Krishnan, Rahul Telang, and Yubao Yang. 2010. An empirical analysis of software vendors’ patch release behavior: Impact of vulnerability disclosure. Information Systems Research 21, 1 (2010), 115–132.
[5]
Ashish Arora, Anand Nandkumar, and Rahul Telang. 2006. Does information security attack frequency increase with vulnerability disclosure? An empirical analysis. Information Systems Frontiers 8, 5 (2006), 350–362.
[6]
Ashish Arora and Rahul Telang. 2005. Economics of software vulnerability disclosure. IEEE Security & Privacy 3, 1 (2005), 20–25.
[7]
Ashish Arora, Rahul Telang, and Hao Xu. 2008. Optimal policy for software vulnerability disclosure. Management Science 54, 4 (2008), 642–656.
[8]
Leyla Bilge and Tudor Dumitraş. 2012. Before we knew it: An empirical study of zero-day attacks in the real world. In Proceedings of the 2012 ACM Conference on Computer and Communications Security. 833–844.
[9]
Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science 16, 2 (2001), 101–133.
[10]
Hasan Cavusoglu, Huseyin Cavusoglu, and Srinivasan Raghunathan. 2007. Efficiency of vulnerability disclosure mechanisms to disseminate vulnerability knowledge. IEEE Transactions on Software Engineering 33, 3 (2007), 171–185.
[11]
CERT Coordination Center (CERT/CC). 2020. CERT Vulnerability Data Archive. Retrieved from https://github.com/CERTCC/Vulnerability-Data-Archive. (2020). Accessed: 2020-06-08.
[12]
John T. Chambers and John W. Thomson. 2004. National Infrastructure Advisory Council’s Vulnerability Disclosure Framework: Final Report and Recommendations. Retrieved from https://www.cisa.gov/publication/niac-vulnerability-framework-final-report. (2004). Accessed: 2020-07-27.
[13]
Steve Christey and Chris Wysopal. 2002. Responsible Vulnerability Disclosure Process. Retrieved from https://tools.ietf.org/html/draft-christey-wysopal-vuln-disclosure-00. (February 2002). Accessed: 2020-07-27.
[14]
Cyentia Institute. 2019. Prioritization to Prediction Volume 2: Getting Real About Remediation. Technical Report. Cyentia Institute, LLC.
[15]
Marcel Dreef, Peter Borm, and Ben Van der Genugten. 2004. Measuring skill in games: Several approaches discussed. Mathematical Methods of Operations Research 59, 3 (2004), 375–391.
[16]
Ryan Ellis, Keman Huang, Michael Siegel, Katie Moussouris, and James Houghton. 2017. New Solutions for Cybersecurity. MIT Press, Chapter Fixing a hole: The labor market for bugs., 122–147.
[17]
Benjamin Eva. 2019. Principles of Indifference. The Journal of Philosophy 116, 7 (2019), 390–411.
[18]
Forum of Incident Response and Security Teams. 2020. Guidelines and Practices for Multi-Party Vulnerability Coordination and Disclosure. Retrieved from https://www.first.org/global/sigs/vulnerability-coordination/multiparty/guidelines-v1.1. (2020). Accessed: 2020-07-27.
[19]
Stefan Frei, Dominik Schatzmann, Bernhard Plattner, and Brian Trammell. 2010. Modeling the security ecosystem-the dynamics of (in) security. In Proceedings of the Economics of Information Security and Privacy. Springer, 79–106.
[20]
Allen D. Householder, Jeff Chrabaszcz, Trent Novelly, David Warren, and Jonathan M. Spring. 2020. Historical analysis of exploit availability timelines. In Proceedings of the 13th USENIX Workshop on Cyber Security Experimentation and Test.
[21]
Allen D. Householder and Jonathan Spring. 2021. A State-Based Model for Multi-Party Coordinated Vulnerability Disclosure (MPCVD). Technical Report CMU/SEI-2021-SR-021. Software Engineering Institute, Carnegie-Mellon University, Pittsburgh, PA.
[22]
Allen D. Householder, Garret Wassermann, Art Manion, and Chris King. 2017. The CERT Guide to Coordinated Vulnerability Disclosure. Technical Report CMU/SEI-2017-SR-022. Software Engineering Institute, Carnegie-Mellon University, Pittsburgh, PA.
[23]
ISO. 2018. Information technology — Security techniques — Vulnerability disclosure. Standard 29147:2018. International Organization for Standardization, Geneva, CH.
[24]
ISO. 2019. Information technology — Security techniques — Vulnerability handling processes. Standard 30111:2019. International Organization for Standardization, Geneva, CH.
[25]
Jay Jacobs, Sasha Romanosky, Benjamin Edwards, Michael Roytman, and Idris Adjerid. 2021. Exploit prediction scoring system (EPSS). Digital Threats 2, 3 (2021).
[26]
Patrick Larkey, Joseph B. Kadane, Robert Austin, and Shmuel Zamir. 1997. Skill in games. Management Science 43, 5 (1997), 596–609.
[27]
Paul Simon Lewis. 2017. The global vulnerability discovery and disclosure system: A thematic system dynamics approach. PhD Dissertation, Cranfield University.
[28]
Luta Security. 2020. Vulnerability Coordination Maturity Model. Retrieved from https://www.lutasecurity.com/vcmm. (2020). Accessed: 2020-09-17.
[29]
Andrew P. Moore and Allen D. Householder. 2019. Multi-Method Modeling and Analysis of the Cybersecurity Vulnerability Management Ecosystem. Technical Report. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.
[30]
National Cyber Security Centre. 2018. Coordinated Vulnerability Disclosure: The Guideline. Technical Report. National Cyber Security Centre, Netherlands.
[31]
Lily Hay Newman. 2018. Senators Fear Meltdown and Spectre Disclosure Gave China an Edge. Retrieved from https://www.wired.com/story/meltdown-and-spectre-intel-china-disclosure/, Jul 2018. Accessed: 2022-06-07.
[32]
NIST. 2020. National Vulnerability Database. Retrieved from https://nvd.nist.gov. (2020). Accessed: 2020-06-08.
[33]
Offensive Security. 2020. Exploit DB. Retrieved from https://github.com/offensive-security/exploitdb. (2020). Accessed: 2020-06-08.
[34]
Andy Ozment and Stuart E. Schechter. 2006. Milk or wine: Does software security improve with age?. In Proceedings of the USENIX Security Symposium.
[35]
Lorenzo Pupillo, Afonso Ferreira, and Gianluca Varisco. 2018. Software Vulnerability Disclosure in Europe: Technology, policies and legal challenges, report of a CEPS Task Force. Report ISBN 978-94-6138-687-8, Center for European Policy Studies (CEPS), Place du Congrès 1, B-1000 Brussels, June 2018.
[36]
Rapid7. 2020. Metasploit Framework. Retrieved from https://github.com/rapid7/metasploit-framework. (2020). Accessed: 2020-06-08.
[37]
Erik Silfversten, William D. Phillips, Giacomo Persi Paoli, and Cosmin Ciobanu. 2018. Economics of vulnerability disclosure. Report TP-07-18-080-EN-N, European Union Agency for Network and Information Security (ENISA), 1 Vasilissis Sofias, Marousi 151 24, Attiki, Greece, December 2018.
[38]
Jonathan M. Spring, Eric Hatleback, Allen D. Householder, Art Manion, and Deana Shick. 2020. Prioritizing vulnerability response: A stakeholder-specific vulnerability categorization. In Proceedings of the Workshop on the Economics of Information Security. Brussels, Belgium.
[39]
Jeroen van der Ham and Shawn Richardson. 2020. Ethics for Incident Response and Security Teams. Retrieved from https://www.first.org/global/sigs/ethics/ethics-first. (Jun. 2020). Accessed: 2020-03-10.
[40]
Yang Xiao, Bihuan Chen, Chendong Yu, Zhengzi Xu, Zimu Yuan, Feng Li, Binghong Liu, Yang Liu, Wei Huo, Wei Zou, and Wenchang Shi. 2020. MVP: Detecting vulnerabilities using patch-enhanced vulnerability signatures. In Proceedings of the 29th USENIX Security Symposium (USENIX Security’20). 1165–1182.
[41]
Yifei Xu, Zhengzi Xu, Bihuan Chen, Fu Song, Yang Liu, and Ting Liu. 2020. Patch based vulnerability matching for binary programs. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 376–387.
