1 Introduction
The Linux kernel is popular, well-maintained, and widely used. As one of the largest open-source software projects, the Linux kernel has many vulnerabilities induced among the huge number of commits [
26]. As shown in
Figure 1, while the Linux kernel sees a steadily increasing number of commits, a number of vulnerabilities (i.e., around 100 to 500) are also induced per year. Previous works [
32,
35,
44,
54] proposed many different algorithms to detect these vulnerabilities. However, due to the complexity of the Linux kernel and the limitations of the proposed tools, many vulnerabilities still evade detection. As a result, vulnerabilities remain in the Linux kernel for a long time (i.e.,
\(1{,}732.97\) days on average [
12]) before being detected and patched, which makes the kernel insecure. Thus, understanding
Kernel Vulnerability Inducing Commits (KVICs) is important. It can help increase the precision of defect prediction [
25,
33,
34,
37,
46] and prevent vulnerabilities from being induced, making the kernel more robust, reliable, and secure. This motivates our study on KVICs. We believe it can provide deep insights and solid guidance for future researchers and help enhance existing defect prediction models on
It is non-trivial to understand KVICs. First, there are millions of commits in the Linux kernel, without any explicit flag indicating if one commit is a KVIC. Manually identifying KVICs is extremely time-consuming and error-prone. Meanwhile, existing algorithms (e.g., SZZ [
42]) do not work well in practice [
49]. Second, KVICs may vary a lot with different attributes. It is not an easy task to determine from which perspective to study and understand them.
To this end, we carefully designed steps for collecting KVICs. Specifically, we first identify the Linux kernel vulnerabilities from the
Common Vulnerabilities and Exposures (CVE) list [
4], which is an enumeration of CVE IDs and their corresponding vulnerabilities. Since direct identification of KVICs via CVE IDs is rather difficult, we locate the
Kernel Vulnerability Fixing Commits (KVFCs) first and then use them as a bridge to pinpoint KVICs. We propose a semi-automatic methodology by combining six different methods.
Specifically, we located the KVFCs from the commit message (
Section 3.2.1), commit URL (
Section 3.2.2), commit tag (
Section 3.2.3), downstream vendors (
Section 3.2.4), the research community (
Section 3.2.5), and the Linux Kernel CVEs project (
Section 3.2.6). After that, we merged the obtained KVFCs and manually verified the conflicts (if any). Finally, we located 2,290 KVFCs for 2,120 CVEs of the Linux kernel from July 2011 (release date of version 3.0) to October 2022, approximately spanning the range of the past decade. We thereafter propose four different methods for identifying KVICs from the obtained KVFCs. Specifically, we locate the KVICs by checking the “Fixes” tag (
Section 3.3.1), searching for the commit IDs from commit messages (
Section 3.3.2), referring to specific vendor information (
Section 3.3.3), and the Linux Kernel CVEs project (
Section 3.3.4). In this case, we collected 1,240 KVICs for 1,335 CVEs. Finally, we aim at understanding the KVICs by answering the following three
Research Questions (RQs):
–
What are the characteristics of the KVICs?
–
What are the purposes of KVICs?
–
How do the experience, expertise, and roles of the involved persons influence KVICs?
Our empirical study on the KVICs results in many useful insights and findings. First, we find that KVICs usually change more lines, conditional statements, and files compared to non-KVICs, and the changed files are more complex than general files. Second, we notice that changing particular modules (e.g., network, file system, kvm, and init) is more likely to induce vulnerabilities. Third, we find that about half of the KVICs aim to add new features, and commits attempting to fix bugs can still induce vulnerabilities. Finally, we explore the roles of people involved in the Linux kernel patch process and notice that KVICs have a limited number of reviewers. We also examine the number of reviewers along the time scale and find that KVICs do not follow the trend of steadily involving more reviewers over the past decade. We also propose the concept of commit knowledge and find that maintainers may not actually be very familiar with the commit they are working with.
Based on the above findings, we propose several suggestions that could be applied in future kernel development and maintenance to help reduce the induction of vulnerabilities. First, commits with specific characteristics (e.g., changing many conditional statements) should be carefully reviewed. Second, more attention should be paid to commits that add new features. Third, more reviewers should be involved for the commits, instead of merely carbon copying the commits to other people. Fourth, commits that change particular modules (i.e., init, net, kvm, mm, and fs) should be more carefully reviewed. Fifth, more people with higher commit knowledge should be involved. Sixth, developers are encouraged to use particular tags (e.g., “Fixes,” “Reviewed-by”) for categorizing commits and to refer to the CVE ID when patching vulnerabilities. Finally, researchers are encouraged to develop new algorithms for identifying KVICs automatically, based on our dataset and proposed metrics.
To conclude, we make the following contributions in this article:
Systematic and Sound KVIC Identification Approach. We propose a semi-automatic approach to identify KVICs by using KVFCs as a bridge with a rather low false positive rate. In particular, a total of 10 methods (6 for identifying KVFCs and 4 for identifying KVICs) are proposed.
Public Dataset. Using our identification approach, we identify 1,240 Linux kernel KVICs from 1,335 CVE IDs, with 2,290 KVFCs as the bridge, which spans the time period over the past ten years. We will release it to the community for further research.
Comprehensive Study. We conduct a comprehensive study on 1,240 KVICs and quantify their particular characteristics. This provides a deeper understanding of KVICs in the Linux kernel.
Insightful Findings and Suggestions. We obtain many interesting findings and insights, which enable us to propose several suggestions for reducing the induction of vulnerabilities.
3 Data Collection
To understand the KVICs, we first need to collect the dataset. However, the major difficulty comes from the lack of an explicit flag indicating that a commit is a KVIC. Furthermore, vulnerabilities vary a lot, making it non-trivial to find a mechanism that identifies the KVICs automatically and precisely; this is still an open question. To address these challenges, we collect the KVICs in three steps.
Figure 2 shows the process. First, we need to collect the vulnerabilities. We use the CVE list [
4] as the database and filter out the vulnerabilities of the Linux kernel by searching for related keywords (
Section 3.1). This can help to identify the candidate CVE IDs for the Linux kernel. Given one vulnerability (CVE ID) of the Linux kernel, there are no direct approaches to infer the corresponding KVIC. This is because the correlation between the KVIC and the vulnerabilities is indirect and implicit. To solve this problem, we use the KVFCs, which are identified with six different methods (
Section 3.2), as the bridge. After identifying KVFCs, we propose four different methods to identify the KVICs (
Section 3.3).
3.1 Kernel Vulnerability Collection
The CVE list is built by CVE Numbering Authorities, which are responsible for assigning and publishing CVEs. Thus, it is the official source of CVE data and should have no false negatives. We use the keyword Linux kernel to search for vulnerabilities of the Linux kernel and locate 3,464 CVEs. Note that we disregard CVEs assigned before July 2011, as Linux kernel versions before 3.0 are no longer maintained. This leaves 2,694 CVEs in total (from July 2011 to October 2022). Although entries for other software (e.g., the Android kernel) may also contain the keyword (i.e., Linux kernel), these false positives are filtered out during the process of identifying KVFCs and KVICs.
3.2 KVFC Identification
After identifying the CVEs of the Linux kernel, we need to identify the KVFCs. Given one CVE, the major difficulty comes from identifying the KVFC, as there is no explicit relationship between one CVE and the corresponding fix commit. Though the commit message usually indicates the purpose of the commit, not all authors explicitly state that the commit targets a specific CVE. One possible reason is that the authors were not aware of the existence of a corresponding CVE. Furthermore, reading the commit message requires manual effort and is error-prone. Different sites (e.g., NVD [
9]) or vendors (e.g., Ubuntu [
10]) may maintain CVE information. However, the provided information is incomplete and the sources may conflict with each other. Therefore, we cannot completely trust them.
To address the above-mentioned challenges, we collect the KVFCs with six different methods (i.e., \(FM1\), \(FM2\), \(FM3\), \(FM4\), \(FM5\), and \(FM6\)). Each method identifies KVFCs for a set of CVEs. After that, we merge the results obtained from the six methods. For one specific CVE, if KVFCs are identified by different methods, we check whether they conflict. If so, we manually verify them. To make the manual verification process reliable, two authors of this article verify the conflicting results separately. After that, the results from the two authors are compared. If they are consistent, the result is kept. Otherwise, the two authors discuss and check the detailed commit content to make the final decision. In the following, we detail how each method works and the results.
3.2.1 KVFC Identification Method 1: Commit Message.
Commit messages in a version control system provide a detailed description of the corresponding commits [
11,
45], where developers may explicitly claim a commit is to fix a specific CVE. With this observation, we first search all commit messages of the Linux kernel and check whether they contain the CVE IDs we obtained in
Section 3.1. After that, we conduct further filtering by selecting the ones that contain specific keywords (i.e.,
fix(ed)/patch(ed)/mitigate(d)) of fixing purposes. In total, we find 443 commits that contain CVE IDs. Among the 443 commits, 195 of them have the required keywords (i.e.,
fix(ed)/patch(ed)/mitigate(d)) in the same sentence containing the CVE ID. These commits are treated as the KVFCs. For the remaining 248 commits, we manually verify the result. Finally, we have 312 KVFCs for 288 CVEs with method 1.
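The keyword filter described above can be sketched as follows. This is a minimal illustration rather than the authors' tooling: the function name `classify_commit_message`, the `known_cves` argument, and the naive sentence splitting are assumptions introduced here, and the regular expression accepts only the exact keyword forms listed in the text.

```python
import re

# CVE identifiers look like CVE-2017-8063 (four-digit year, 4+ digit sequence).
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
# Fixing keywords from the text: fix(ed), patch(ed), mitigate(d).
FIX_WORD = re.compile(r"\b(?:fix(?:ed)?|patch(?:ed)?|mitigated?)\b", re.IGNORECASE)
# Naive sentence boundary (an assumption): ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def classify_commit_message(message, known_cves):
    """Return (auto_cves, manual_cves) for one commit message.

    auto_cves: CVE IDs that share a sentence with a fixing keyword.
    manual_cves: CVE IDs that are mentioned but need manual verification.
    """
    auto, manual = set(), set()
    for sentence in SENTENCE_END.split(message):
        has_fix_word = FIX_WORD.search(sentence) is not None
        for match in CVE_ID.finditer(sentence):
            cve = match.group(0).upper()
            if cve not in known_cves:
                continue  # not one of the CVEs collected in Section 3.1
            (auto if has_fix_word else manual).add(cve)
    # A CVE accepted automatically in one sentence needs no manual review.
    return auto, manual - auto
```

Commits whose first set is non-empty correspond to the automatically accepted ones; the second set corresponds to the commits that still go through manual verification.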
3.2.2 KVFC Identification Method 2: Commit URL.
The CVE Program [
4] is an authority that maintains related information for CVEs. In particular, it provides references for each CVE in the form of links pointing to blogs, email histories, fix commits, and so forth. We build a crawler to automatically examine each CVE listed by the CVE Program, extract the ones that represent a commit for the Linux kernel, and treat them as fix commits. Specifically, the link should start with
http://git.kernel.org/pub/scm/linux/kernel/git/ or
https://github.com/torvalds/linux/ and contain the
Secure Hash Algorithm-1 (SHA-1) value indicating a specific commit. For example,
https://github.com/torvalds/linux/commit/dd504589577d is a commit for fixing the CVE-2015-8970 [
2]. In total, we find 1,359 CVEs, which map to 1,427 KVFCs. Note that method 2 may produce false positives, as a CVE entry can contain incorrect references or the repository history may have been rewritten, leaving obsolete commit hashes. However, to make the overall results reliable, we further verify them by checking for unanimous agreement among all our KVFC identification methods.
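The link filter can be illustrated with a short sketch. The two URL prefixes are the ones quoted above; the function name `commit_from_reference` and the choice to accept abbreviated hashes of at least 7 hexadecimal digits are assumptions made for this example, not details taken from the crawler.

```python
import re

# URL prefixes quoted in the text above.
KERNEL_PREFIXES = (
    "http://git.kernel.org/pub/scm/linux/kernel/git/",
    "https://github.com/torvalds/linux/",
)
# A full SHA-1 has 40 hex digits; links often carry an abbreviation.
# The lower bound of 7 digits is an assumption of this sketch.
SHA1 = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{7,40}(?![0-9a-fA-F])")


def commit_from_reference(url):
    """Return the commit hash found in a kernel commit link, else None."""
    for prefix in KERNEL_PREFIXES:
        if url.startswith(prefix):
            # Search only the part after the prefix for a hash-like token.
            match = SHA1.search(url[len(prefix):])
            return match.group(0).lower() if match else None
    return None
```

For the example link above, the sketch returns `dd504589577d`. A path component that happens to consist of hexadecimal digits would also match, which is one more reason the extracted commits still need cross-checking against the other methods.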
3.2.3 KVFC Identification Method 3: Commit Tag.
Apart from the CVE Program, NVD [
9] also provides some reference links for each CVE. Furthermore, NVD assigns a tag named “Patch” for commits that are used to fix the corresponding CVE. We first find candidate KVFCs for each CVE by extracting links with the “Patch” tag. After that, we check whether the link represents a commit for the Linux kernel by using the same policy as introduced in
Section 3.2.2. If so, we treat it as a KVFC. With this method, we automatically locate 1,127 KVFCs, which map to 1,066 CVEs.
3.2.4 KVFC Identification Method 4: Vendors.
We surveyed many popular vendors including Debian, SUSE, ArchLinux, Red Hat, Gentoo, Android, and Ubuntu. We noticed that only Ubuntu [
10] and Android [
1] provide explicit mapping relationships between CVEs and their corresponding KVFCs. The maintained information is reliable as it has been verified by the vendors’ security experts. Thus, we utilize the information provided by these two vendors.
Specifically, Android releases its security bulletins including the CVE that is related to the upstream kernel (i.e., Linux kernel). The corresponding fix commit is usually attached. Ubuntu maintains a mapping between KVFCs and their corresponding KVICs for each CVE, as shown in
Figure 3. From the Android Security Bulletins, we located 144 KVFCs, which map to 135 CVEs. From Ubuntu, we located 1,920 KVFCs, which map to 1,687 CVEs.
3.2.5 KVFC Identification Method 5: Research community.
Besides the above methods, we also consider the manually curated mappings from CVE IDs to KVFCs from the research community. Alexopoulos et al. [
12] built a dataset of KVFC mappings to study the lifetime of vulnerabilities and released it to the public. We incorporated it into our research and obtained 1,473 CVEs that map to 1,528 KVFCs.
3.2.6 KVFC Identification Method 6: Linux Kernel CVEs Project.
The Linux Kernel CVEs project [
7] is another source that keeps track of KVFCs. According to its official introduction, its KVFCs were generated automatically by a set of tools, some of which are unpublished. We incorporated the data from this project into our research and obtained 1,959 CVEs that map to 1,863 KVFCs.
3.2.7 Determining final KVFCs.
To make our collected data reliable, we combine the results extracted from the six methods and compare them with each other. If there are conflicts among the results of at least two methods, we manually check them. In this way, we can reduce the false positives. Our empirical study tolerates false negatives, as the collected data is not biased toward any particular feature of the commit itself. Thus, the lessons learned and the insights obtained can represent the KVICs of the Linux kernel as a whole.
Table 1 shows the final results. In total, 2,290 commits are identified as KVFCs for 2,120 CVEs. Among the 2,290 KVFCs, 538 that map to 278 CVEs were verified manually due to conflicts between different methods. Note that this is a relatively low rate of disagreement; thus, our method does not require a significant amount of manual effort and can scale to other projects. Among the 278 CVEs, 221 (i.e., 79.50%) have two different results across the six methods, while 53 CVEs (i.e., 19.06%) have three different results. Only 4 CVEs (i.e., 1.44%) have four different results, suggesting that our methods agree on most of the CVEs.
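The merge-and-verify step can be sketched as below. This is an illustrative outline under stated assumptions: the function name `merge_methods` and the input layout (one dictionary per method, mapping a CVE ID to a set of commit hashes) are hypothetical, and all hashes are assumed to be normalized to the same length beforehand so that equal commits compare equal.

```python
from collections import defaultdict


def merge_methods(results_by_method):
    """Split CVEs into agreed results and conflicts needing manual review.

    results_by_method maps a method name (e.g. "FM1") to a dict from
    CVE ID to the set of fix-commit hashes reported by that method.
    """
    answers = defaultdict(dict)  # CVE -> {method: frozenset of commits}
    for method, mapping in results_by_method.items():
        for cve, commits in mapping.items():
            answers[cve][method] = frozenset(commits)

    agreed, conflicts = {}, {}
    for cve, per_method in answers.items():
        distinct = set(per_method.values())
        if len(distinct) == 1:
            # Every method that reported this CVE gave the same answer.
            agreed[cve] = set(next(iter(distinct)))
        else:
            # Keep every answer so the two reviewers can compare them.
            conflicts[cve] = per_method
    return agreed, conflicts
```

For a conflicting CVE, `len(set(per_method.values()))` gives the number of different results, which is the quantity summarized above (two, three, or four different results per CVE).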
3.3 KVIC Identification
After collecting the KVFCs, we now describe how the KVICs are identified. Similar to identifying KVFCs, there is no explicit relationship between a KVFC and its corresponding KVIC. Meanwhile, existing algorithms (e.g., SZZ [
42]) do not work well in practice [
49]. Furthermore, many sites that maintain the CVE information do not provide information for KVICs. Therefore, we rely on KVFCs as the bridge and utilize four different methods to identify the KVICs. We also merge the results from the four methods under unanimous agreement of the methods. If there are conflicts between any two methods, we manually verify them. We detail the four methods in the following.
3.3.1 KVIC Identification Method 1: Fixes tag.
According to the Linux kernel documentation, when attempting to fix a bug induced by specific commits, it is suggested to add a “
Fixes:” tag with the first 12 characters of the SHA-1 ID of the KVIC. Utilizing this tagging convention, we can pinpoint the KVICs. In this case, we first match the keyword “
Fixes:” in the commit message of the identified KVFCs. After locating “
Fixes:” tags, we retrieve the SHA-1 ID, which represents the KVIC, that follows the tag.
Figure 4 shows an example, where given
d1f82808877 as a KVFC for CVE-2021-3491, “
ddf0322db79c” is identified as a KVIC since it appears after a “
Fixes:” tag in the commit message of a KVFC. With this method, we locate 624 KVICs for 612 CVEs. Note that there are more KVICs than CVEs since KVICs and CVEs are not in one-to-one correspondence, i.e., one KVIC could correspond to more than one vulnerability, and vice versa. However, many authors may not strictly follow the aforementioned tagging convention, since it is not a mandatory requirement. As a result, many of the KVICs cannot be found by identifying the keyword
Fixes. To make our study comprehensive, we need new methods to complement the existing result.
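Extracting the commit ID that follows a “Fixes:” tag can be sketched with a regular expression. The sketch below is an illustration, not the authors' script: the function name `kvics_from_fixes_tags` is hypothetical, and accepting 12 to 40 hexadecimal digits (rather than exactly 12) is an assumption that also covers full hashes.

```python
import re

# Matches tag lines such as:  Fixes: ddf0322db79c ("subject line")
# The text mentions the first 12 characters of the SHA-1 ID; accepting up
# to 40 hex digits is an assumption of this sketch.
FIXES_TAG = re.compile(
    r"^\s*Fixes:\s*([0-9a-f]{12,40})\b",
    re.IGNORECASE | re.MULTILINE,
)


def kvics_from_fixes_tags(commit_message):
    """Return the commit IDs named by "Fixes:" tags, in order of appearance."""
    return [m.group(1).lower() for m in FIXES_TAG.finditer(commit_message)]
```

A KVFC may carry several such tags, so the function returns a list; each entry is a candidate KVIC for the CVE that the KVFC fixes.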
3.3.2 KVIC Identification Method 2: Commit ID.
Though the commit message may not contain “
Fixes” tag, the authors may discuss how the vulnerability is induced in the commit message by referring to a specific commit ID. In this case, we scan all the Linux kernel commits and check whether they contain any SHA-1 hash value. If so, we further check whether it represents a commit ID.
Figure 5 shows an example. Though the author does not add the “Fixes” tag in the message of commit
3f190e3aec21, they discuss how the vulnerability (i.e., CVE-2017-8063) is induced. According to line 5, commit
17ce039b4e54 is identified as the KVIC. With this method, 145 KVICs are located, which map to 133 CVEs.
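The two checks in this method (find a hash-like token, then confirm that it names a commit) can be sketched as follows. The names `referenced_commits` and `known_commit_ids` are hypothetical; the latter stands for the full commit hashes of the repository, which could be exported with, for example, `git rev-list --all`. The linear prefix scan keeps the sketch short and would need an index for millions of commits.

```python
import re

# Hash-like tokens: 7 to 40 hex digits (the lower bound is an assumption).
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")


def referenced_commits(message, known_commit_ids):
    """Return known commits that the message refers to by (abbreviated) hash."""
    found = []
    for token in HEX_RUN.findall(message.lower()):
        # Keep the token only if it is an unambiguous prefix of a real commit.
        matches = [c for c in known_commit_ids if c.startswith(token)]
        if len(matches) == 1 and matches[0] not in found:
            found.append(matches[0])
    return found
```

The prefix check filters out ordinary words that happen to be valid hexadecimal (such as "defaced"). Whether a referenced commit actually induced the vulnerability still depends on how the message describes it.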
3.3.3 KVIC Identification Method 3: Vendors.
As shown in
Figure 3, some downstream vendors (e.g., Ubuntu) also maintain the KVIC information. Thus, we crawled the information from their official site. Since the information maintained by Ubuntu may not always be correct, we first check whether the KVFC is identified correctly, i.e., whether it is included in our final 2,290 KVFCs. If not, the provided KVIC is not considered. With this method, we have 1,028 KVICs, which map to 1,071 CVEs.
3.3.4 KVIC Identification Method 4: Linux Kernel CVEs Project.
As discussed in
Section 3.2.6, the Linux Kernel CVEs project maintains a public dataset for KVFCs. In addition, it also keeps track of KVICs, and we incorporate their data into our research. With this method, 1,036 KVICs are located, which map to 1,940 CVEs.
3.3.5 Determining final KVICs.
Similar to the steps of collecting KVFCs, we merge the KVICs obtained from the above-mentioned four methods and manually check the conflicts.
Table 2 summarizes the results. In total, we identified 1,240 KVICs that induce 1,335 CVEs. About 260 KVICs that map to 213 CVEs are verified manually.
It is worth noting that one CVE can map to more than one KVIC. This is because there are some cases where multiple KVICs collectively introduced one CVE vulnerability. For example, CVE-2021-23134 is a vulnerability caused by two KVICs (i.e., c33b1cc62ac05c1dbb1cdafe2eb66da01c76ca8d and 8a4cd82d62b5ec7e5482333a72b58a4eea4979f0), where these two commits both attempted to fix a refcount leak bug, but collectively introduced a use-after-free issue. This issue was later fixed by commit c61760e6940dd4039a7f5e84a6afc9cdbf4d82b6. Overall, most (i.e., 1,261) of the 1,335 identified CVEs correspond to only one KVIC, while only a few (i.e., 74) have multiple KVICs.
Note that the collected data may contain false negatives and false positives. However, we tolerate the false negatives, as the data collection process does not favor any particular kind of commit. Meanwhile, the identified 1,240 KVICs are representative for an empirical study: by carefully studying them, we can obtain insights that represent the whole population of KVICs. The false positive ratio should be quite low, as we use six different methods in the process of collecting KVFCs and four different methods in the process of collecting KVICs. We further cross-verified the results manually where conflicts exist. To prevent the manual verification from introducing bias, all the verification processes are conducted by two authors separately.
In particular, we also counted the number of different results for each CVE. We find that for the KVICs, 201 out of the 213 conflicting CVEs have two different answers among the four methods. The other 12 CVEs have three different results among the four methods. Taking CVE-2019-3819 as an example, its associated KVFC is 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035. Our method 1 (i.e., Fixes tag) identifies cd667ce24796700e1a0e6e7528efc61c96ff832e and 717adfdaf14704fd3ec7fa2c04520c0723247eac as the KVICs, while methods 3 and 4 (i.e., Vendors and Linux Kernel CVEs Project) only identify 717adfdaf14704fd3ec7fa2c04520c0723247eac. This is potentially due to one of the vendors missing one of the KVICs, and then the Linux Kernel CVEs Project extracting such information from the vendor, so that both miss the extra KVIC. After manual inspection, we determined that both KVICs identified by method 1 should be included in the final dataset.
3.4 RQs
We answer three RQs to understand the KVICs.
First, KVICs may exhibit specific characteristics related to factors such as modified size, content, complexity, and the severity of induced CVEs. Gaining a comprehensive understanding of these characteristics is crucial for guiding future efforts in KVIC detection and deepening our understanding of their causes. This forms the basis for our first research question.
Second, different KVICs may have different purposes. To understand how a KVIC is induced, we need to understand their intended functionalities in the first place. This can help maintainers to better address what kinds of commits may be KVICs. Consequently, we study the following research question.
Third, open source software like the Linux kernel is highly complex and it is common for authors or maintainers to induce KVICs. The induced vulnerabilities vary a lot in terms of their root causes and it is difficult to propose a precise and general algorithm that prevents the induction of vulnerabilities. Hence, we focus on studying the human factors of KVICs, especially the involved persons and their familiarity with the KVICs. This provides insights on which aspects of the human factors play a more critical part in making a commit more or less secure. Based on this, we make suggestions on how to mitigate KVICs.
4 Characteristics of KVIC (RQ1)
In this section, we characterize KVICs from various perspectives, including the modified size, modified file’s complexity, KVICs’ complexity, modified contents, type of Common Weakness Enumerations (CWEs), and severity.
4.1 Modified Size of KVICs
Previous studies [
33,
34] showed that adding or deleting more lines renders the commit more prone to induce defects as large modification is usually complex. To explore whether KVICs make large modifications, we compare the 1,240 KVICs with the general commits (
Section 2.1).
Figures 6–
8 show the
Cumulative Distribution Function (CDF) plot of this comparison, where the blue line with circular markers represents the KVICs and the orange line with triangle markers represents all the other commits.
Figures 10–
13,
18, and
19 follow the same format.
According to
Figure 6, KVICs add and delete more lines compared with the general commits. Apart from the total lines of code, we also compare the changed conditional statements (
if or
loop). According to
Figures 7 and
8, KVICs modify more conditional statements than general commits. In particular, there is a much larger gap in the case of the number of added conditional statements, compared with that of the deleted ones. We suspect that this is because the addition of conditional statements may not be fully tested, as they may introduce new logical flows with flaws that could evade existing tests. On the other hand, deleting existing statements does not drastically reduce the range of effectiveness of existing tests.
Furthermore, the total number of modified lines of code can introduce confounding effects. To mitigate this effect, we present the plots for the number of modified conditional statements, encompassing both if and loop statements, in
Figure 9(a) and (b). These figures demonstrate that KVICs exhibit a higher proportion of modifications to both if and loop statements when normalized per 10,000 lines of modified code. This finding further strengthens our previous observations.
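The per-10,000-lines normalization can be illustrated with a small sketch that works on unified diff text. It is only an approximation of the measurement described above: the function names are hypothetical, `for` and `while` are assumed to be the loop keywords, and keywords inside comments or string literals are counted as well.

```python
import re

# `if` plus the loop keywords; treating for/while as "loop" is an assumption.
CONDITIONAL = re.compile(r"\b(?:if|for|while)\b")


def conditional_stats(diff_text):
    """Count changed lines and changed if/loop keywords in a unified diff."""
    stats = {"added": 0, "deleted": 0, "added_cond": 0, "deleted_cond": 0}
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not code
        if line.startswith("+"):
            kind = "added"
        elif line.startswith("-"):
            kind = "deleted"
        else:
            continue  # context lines and hunk headers
        stats[kind] += 1
        stats[kind + "_cond"] += len(CONDITIONAL.findall(line[1:]))
    return stats


def per_10k_modified_lines(stats):
    """Changed conditional statements per 10,000 modified lines of code."""
    modified = stats["added"] + stats["deleted"]
    if modified == 0:
        return 0.0
    return 10_000 * (stats["added_cond"] + stats["deleted_cond"]) / modified
```

Dividing by the number of modified lines removes the effect of commit size, so a large commit does not rank high merely because it touches more code.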
Furthermore, we also examine the difference between KVICs and the general commits along the time scale. We plot the number of modified lines on a yearly basis for both KVICs (
Figure 11) and general commits (
Figure 10). Over the past 10 years, the KVICs tend to modify fewer lines, while general commits exhibit nearly no fluctuation. At the same time, the Linux kernel is growing in size and becoming more complex. In addition, the number of modified lines of KVICs is consistently larger than that of the general commits. Since the Linux kernel is getting more complicated, it is becoming more likely for smaller and simpler commits to induce vulnerabilities.
4.2 Modified Files’ Complexity of KVICs
Apart from the modified size, the original size of changed files can also influence the possibility of inducing vulnerabilities [
28]. The intuition is that the complexity of a file is usually associated with the number of lines of code. The more complex a file is, the wider attack surface there will be, resulting in a higher chance for the authors or reviewers to neglect potential vulnerabilities. In this case, for each KVIC, we first locate the modified files. We then revert them to the version before each KVIC, and count the total number of lines of code for all the modified files. Since reverting the code for each commit is time consuming, we randomly selected a set of samples of sizes 1,000, 3,000, 5,000, 10,000, and 30,000 from general commits. We calculated the original lines of changed files of both KVICs and the sampled commits.
Figure 12 shows the CDF plot. We noticed that the samples converge around the same curve regardless of the sample size, which shows that our samples are representative of the general commits and do not involve drastic variance. The CDF plots of the samples are clearly different from that of the KVICs, which shows that the files changed by KVICs contain more lines of code compared with general commits. In particular, 50% of the KVICs changed files consisting of more than 2,899 lines of code. This number drops to 1,597 for the samples on average. This indicates that changing complex files is associated with a higher likelihood of inducing vulnerabilities.
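The comparison above rests on two simple operations: an empirical CDF and the median of random samples of different sizes. The sketch below shows both with the Python standard library; the function names and the fixed seed are assumptions for illustration, and `population` stands for a hypothetical list of original file sizes (in lines of code) for general commits.

```python
import random
from bisect import bisect_right
from statistics import median


def ecdf(values):
    """Return a function F where F(x) is the share of values <= x."""
    ordered = sorted(values)
    return lambda x: bisect_right(ordered, x) / len(ordered)


def sample_medians(population, sizes, seed=0):
    """Median of one random sample (without replacement) per sample size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {
        n: median(rng.sample(population, n))
        for n in sizes
        if n <= len(population)
    }
```

If the medians for the different sample sizes stay close to each other, the samples behave consistently, which is the kind of convergence observed in the figure.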
4.3 Complexity of KVICs
Shannon’s entropy can be used to quantify the complexity of commits. As shown by previous work [
17,
23], complex commits are usually defect-prone. This makes sense as complex commits may change many files and make modifications across various subsystems, where the authors and reviewers may not have the complete knowledge on all the modified codes, resulting in faults. We also study the impact of commit complexity on inducing new vulnerabilities. Formally, given a commit
\(C\), we define the complexity of
\(C\) with its entropy
\(Entropy_{C}\) as:
\[Entropy_{C} = -\sum_{i=1}^{n} P_{i}\log_{2}P_{i},\]
where
\(n\) denotes the number of modified files by
\(C\), and
\(P_{i}\) denotes the proportion of modification for file \(i\), namely the number of modified lines of code specifically in file
\(i\) divided by the total number of modified lines of code on all files by
\(C\).
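As a concrete illustration of the definition, the following sketch computes the entropy of a commit from the number of modified lines in each file. It assumes the standard base-2 Shannon form over the proportions \(P_{i}\); the function name `commit_entropy` is hypothetical.

```python
import math


def commit_entropy(lines_per_file):
    """Shannon entropy (base 2) of a commit's changes across files.

    lines_per_file holds the number of modified lines for each file
    touched by the commit.
    """
    total = sum(lines_per_file)
    if total == 0:
        return 0.0
    entropy = 0.0
    for lines in lines_per_file:
        if lines > 0:  # the term 0 * log2(0) is taken as 0
            p = lines / total  # proportion P_i of the modification
            entropy -= p * math.log2(p)
    return entropy
```

A commit that changes a single file has entropy 0, and one that spreads its changes evenly over two files has entropy 1, the maximum for two files. Under this form, an entropy above 1 is therefore only possible when at least three files are modified.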
Figure 13 shows the CDF plot comparing the entropy of KVICs with the general commits. In particular, the entropies of 19.9% of the general commits are greater than 1, while for KVICs this number is 37.1%. An entropy greater than 1 indicates a more widespread and dispersed distribution of changes across different files. Therefore, our results suggest that, relative to general commits, KVICs exhibit a higher degree of dispersion in their changes across files. It is important to note that the entropy values themselves do not have fixed thresholds or specific interpretations. In our study, we use entropy to provide more specific comparisons between KVICs and general commits.
As a previous work pointed out [
23], the varying number of files in a software system is another factor that needs to be accounted for when computing entropy, and the resulting entropy value is called “Normalized Static Entropy.” This is because the number of files affects the maximum possible entropy that a certain modification could obtain, where further dividing by the maximum possible entropy (i.e.,
\(\log_{2}n\)) brings the comparison down to the same scale. The Normalized Static Entropy, \(H\), is computed as follows:
\[H = \frac{Entropy_{C}}{\log_{2}n} = -\frac{1}{\log_{2}n}\sum_{i=1}^{n} P_{i}\log_{2}P_{i}.\]
Then we compute the entropy using the above equations following the same paradigm as in
Figure 13. The result is shown in
Figure 14. It can be observed from
Figures 13 and
14 that there is still a clear trend of KVICs having higher entropy compared with the general commits under the metric of normalized static entropy, thus reinforcing the conclusion that changes made by KVICs are more widely distributed across different files.
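The normalization can be sketched as follows, assuming that the entropy is the base-2 Shannon entropy over the per-file proportions and that the maximum possible entropy for \(n\) modified files is \(\log_{2}n\). The function name is hypothetical, and returning 0 for commits that modify at most one file is a convention chosen here, since \(\log_{2}1 = 0\) would otherwise lead to a division by zero.

```python
import math


def normalized_static_entropy(lines_per_file):
    """Entropy of a commit divided by its maximum possible value log2(n)."""
    changed = [count for count in lines_per_file if count > 0]
    n = len(changed)  # number of files with at least one modified line
    if n <= 1:
        return 0.0  # convention of this sketch: avoids dividing by log2(1) = 0
    total = sum(changed)
    entropy = -sum((c / total) * math.log2(c / total) for c in changed)
    return entropy / math.log2(n)
```

After normalization, the value lies between 0 and 1 regardless of how many files a commit touches: an even spread over two files and an even spread over four files both score 1, whereas their unnormalized entropies are 1 and 2.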
4.4 Modified Content of KVICs
The Linux kernel consists of various modules (e.g., file system, network), which are maintained by different developers. Since some functionalities vary across modules, modifications to some modules may be more likely to induce vulnerabilities compared with the others. Based on this insight, we examined the modified contents of each KVIC at three different granularities, which are the modified files, directories, and subsystems.
In the Linux kernel, one file is usually responsible for one particular functionality. Meanwhile, files under the same directory usually have similar functionalities and they collectively form a rather complex module. As for the subsystems, we denote them as the top-level directories of the Linux kernel (e.g.,
net,
drivers), whose scope is much larger than that of the directories. For example, the subsystem
net is responsible for network-related functionalities while
drivers contains different kinds of drivers.
Figures 15–
17 show three ranked plots for the top-ten files, directories, and subsystems that are most frequently modified by KVICs. We also show the frequency of modification for the general commits, normalized via dividing by the number of total commits and multiplying by the number of KVIC commits. This normalization brings the comparison between the KVICs and the general commits to the same scale (i.e., the number of modifications per 1,240 commits).
Note that we introduced two optimizations. First, we observe that directories or subsystems may contain varying numbers of files, and each file can have a different number of lines of code. This introduces bias, as directories or subsystems with more files or lines of code are often more likely to be modified. To address this bias, we normalize the number of modifications to a directory or subsystem by dividing it by the total number of lines of code across all files within that directory or subsystem. To enhance readability in the figures, we have multiplied the resulting numbers by 1,000. However, it is essential to understand that this adjustment does not alter the relative scale of the data. Second, some directories may contain only few files, resulting in a relatively high normalized value if some of its contained files get modified by KVICs. To address this, we exclude directories that consist of fewer than 10 files.
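The normalizations described in the last two paragraphs amount to a few lines of arithmetic, sketched below. The function names and the dictionary-based inputs (`modifications`, `lines_of_code`, and `file_counts`, each keyed by directory or subsystem) are hypothetical; the constants 1,240, 1,000, and 10 are the ones stated in the text.

```python
def scale_to_kvic_count(general_modifications, total_general_commits,
                        kvic_commits=1240):
    """Rescale a general-commit modification count to 'per 1,240 commits'."""
    return general_modifications / total_general_commits * kvic_commits


def normalized_directory_scores(modifications, lines_of_code, file_counts,
                                min_files=10, scale=1000):
    """Modifications per line of code (times `scale`) for each directory.

    Directories with fewer than `min_files` files are excluded, following
    the second optimization described above.
    """
    scores = {}
    for directory, count in modifications.items():
        if file_counts.get(directory, 0) < min_files:
            continue  # too few files: the normalized value would be inflated
        loc = lines_of_code.get(directory, 0)
        if loc > 0:
            scores[directory] = scale * count / loc
    return scores
```

Multiplying by a constant such as 1,000 changes only the units of the scores, so the ranking of directories is unaffected.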
As Figure 15 shows, 25 KVICs modified kernel/bpf/verifier.c, the highest count among all files. Meanwhile, 21 KVICs modified arch/x86/kvm/x86.c, which suggests that the x86 KVM code is particularly vulnerability-prone. As for the directories, shown in Figure 16, we noticed that net/ipv6/netfilter, net/ipv4/netfilter, and arch/x86/entry are the top three directories modified by KVICs, indicating that developers should pay more attention to security effects when changing files under these directories. In addition, six of the top directories are under the network subsystem (i.e., net/core, net/xfrm, net/ax25, net/ipv6/netfilter, net/ipv4/netfilter, and net/rds), and one is under the file system (i.e., fs/proc).
Figure 17 shows that init contains the second-highest number of files changed by KVICs. This deserves attention, as KVICs that change files in init can influence the initialization process of the Linux kernel, resulting in a huge impact. Modules such as the file system and network tend to have more dependencies, which makes them vulnerability-prone, as authors need to be familiar with all the related modules to ensure security. Another reason is that these modules have been tested thoroughly by various fuzzers over the years, so more of their vulnerabilities have been detected. In the future, further testing of other modules will be needed.
4.5 Type and Severity of KVICs
Different KVICs may induce vulnerabilities of different types and severities. In this article, we use the CWE [
6] to categorize vulnerabilities and the CVSS version 3.0 [
5] to evaluate the severity of a CVE. To accomplish this, we crawled the CWE and severity information from NVD [
9].
Table 3 shows the result. In total, 9 CWEs correspond to more than 800 KVICs, which are listed separately in the table.
Among them, Use After Free corresponds to the highest number of KVICs (i.e., 178), and more than half of the CWEs (i.e., 416, 476, 119, 401, 787, and 125) are memory-related. CVEs belonging to these CWEs have relatively high severities. For example, CWE 787 has the highest average severity (i.e., 7.50) and CWE 119 has the second highest (i.e., 7.32). Furthermore, we observed that the KVICs belonging to CWE 401 (“Missing Release of Memory after Effective Lifetime”) add, delete, and change noticeably more lines than those of any other CWE. They also have the highest entropy. This is because the KVICs under this CWE may change many files, and the lifetime of the memory may span different files, resulting in missing memory releases. In addition, the large number of changed lines makes it harder for maintainers to find the issue.
5 Purposes of KVIC (RQ2)
In this section, we aim to understand the purpose of KVICs. We classify purposes into the following categories based on previous studies [
24,
40,
43]:
–
Correction: To fix an implementation bug or a disclosed vulnerability. The commit messages usually contain keywords such as “fix,” “patch,” and so forth. They may also contain a “Fixes:” tag followed by a commit ID.
–
Feature Addition: To introduce new features or add support for a new entity (e.g., drivers, modules). The commit messages usually contain keywords such as “add,” “support,” “implementation,” “introduce,” “initial,” and so forth.
–
Merging: To merge commits without adding new code.
–
Documentation: To add new technical documentation, code annotation, or “README” files. They usually do not introduce any new functionalities.
–
Optimization: To optimize the current kernel code. The optimizations are usually done by refactoring or cleaning up code, changing configurations, or rewriting existing APIs. The commit messages usually contain keywords such as “better,” “rework,” “refactor,” “cleanup,” and so forth.
–
Testing: To add (unit) test code that evaluates the functionality, performance, and robustness of code.
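The keyword hints listed above could drive a first-pass triage of commit titles, sketched below. This is only a hypothetical heuristic, not the labeling procedure of this study: the function name is invented, only the three categories with listed keywords are covered, matching is on exact lowercase words (so “fixes” in a title does not match “fix”), and anything unmatched returns `None` and is left for manual inspection.

```python
import re

# Keywords taken from the category descriptions; the mapping itself is a sketch.
KEYWORDS = {
    "correction": {"fix", "patch"},
    "feature addition": {"add", "support", "implementation", "introduce", "initial"},
    "optimization": {"better", "rework", "refactor", "cleanup"},
}


def guess_purpose(title, message=""):
    """Return a tentative category for a commit, or None if no hint matches."""
    # A "Fixes:" tag line in the message body is treated as a correction hint.
    if any(line.lower().startswith("fixes:") for line in message.splitlines()):
        return "correction"
    words = set(re.findall(r"[a-z]+", title.lower()))
    for category, keywords in KEYWORDS.items():
        if words & keywords:
            return category
    return None  # undecided: needs a human reader


print(guess_purpose("HID: corsair: Add Corsair Vengeance K90 driver"))
```

Such a heuristic can only pre-sort commits; titles that mix several keywords, or use none, still require reading the message and the code.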
We labeled the 1,240 KVICs and classified them into the above-mentioned categories in three steps. First, we check the title of each commit, as it usually provides enough information to determine its category. For example, the commit 6f78193ee9ea has the title “HID: corsair: Add Corsair Vengeance K90 driver,” which we can easily label as “feature addition.” Second, if the title is not clear enough, we read the commit message carefully to understand the functionality of the commit. We are able to label the purposes of most commits with these two steps. Third, if we are still unsure of the purpose of the commit, we check the commit code in detail. In this way, we are able to label the purposes of all KVICs. We also sampled 1,240 general commits for comparison, since labeling the purpose of all the general commits is too time-consuming. Our previous evaluation (Section 4.2) shows that the sampled commits are representative.
Table 4 shows the result. We notice that the KVICs mainly fall into three categories. This makes sense, as merging does not introduce new lines of code, while documentation works on non-functional code, which should not induce new vulnerabilities. Testing is for testing purposes and is not likely to induce vulnerabilities, either. In particular, we notice that feature addition accounts for 50.5% (626/1,240) of the KVICs, but only 19.2% (238/1,240) of the sampled commits, which is a huge difference. This indicates that adding new features or modules is associated with a higher chance of inducing vulnerabilities and deserves maintainers’ attention. We also found that 19.0% (236/1,240) of the KVICs aim to fix bugs. Although the reasons why fixes can induce bugs have been well studied [52], this problem still exists in the Linux kernel and can induce vulnerabilities.
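To make the size of the gap in the “feature addition” share concrete, the two proportions can be compared with a standard two-proportion z-test. The sketch below uses only the counts reported above; the choice of test and the function name are our assumptions for illustration and are not necessarily the statistical procedure used in the study.

```python
from math import erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # P(|Z| > |z|) for a standard normal Z
    return p1, p2, z, p_value


# Feature-addition counts: 626 of 1,240 KVICs vs. 238 of 1,240 sampled commits.
p_kvic, p_general, z, p_value = two_proportion_z(626, 1240, 238, 1240)
print(f"KVICs: {p_kvic:.1%}, sampled commits: {p_general:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
```

With equal sample sizes of 1,240, a difference of roughly 31 percentage points yields a very large z statistic, which is consistent with describing the difference as huge.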
7 Suggestions
Understanding the KVICs can help to build better defect prediction tools and reduce potential human errors. Based on the insights and findings that we derived, we propose several suggestions.
First, commits with specific characteristics (i.e., those that change many conditional statements or large files, or that have high entropy) are more likely to induce vulnerabilities and should be carefully reviewed.
Second, pay more attention to commits that add new features, as KVICs are characterized by having a much higher proportion of the “feature addition” category.
Third, involve more reviewers in commits, as a major contrast between KVICs and the general commits in terms of human factors is that KVICs have far fewer reviewers. Admittedly, there are many commits, and the number of reviewers is low relative to the number of commits; however, it is promising to see a clear trend of more reviewers being involved in commits over the past decade. On the other hand, we do not see a similar trend for KVICs, which means that KVICs still lack extensive reviewing, and this could be one reason behind the introduction of vulnerabilities.
Fourth, commits that change particular modules (i.e., init, net, kvm, mm, and fs) should be reviewed more carefully, since we find that such commits have a much higher chance of inducing defects than others.
Fifth, involve more people with higher commit knowledge in the patch process. We notice that the maintainer of a commit may not always have the highest commit knowledge. In this case, authors or maintainers could consult people (i.e., reviewers) with higher commit knowledge for feedback. While it might be unrealistic to expect the most knowledgeable maintainer to always be available for involvement, our intention is not to guarantee their constant availability. Rather, our proposed commit knowledge scoring system can recommend maintainers with higher levels of knowledge compared to random assignment. By considering the commit knowledge, we anticipate an improvement in the selection of maintainers, increasing the likelihood of involving individuals who possess greater knowledge and expertise in the patch process, reducing the induction of vulnerabilities.
Sixth, implement a stricter commit messaging policy, under which developers use particular tags (e.g., “Fixes,” “Reviewed-by”) to categorize commits and refer to the CVE ID when patching vulnerabilities. Although including fix tags in commit messages is already part of the Linux kernel’s patch submission guidance, our analysis revealed that most KVFCs did not adhere to this practice. Specifically, out of the 2,290 identified KVFCs, only 312 included fix tags. This observation underscores the need to explore alternative methods to raise awareness of fix tags among Linux kernel maintainers. Doing so would provide rich structural information about commits for future research and commit reviews. It is worth noting that this suggestion has also been proposed in an existing study [31].
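A commit-message check of the kind suggested here could look like the following sketch. The regular expressions approximate the tag layout described in the kernel's patch submission guidance (a “Fixes:” line followed by an abbreviated commit hash, and “Reviewed-by:” trailer lines); the function name and the sample message, including its hash, CVE ID, and reviewer, are hypothetical.

```python
import re

FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{7,40})\b", re.MULTILINE | re.IGNORECASE)
REVIEWED_RE = re.compile(r"^Reviewed-by:\s*(.+)$", re.MULTILINE | re.IGNORECASE)
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def audit_message(message):
    """Extract the structural information a stricter policy would require."""
    return {
        "fixes": FIXES_RE.findall(message),
        "reviewed_by": [name.strip() for name in REVIEWED_RE.findall(message)],
        "cve_ids": CVE_RE.findall(message),
    }


# Hypothetical commit message, for illustration only.
msg = """example: validate length before copy

This addresses CVE-2099-0001.

Fixes: 0123456789ab ("example: add copy helper")
Reviewed-by: Jane Doe <jane@example.org>
"""
print(audit_message(msg))

# Share of identified KVFCs that carried a fix tag (counts from the text above).
print(f"{312 / 2290:.1%}")  # 13.6%
```

A check like this could run in a submission hook or a review bot and flag security fixes whose messages lack a fix tag or a CVE reference.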
Seventh, researchers are encouraged to develop new algorithms for identifying KVICs automatically. This would help collect more data and contribute to a deeper understanding of vulnerabilities. Our study of KVICs provides both a comprehensive dataset and useful metrics that can be adopted by related work in the future.
8 Discussion
Threats to Validity. Threats to validity mainly lie in the identified KVICs. First, our data can have false negatives, as some KVICs may not be referred to in the commits or listed on the vendors’ sites. However, this does not constitute a significant threat, as our major goal is to study the general characteristics of KVICs. Since we collected many KVICs without bias and arrived at statistically significant conclusions, a few missing KVICs will not overturn our results. Admittedly, a more relaxed methodology could find a higher number of KVFCs and KVICs, but it could in turn admit more false positives, which would bias the results of our study. Second, our data may have false positives. To reduce this threat, we employed various data collection methods and checked for conflicts among them. We only adopted the data that received unanimous agreement from all methods (including an existing dataset curated by other researchers), and used a two-author manual verification process, which requires minimal manual effort, to resolve conflicts. This greatly reduces false positives and makes our analysis of KVICs more robust. Third, we discuss the quality of our manual collection process. Manual effort was involved for only 20% (260/1,240) of the KVICs and 23% (538/2,290) of the KVFCs; most of the collection process was automated. For the manual part, two authors conducted the labeling individually, a practice used in previous studies [20]. The final label was decided only upon both authors’ mutual agreement.
Generalizability of Our Findings to Other Open-Source Software (OSS) Projects. We focus on the Linux kernel because it is one of the largest and most popular open-source software projects, with a well-organized structure and maintenance process. However, our analysis framework, derived insights, and proposed suggestions can be applied to other open-source software projects that follow a development process similar to that of the Linux kernel. We believe that our findings are inherent to this development process rather than to a particular software project. For example, our finding that KVICs in the Linux kernel lack reviewing should raise concerns for developers of other projects, as the significance of commit reviewing is project-independent. Similarly, our study of other aspects of KVICs (e.g., complexity, entropy, purpose, and human factors) can also provide valuable insights to other OSS projects.
Novelty of Our Work. First, almost all previous studies focused on bugs, while we examine KVICs. This makes our work fundamentally different from previous ones. Unlike bugs, which are merely functional deviations from expectations, vulnerabilities can be exploited by attackers with serious consequences. Second, we propose a systematic methodology for collecting KVICs and construct the most complete dataset. Furthermore, we will publish the whole dataset to the community to motivate future research. Third, we evaluate the characteristics, purposes, and human factors of KVICs from many different perspectives with different metrics. Some of them (e.g., commit knowledge) are newly proposed in this article. Though not all of the findings are surprising, we support them with solid, quantitative data. To this end, we believe that our findings and the proposed dataset are valuable to the whole community.
Applicability of Our Findings to the Linux Kernel Maintenance. In this article, we proposed many insights for the Linux kernel, which can be applied in future kernel development and maintenance. First, we found that many commits did not follow the suggestion of the kernel patch guide [8] that specific tags (e.g., “Fixes”) should be provided. If this lack of structural information is remedied, future work on vulnerability analysis and timely detection will be easier to carry out. This finding urges the kernel maintainers to develop a scheme that enforces stronger structural information in commit messages. Second, the dataset that we release will help kernel maintainers and researchers with further vulnerability detection and analysis. In particular, our insights will also help with feature selection for tools such as defect predictors. Third, our proposed commit knowledge metric can be adopted by the kernel maintainers to build a reviewer assignment system. This can be done by keeping track of the knowledge of reviewers and pairing the most knowledgeable reviewers with each commit. Our finding of a noticeable reviewer knowledge gap also motivates the implementation of such a tool.
9 Related Work
Vulnerability Empirical Study. Many empirical studies have been conducted on software defects. For example, Yin et al. [
52] studied how fixes become bugs. They analyzed many bugs and found that the fixer’s knowledge is one of the reasons for inducing bugs. Our study focuses on vulnerabilities instead of bugs and proposes the concept of commit knowledge, which is more specific. The bugs of different systems have also been widely studied (e.g., TensorFlow [
53], autonomous driving software [
20], GPU [
51]). Meanwhile, researchers also studied other types of bugs [
15,
20,
22,
38,
47,
53]. Unlike bugs, which are functional deviations from expectations, vulnerabilities can be exploited by attackers with serious consequences. Alexopoulos et al. [
12] studied vulnerabilities with a focus on their lifetime. Other empirical studies [
29,
41] focus on security patches across various platforms, which differs from our study in that we mainly look at the KVICs instead of patches.
Vulnerability Dataset. Multiple datasets have been proposed by the research community for different OSS systems. Alexopoulos et al. [12] built a dataset that maps CVEs to their fixing commits, and estimated vulnerable commits via a “git blame”-based approach. Nikitopoulos et al. [36] built a dataset targeting machine learning applications, which only maps KVFCs to the modified files, instead of commits. Zhou et al. [54] used two deep neural networks to identify KVFCs, and found many patches from real-world projects. However, these works mainly focus on building datasets of patches that fix vulnerabilities. Our work complements the existing ones by presenting a large dataset of KVICs, coupled with their corresponding fixing commits and CVE IDs.
Software Defect Prediction. There are many works studying how to detect software defects [
19,
21,
25,
26,
30,
32,
35,
37,
44,
46,
48,
54]. Most of them use machine learning algorithms including decision trees, SVM, and deep neural networks. Our empirical study contributes to feature-based approaches (e.g., decision trees) by providing valuable insights on the characteristics of KVICs, and to data-driven approaches (e.g., deep neural networks) by building a large dataset of KVICs with their associated CVEs and KVFCs. In addition, our proposed analysis framework and the associated metrics provide insights on evaluating and improving the existing vulnerability prediction tools.
Vulnerability Origin Detection. There are many works [
16,
18,
27,
39,
42,
50] proposed to locate the origin of vulnerabilities or bugs. However, most of them have been shown not to work well in practice [49]. Meanwhile, tools like V0Finder [50] aim to locate the affected software instead of vulnerability-inducing commits. Recently, SZZUnleashed [14] provided an open-source implementation of the SZZ algorithm, while the state-of-the-art tool V-SZZ [13] focuses on identifying vulnerability-inducing commits based on the previous model [39]. However, we found that neither SZZUnleashed nor V-SZZ works well at identifying KVICs. Specifically, we fed the KVFCs of our 1,240 identified KVICs to these tools. SZZUnleashed and V-SZZ identified only 201 (16.2%) and 696 (56.1%) of them, respectively. Note that we randomly selected 20 samples from the remaining KVICs that these tools could not identify and manually verified that all of them were indeed missed. Our investigation shows that existing vulnerability origin detection tools still have considerable room for improvement. Our identified KVICs can serve as a good dataset and help speed up further research.
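The percentages above are recall values against our KVIC set. The sketch below shows the computation on commit-ID sets; the function name and the toy commit IDs are hypothetical, and only the counts 201, 696, and 1,240 come from the evaluation described above.

```python
def recall(ground_truth, reported):
    """Fraction of ground-truth inducing commits that a tool also reports."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must not be empty")
    return len(truth & set(reported)) / len(truth)


# Toy example with made-up commit IDs: the tool finds 2 of 4 known KVICs
# and reports one extra commit, which does not affect recall.
known_kvics = {"c1", "c2", "c3", "c4"}
tool_output = {"c2", "c4", "c9"}
print(recall(known_kvics, tool_output))  # 0.5

# Reported counts: KVICs identified out of 1,240.
for tool, found in [("SZZUnleashed", 201), ("V-SZZ", 696)]:
    print(f"{tool}: {found / 1240:.1%}")
```

Recall alone says nothing about commits a tool reports wrongly, so a full comparison would also need the precision of each tool on the same inputs.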