Understanding Vulnerability Inducing Commits of the Linux Kernel

Authors:

Tao Wu,

Yajin Zhou

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 7

Article No.: 170, Pages 1 - 28

https://doi.org/10.1145/3672452

Published: 27 September 2024

Abstract

The Linux kernel is popular and well-maintained. Over the past decade, around 860 thousand commits were merged with hundreds of vulnerabilities (i.e., 223 on average) disclosed every year, taking the total lines of code to 35.1 million in 2022. Many algorithms have been proposed to detect the vulnerabilities, but few studies have examined how they were induced. To fill this gap, we conduct the first empirical study on the Kernel Vulnerability Inducing Commits (KVICs), the commits that induced vulnerabilities in the Linux kernel. We utilized six different methods for identifying the Kernel Vulnerability Fixing Commits (KVFCs), the commits that fix vulnerabilities in the Linux kernel, and proposed four further methods for identifying KVICs by using the identified KVFCs as a bridge. In total, we constructed the first dataset of KVICs with 1,240 KVICs for 1,335 CVEs. We conducted a thorough analysis of the characteristics, purposes, and involved human factors of the KVICs and obtained many interesting findings and insights. For example, KVICs usually have limited reviewers and can still be induced by experienced authors or maintainers. Based on these insights, we proposed several suggestions to the Linux community to help mitigate the induction of KVICs.

1 Introduction

The Linux kernel is popular, well-maintained, and widely used. As one of the largest open-source software projects, the Linux kernel has many vulnerabilities induced among the huge number of commits [26]. As shown in Figure 1, as the Linux kernel is seeing a steadily increasing number of commits, there are also a number of vulnerabilities (i.e., around 100 to 500) induced per year. Previous works [32, 35, 44, 54] proposed many different algorithms to detect these vulnerabilities. However, due to the complexity of the Linux kernel and the limitations of the proposed tools, many vulnerabilities still evade detection. As a result, vulnerabilities persist in the Linux kernel for quite a long time (i.e., \(1{,}732.97\) days on average [12]) before being detected and patched, which makes the kernel insecure. Thus, understanding Kernel Vulnerability Inducing Commits (KVICs) is important. It can help increase the precision of defect prediction [25, 33, 34, 37, 46] and prevent vulnerabilities from being induced, making the kernel more robust, reliable, and secure. This motivates our study on KVICs. We believe it can provide deep insights and solid guidance for future researchers and help enhance existing defect prediction models on

Fig. 1.

It is non-trivial to understand KVICs. First, there are millions of commits in the Linux kernel, without any explicit flag indicating if one commit is a KVIC. Manually identifying KVICs is extremely time-consuming and error-prone. Meanwhile, existing algorithms (e.g., SZZ [42]) do not work well in practice [49]. Second, KVICs may vary a lot with different attributes. It is not an easy task to determine from which perspective to study and understand them.

To this end, we carefully designed steps for collecting KVICs. Specifically, we first identify the Linux kernel vulnerabilities from the Common Vulnerabilities and Exposures (CVE) list [4], which is an enumeration of CVE IDs and their corresponding vulnerabilities. Since direct identification of KVICs via CVE IDs is rather difficult, we locate the Kernel Vulnerability Fixing Commits (KVFCs) first and then use them as a bridge to pinpoint KVICs. We propose a semi-automatic methodology by combining six different methods.

Specifically, we located the KVFCs from the commit message (Section 3.2.1), commit URL (Section 3.2.2), commit tag (Section 3.2.3), downstream vendors (Section 3.2.4), the research community (Section 3.2.5), and the Linux Kernel CVEs project (Section 3.2.6). After that, we merged the obtained KVFCs and manually verified the conflicts (if any). Finally, we located 2,290 KVFCs for 2,120 CVEs of the Linux kernel from July 2011 (release date of version 3.0) to October 2022, approximately spanning the range of the past decade. We thereafter propose four different methods for identifying KVICs from the obtained KVFCs. Specifically, we locate the KVICs by checking the “Fixes” tag (Section 3.3.1), searching for the commit IDs from commit messages (Section 3.3.2), referring to specific vendor information (Section 3.3.3), and the Linux Kernel CVEs project (Section 3.3.4). In this case, we collected 1,240 KVICs for 1,335 CVEs. Finally, we aim at understanding the KVICs by answering the following three Research Questions (RQs):

– What are the characteristics of the KVICs?

– What are the purposes of KVICs?

– How do the experience, expertise, and roles of the involved persons influence KVICs?

Our empirical study on the KVICs results in many useful insights and findings. First, we find that KVICs usually change more lines, conditional statements, and files compared to non-KVICs, and the changed files are more complex than general files. Second, we notice that changing particular modules (e.g., network, file system, kvm, and init) is more likely to induce vulnerabilities. Third, we find that about half of the KVICs aim to add new features, and commits attempting to fix bugs can still induce vulnerabilities. Finally, we explore the roles of people involved in the Linux kernel patch process, and noticed that KVICs have a limited number of reviewers. We also examine the number of reviewers along the time scale and find that KVICs do not follow the trend of steadily involving more reviewers over the past decade. We also propose the concept of commit knowledge and find that maintainers may not actually be very familiar with the commit they are working with.

Based on the above insights, we propose several suggestions, which could be applied in future kernel development and maintenance to help reduce the induction of vulnerabilities. First, commits with specific characteristics (e.g., changing many conditional statements) should be carefully reviewed. Second, commits that add new features deserve more attention. Third, more reviewers should be involved in the commits, instead of merely carbon copying the commits to other people. Fourth, commits that change particular modules (i.e., init, net, kvm, mm, and fs) should be more carefully reviewed. Fifth, more people with higher commit knowledge should be involved. Sixth, developers are encouraged to use particular tags (e.g., “Fixes,” “Reviewed-by”) for categorizing commits and to refer to the CVE ID when patching vulnerabilities. Finally, researchers are encouraged to develop new algorithms for identifying KVICs automatically, based on our dataset and proposed metrics.

To conclude, we make the following contributions in this article:

Systematic and Sound KVIC Identification Approach. We propose a semi-automatic approach to identify KVICs by using KVFCs as a bridge with a rather low false positive rate. In particular, a total of 10 methods (6 for identifying KVFCs and 4 for identifying KVICs) are proposed.

Public Dataset. Using our identification approach, we identify 1,240 Linux kernel KVICs from 1,335 CVE IDs, with 2,290 KVFCs as the bridge, which spans the time period over the past ten years. We will release it to the community for further research.

Comprehensive Study. We conduct a comprehensive study on 1,240 KVICs and quantify their particular characteristics. This provides a deeper understanding of KVICs in the Linux kernel.

Insightful Findings and Suggestions. We obtain many interesting findings and insights, which enable us to propose several suggestions for reducing the induction of vulnerabilities.

2 Background

2.1 Terms

To make the paper easier to follow, we define the following terms:

KVFC. KVFC denotes the Kernel Vulnerability Fixing Commit. It aims to fix specific vulnerabilities in the Linux kernel. One vulnerability usually maps to one KVFC. However, due to incomplete fixes, one vulnerability may map to more than one KVFC.

KVIC. KVIC represents the Kernel Vulnerability Inducing Commit. It can be a commit with any purpose that induces vulnerabilities in the Linux kernel. For example, CVE-2021-28691 is a vulnerability that was caused by the commit 2ac061ce97f413bfbbdd768f7d2e0fda2e8170df (i.e., a KVIC), and was fixed by the commit 107866a8eb0b664675a260f1ba0655010fac1e08 (i.e., a KVFC). In particular, the KVIC attempted to clean up the existing code and deleted some important features, which led to a use-after-free issue. Accordingly, the KVFC re-introduced some of the core features and fixed the use-after-free issue. Usually, one KVIC maps to one vulnerability. However, some KVICs can induce more than one vulnerability and require multiple commits to fix. Similarly, one KVFC can map to several KVICs as it can fix multiple vulnerabilities within one commit. In this article, KVIC is our main study target.

General Commits. In this article, we specifically study commits starting from version 3.0 of the Linux kernel, since older versions are no longer maintained and are rarely used. To this end, we refer to all commits from version 3.0 onward as the general commits, and compare KVICs to this population to understand how KVICs differ from the general commits of the Linux kernel.

2.2 The Linux Kernel Patch Process

To make the paper easier to follow, we first summarize the following terms (more details are given later in this section):

Author. The author of a commit is the one who submits a commit to the Linux kernel.

Maintainer. The maintainer of a commit is the one who checks the proposed patch, discusses concerns with authors, and merges the patch to the kernel codes.

In order to better understand how KVICs are induced, we need to understand the Linux kernel patch process [8]. The Linux community has compiled a set of guidelines on the kernel development process to facilitate the management of such a complex system. In particular, when submitting patches, the authors first need to carefully describe the purpose, try to make the patch focus on a specific purpose in case of a potentially large commit, and carefully check the format. After that, authors can send the commit to the kernel list (e.g., [email protected]). Meanwhile, authors are also required to send the patch to the Linux maintainers due to the large volume in the mailing list. After receiving the patch, the maintainer will check the proposed patch and discuss with authors if there are any concerns. After several iterations of discussion, the maintainers can merge the patch to the kernel if all the concerns are addressed.

3 Data Collection

To understand the KVICs, we first need to collect the dataset. However, the major difficulty comes from the lack of an explicit flag indicating that a commit is a KVIC. Furthermore, vulnerabilities vary a lot, making it non-trivial to find a mechanism that can identify the KVICs automatically and precisely, which is still an open question. To address these challenges, we collect the KVICs in three steps.

Figure 2 shows the process. First, we need to collect the vulnerabilities. We use the CVE list [4] as the database and filter out the vulnerabilities of the Linux kernel by searching for related keywords (Section 3.1). This can help to identify the candidate CVE IDs for the Linux kernel. Given one vulnerability (CVE ID) of the Linux kernel, there are no direct approaches to infer the corresponding KVIC. This is because the correlation between the KVIC and the vulnerabilities is indirect and implicit. To solve this problem, we use the KVFCs, which are identified with six different methods (Section 3.2), as the bridge. After identifying KVFCs, we propose four different methods to identify the KVICs (Section 3.3).

Fig. 2.

3.1 Kernel Vulnerability Collection

The CVE list is built by the CVE Numbering Authorities, which are responsible for assigning and publishing CVEs. Thus, it is the official source of CVE data and should have no false negatives. We use the keyword Linux kernel to search for vulnerabilities of the Linux kernel and locate 3,464 CVEs. Note that we disregard CVEs assigned before July 2011, as Linux kernel versions before 3.0 are no longer maintained. This leaves 2,694 CVEs in total (from 2011 to October 2022). Though other software (e.g., the Android kernel) may also contain the related keywords (i.e., Linux kernel), the false positives can be filtered out during the process of identifying KVFCs and KVICs.

3.2 KVFC Identification

After identifying the CVEs of the Linux kernel, we need to identify the KVFCs. Given one CVE, the major difficulty comes from identifying the KVFC, as there is no explicit relationship between a CVE and the corresponding fix commit. Though the commit message usually indicates the purpose of the commit, not all authors explicitly state that the commit targets a specific CVE. One possible reason is that the authors were not aware of the existence of a corresponding CVE. Furthermore, reading the commit message requires manual effort and is error-prone. Different sites (e.g., NVD [9]) or vendors (e.g., Ubuntu [10]) may maintain information on CVEs. However, the provided information is incomplete and the sources may conflict with each other. In this case, we cannot completely trust them.

To address the above-mentioned challenges, we collect the KVFCs with six different methods (i.e., \(FM1\), \(FM2\), \(FM3\), \(FM4\), \(FM5\), and \(FM6\)). Each method identifies KVFCs for a set of CVEs. After that, we merge the results obtained from the six different methods. For one specific CVE, if there are KVFCs identified by different methods, we check whether they conflict. If so, we manually verify them. To make the manual verification process reliable, two authors of this article manually verify the conflicting results separately. After that, the manually checked results from the two authors are compared. If the results are consistent, they are kept. Otherwise, the two authors discuss and check the detailed commit content to make the final decision. In the following, we detail how each method works and the results.

3.2.1 KVFC Identification Method 1: Commit Message.

Commit messages in a version control system provide a detailed description of the corresponding commits [11, 45], where developers may explicitly claim a commit is to fix a specific CVE. With this observation, we first search all commit messages of the Linux kernel and check whether they contain the CVE IDs we obtained in Section 3.1. After that, we conduct further filtering by selecting the ones that contain specific keywords (i.e., fix(ed)/patch(ed)/mitigate(d)) of fixing purposes. In total, we find 443 commits that contain CVE IDs. Among the 443 commits, 195 of them have the required keywords (i.e., fix(ed)/patch(ed)/mitigate(d)) in the same sentence containing the CVE ID. These commits are treated as the KVFCs. For the remaining 248 commits, we manually verify the result. Finally, we have 312 KVFCs for 288 CVEs with method 1.
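The keyword filter of method 1 can be illustrated with a minimal Python sketch. It is not the tooling used in this study: the function name `classify_commit`, the sentence-splitting rule, and the made-up CVE IDs in the usage note are assumptions for illustration only.

```python
import re

# CVE IDs look like CVE-2021-28691; the keyword list mirrors the one in the text.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
FIX_RE = re.compile(r"\b(?:fix(?:ed)?|patch(?:ed)?|mitigated?)\b", re.IGNORECASE)
# Assumption: a "sentence" ends at ., ! or ? followed by whitespace, or at a blank line.
SENTENCE_SPLIT = re.compile(r"[.!?]\s+|\n\s*\n")


def classify_commit(message, known_cves):
    """Return (cve_ids, auto_accept) for one commit message.

    cve_ids     -- known CVE IDs mentioned anywhere in the message
    auto_accept -- True if some sentence holds both a known CVE ID and a fix
                   keyword; otherwise the commit is left for manual review
    """
    found = {m.upper() for m in CVE_RE.findall(message)} & set(known_cves)
    if not found:
        return set(), False
    for sentence in SENTENCE_SPLIT.split(message):
        ids_here = {m.upper() for m in CVE_RE.findall(sentence)} & found
        if ids_here and FIX_RE.search(sentence):
            return found, True
    return found, False
```

For instance, a message such as "net: fix use-after-free (CVE-2099-0001)." would be accepted automatically, whereas "Rework locking. See CVE-2099-0001 for background." mentions the ID without a keyword in the same sentence and would be queued for manual verification.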

3.2.2 KVFC Identification Method 2: Commit URL.

The CVE Program [4] is an authority that maintains related information for CVEs. In particular, it provides references for each CVE in the form of links pointing to blogs, email histories, fix commits, and so forth. We build a crawler to automatically examine each CVE listed by the CVE Program, extract the ones that represent a commit for the Linux kernel, and treat them as fix commits. Specifically, the link should start with http://git.kernel.org/pub/scm/linux/kernel/git/ or https://github.com/torvalds/linux/ and contain the Secure Hash Algorithm-1 (SHA-1) value indicating a specific commit. For example, https://github.com/torvalds/linux/commit/dd504589577d is a commit for fixing the CVE-2015-8970 [2]. In total, we find 1,359 CVEs, which map to 1,427 KVFCs. Note that method 2 may have false positives as the CVE can have some insidious misinformation or the repository history gets rewritten, leading to obsoleted commit hashes. However, to make the holistic results reliable, we have further verification by checking for unanimous agreement among all our KVFC identification methods.
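The link filter described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the crawler used in the study: the helper name `extract_kvfc` is hypothetical, and accepting abbreviated hashes of 7 to 40 hexadecimal characters is our own choice.

```python
import re

# The two URL prefixes named in the text.
KERNEL_PREFIXES = (
    "http://git.kernel.org/pub/scm/linux/kernel/git/",
    "https://github.com/torvalds/linux/",
)
# Assumption: a commit reference is a standalone run of 7 to 40 hex characters.
SHA1_RE = re.compile(r"\b[0-9a-f]{7,40}\b", re.IGNORECASE)


def extract_kvfc(url):
    """Return the commit hash referenced by a kernel commit link, else None."""
    if not url.startswith(KERNEL_PREFIXES):
        return None
    hits = SHA1_RE.findall(url)
    # The hash is normally the last component of the link.
    return hits[-1].lower() if hits else None
```

Applied to https://github.com/torvalds/linux/commit/dd504589577d, the sketch returns dd504589577d, while links outside the two prefixes are ignored.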

3.2.3 KVFC Identification Method 3: Commit Tag.

Apart from the CVE Program, NVD [9] also provides some reference links for each CVE. Furthermore, NVD assigns a tag named “Patch” to commits that are used to fix the corresponding CVE. We first find candidate KVFCs for each CVE by extracting links with the “Patch” tag. After that, we check whether the link represents a commit for the Linux kernel by using the same policy as introduced in Section 3.2.2. If so, we treat it as a KVFC. With this method, we automatically locate 1,127 KVFCs, which map to 1,066 CVEs.

3.2.4 KVFC Identification Method 4: Vendors.

We surveyed many popular vendors including Debian, SUSE, ArchLinux, Red Hat, Gentoo, Android, and Ubuntu. We noticed that only Ubuntu [10] and Android [1] provide explicit mapping relationships between CVEs and their corresponding KVFCs. The maintained information is reliable as it has been verified by the vendors’ security experts. Thus, we utilize the information provided by these two vendors.

Specifically, Android releases security bulletins that include the CVEs related to the upstream kernel (i.e., the Linux kernel). The corresponding fix commit is usually attached. Ubuntu maintains a mapping between KVFCs and their corresponding KVICs for each CVE, as shown in Figure 3. From the Android Security Bulletins, we located 144 KVFCs, which map to 135 CVEs. From Ubuntu, we located 1,920 KVFCs, which map to 1,687 CVEs.

Fig. 3.

3.2.5 KVFC Identification Method 5: Research community.

Besides the above methods, we also consider the manually curated mappings from CVE IDs to KVFCs from the research community. Alexopoulos et al. [12] built a dataset of KVFC mappings to study the lifetime of vulnerabilities and released it to the public. We incorporated it into our research and obtained 1,473 CVEs that map to 1,528 KVFCs.

3.2.6 KVFC Identification Method 6: Linux Kernel CVEs Project.

The Linux Kernel CVEs project [7] is another source that keeps track of KVFCs. According to its official introduction, its KVFCs were automatically generated through a set of tools that includes unpublished ones. We incorporated the data from this project into our research and obtained 1,959 CVEs that map to 1,863 KVFCs.

3.2.7 Determining final KVFCs.

To make our collected data reliable, we combine the results extracted from the six methods and compare them with each other. If the results of at least two methods conflict, we manually check them, which reduces the false positives. Our empirical study tolerates false negatives, as the data collection is not biased toward any particular feature of the commits themselves. Thus, the lessons learned and the insights obtained can represent the whole population of KVICs in the Linux kernel. Table 1 shows the final results. In total, 2,290 commits are identified as KVFCs for 2,120 CVEs. Among the 2,290 KVFCs, 538 that map to 278 CVEs were verified manually due to conflicts between different methods. Note that this is a relatively low rate of disagreement; thus, our method does not require a significant amount of manual effort and can scale to other projects. Among the 278 CVEs, 221 have two different results across the six methods, while 53 CVEs (i.e., 19.06%) have three different results. Only 4 CVEs (i.e., 1.44%) have four different results, suggesting that our methods agree on most of the CVEs.

Table 1.

	Description	Identified KVFC	CVEs
\(FM1\)	Commit Message	312	288
\(FM2\)	Commit URL	1,427	1,359
\(FM3\)	Commit Tag	1,127	1,066
\(FM4_{a}\)	Android Information	144	135
\(FM4_{b}\)	Ubuntu Information	1,920	1,687
\(FM5\)	Fellow Researchers	1,528	1,473
\(FM6\)	Linux Kernel CVEs Project	1,863	1,959
Merged Total	N/A	2,290 (538)	2,120 (278)

Table 1. Summary of the KVFC Collection
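The merging step of Section 3.2.7 can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the function name `merge_methods` and the input layout are hypothetical, and we assume that commit IDs have already been normalized to the same length so that set equality is meaningful.

```python
from collections import defaultdict


def merge_methods(results):
    """Merge per-method CVE -> commit mappings and flag disagreements.

    results: {method_name: {cve_id: set_of_commit_ids}}
    Returns (agreed, conflicts):
      agreed    -- {cve_id: commit set} where all reporting methods agree
      conflicts -- {cve_id: {method_name: commit set}} left for manual checks
    """
    per_cve = defaultdict(dict)
    for method, mapping in results.items():
        for cve, commits in mapping.items():
            per_cve[cve][method] = frozenset(commits)

    agreed, conflicts = {}, {}
    for cve, answers in per_cve.items():
        distinct = set(answers.values())  # number of different results
        if len(distinct) == 1:
            agreed[cve] = set(next(iter(distinct)))
        else:
            conflicts[cve] = answers
    return agreed, conflicts
```

The size of `distinct` for a conflicting CVE corresponds to the "two, three, or four different results" counted above; a method that does not report a CVE at all is not treated as a disagreement in this sketch.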

3.3 KVIC Identification

After collecting the KVFCs, we now describe how the KVICs are identified. Similar to identifying KVFCs, there is no explicit relationship between a KVFC and its corresponding KVIC. Meanwhile, existing algorithms (e.g., SZZ [42]) do not work well in practice [49]. Furthermore, many sites that maintain the CVE information do not provide information for KVICs. In this case, we rely on KVFCs as the bridge and utilize four different methods to identify the KVICs. We also merge the results from the four methods and accept them under unanimous agreement. If there are conflicts between any two methods, we manually verify them. We detail the four methods in the following.

3.3.1 KVIC Identification Method 1: Fixes tag.

According to the Linux kernel documentation, when attempting to fix a bug induced by a specific commit, it is suggested to add a “Fixes:” tag with the first 12 characters of the SHA-1 ID of the KVIC. Utilizing this tagging convention, we can pinpoint the KVICs. In this case, we first match the keyword “Fixes:” in the commit message of the identified KVFCs. After locating “Fixes:” tags, we retrieve the SHA-1 ID that follows the tag, which represents the KVIC. Figure 4 shows an example, where given d1f82808877 as a KVFC for CVE-2021-3491, “ddf0322db79c” is identified as a KVIC since it appears after a “Fixes:” tag in the commit message of the KVFC. With this method, we locate 624 KVICs for 612 CVEs. Note that there are more KVICs than CVEs since KVICs and CVEs are not in one-to-one correspondence, i.e., one KVIC could correspond to more than one vulnerability, and vice versa. However, many authors may not strictly follow the aforementioned tagging convention, since it is not a mandatory requirement. In this case, many of the KVICs cannot be found by searching for the keyword Fixes. To make our study comprehensive, we need new methods to complement the existing result.

Fig. 4.
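Extracting the hash that follows a “Fixes:” tag (Section 3.3.1) can be sketched with a regular expression. The helper name `fixes_targets` is hypothetical, and accepting 7 to 40 hexadecimal characters instead of exactly 12 is our assumption, made because authors do not always follow the suggested length.

```python
import re

# A "Fixes:" trailer sits at the start of a line, followed by an abbreviated
# SHA-1 and, usually, the quoted subject of the referenced commit.
FIXES_RE = re.compile(r"^\s*Fixes:\s*([0-9a-f]{7,40})\b",
                      re.IGNORECASE | re.MULTILINE)


def fixes_targets(message):
    """Return all commit hashes referenced by "Fixes:" tags in a message."""
    return [h.lower() for h in FIXES_RE.findall(message)]
```

Given a KVFC message containing the line `Fixes: ddf0322db79c ("...")`, the sketch yields `["ddf0322db79c"]`, which is then treated as the candidate KVIC.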

3.3.2 KVIC Identification Method 2: Commit ID.

Though the commit message may not contain a “Fixes” tag, the authors may discuss how the vulnerability was induced in the commit message by referring to a specific commit ID. In this case, we scan all the Linux kernel commits and check whether they contain any SHA-1 hash value. If so, we further check whether it represents a commit ID. Figure 5 shows an example. Though the author does not add the “Fixes” tag in the message of commit 3f190e3aec21, the author discusses how the vulnerability (i.e., CVE-2017-8063) was induced. According to line 5, commit 17ce039b4e54 is identified as the KVIC. With this method, 145 KVICs are located, which map to 133 CVEs.

Fig. 5.
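The two checks of Section 3.3.2 (find hash-like strings, then confirm that they denote commits) can be sketched as follows. This is an illustration only: the function name `referenced_commits` is hypothetical, and we assume that the full 40-character IDs of all kernel commits are available as a list, whereas a real implementation would more likely query the repository.

```python
import re

# Assumption: a candidate commit reference is a standalone run of 7 to 40
# hexadecimal characters.
HEX_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def referenced_commits(message, known_ids):
    """Return full commit IDs that a message refers to by (abbreviated) hash.

    known_ids: iterable of full 40-character commit IDs of the repository.
    A token counts only if it is the prefix of exactly one known commit.
    """
    known = list(known_ids)
    hits = {}  # dict keeps insertion order and removes duplicates
    for token in HEX_RE.findall(message.lower()):
        matches = [full for full in known if full.startswith(token)]
        if len(matches) == 1:
            hits.setdefault(matches[0], None)
    return list(hits)
```

Hex-looking tokens that match no commit (e.g., an address or a decimal constant) are discarded by the prefix check. The commits returned are only candidates; whether a referenced commit actually induced the vulnerability still has to be judged from the surrounding text.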

3.3.3 KVIC Identification Method 3: Vendors.

As shown in Figure 3, some downstream vendors (e.g., Ubuntu) also maintain the KVIC information. Thus, we crawled the information from their official site. Since the information maintained by Ubuntu may not always be correct, we first check whether the KVFC is identified correctly, i.e., whether it is included in our final 2,290 KVFCs. If not, the provided KVIC is not considered. In this case, we have 1,028 KVICs, which map to 1,071 CVEs.

3.3.4 KVIC Identification Method 4: Linux Kernel CVEs Project.

As discussed in Section 3.2.6, the Linux Kernel CVEs project maintains a public dataset of KVFCs. In addition, it also keeps track of KVICs, and we incorporate its data into our research. With this method, 1,036 KVICs are located, which map to 1,940 CVEs.

3.3.5 Determining final KVICs.

Similar to the steps of collecting KVFCs, we merge the KVICs obtained from the above-mentioned four methods and manually check the conflicts. Table 2 summarizes the results. In total, we identified 1,240 KVICs that induce 1,335 CVEs. About 260 KVICs that map to 213 CVEs are verified manually.

Table 2.

	Description	Identified KVIC	CVEs
\(IM1\)	Fixes Tag	624	698
\(IM2\)	Commit ID	145	136
\(IM3\)	Vendors	1,028	1,071
\(IM4\)	Linux Kernel CVEs Project	1,036	1,940
Total	N/A	1,240 (260)	1,335 (213)

Table 2. Summary of the KVIC Collection

It is worth noting that one CVE can map to more than one KVIC. This is because there are some cases where multiple KVICs collectively introduced one CVE vulnerability. For example, CVE-2021-23134 is a vulnerability caused by two KVICs (i.e., c33b1cc62ac05c1dbb1cdafe2eb66da01c76ca8d and 8a4cd82d62b5ec7e5482333a72b58a4eea4979f0), where these two commits both attempted to fix a refcount leak bug, but collectively introduced a use-after-free issue. This issue was later fixed by commit c61760e6940dd4039a7f5e84a6afc9cdbf4d82b6. Overall, most (i.e., 1,261) of the 1,335 identified CVEs correspond to only one KVIC, while only a few (i.e., 74) have multiple KVICs.

Note that there may be false negatives and false positives in the collected data. However, we tolerate the false negatives as we did not apply any preferences during the data collection process. Meanwhile, the identified 1,240 KVICs are representative enough for an empirical study. By carefully studying these KVICs, we can obtain insights that represent the whole population. The false positive ratio should be quite low as we use six different methods in the process of collecting KVFCs and four different methods in the process of collecting KVICs. We further cross-verified the results manually where conflicts exist. To avoid biased results from manual verification, all the verification processes are conducted by two authors separately.

In particular, we also counted the number of different results for each CVE. We find that for the KVICs, 201 out of the 213 conflicting CVEs have two different answers among the four methods. The other 12 CVEs have three different results among the four methods. Taking CVE-2019-3819 as an example, its associated KVFC is 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035. Our method 1 (i.e., Fixes tag) identifies cd667ce24796700e1a0e6e7528efc61c96ff832e and 717adfdaf14704fd3ec7fa2c04520c0723247eac as the KVICs, while methods 3 and 4 (i.e., Vendors and Linux Kernel CVEs Project) only identify 717adfdaf14704fd3ec7fa2c04520c0723247eac. This is potentially because one of the vendors missed one of the KVICs and the Linux Kernel CVEs Project then extracted this information from the vendor, so that both miss the extra KVIC. After manual inspection, we determined that both KVICs identified by method 1 should be included in the final dataset.

3.4 RQs

We answer three RQs to understand the KVICs.

First, KVICs may exhibit specific characteristics related to factors such as modified size, content, complexity, and the severity of induced CVEs. Gaining a comprehensive understanding of these characteristics is crucial for guiding future efforts in KVIC detection and deepening our understanding of their causes. This forms the basis for our first research question.

RQ1: What are the characteristics of the KVICs?

Second, different KVICs may have different purposes. To understand how a KVIC is induced, we need to understand their intended functionalities in the first place. This can help maintainers to better address what kinds of commits may be KVICs. Consequently, we study the following research question.

RQ2: What are the purposes of KVICs?

Third, open source software like the Linux kernel is highly complex and it is common for authors or maintainers to induce KVICs. The induced vulnerabilities vary a lot in terms of their root causes and it is difficult to propose a precise and general algorithm that prevents the induction of vulnerabilities. Hence, we focus on studying the human factors of KVICs, especially the involved persons and their familiarity with the KVICs. This provides insights on which aspects of the human factors play a more critical part in making a commit more or less secure. Based on this, we make suggestions on how to mitigate KVICs.

RQ3: How do the experience, expertise, and roles of the involved persons influence KVICs?

4 Characteristics of KVIC (RQ1)

In this section, we characterize KVICs from various perspectives, including the modified size, modified file’s complexity, KVICs’ complexity, modified contents, type of Common Weakness Enumerations (CWEs), and severity.

4.1 Modified Size of KVICs

Previous studies [33, 34] showed that adding or deleting more lines renders the commit more prone to induce defects as large modification is usually complex. To explore whether KVICs make large modifications, we compare the 1,240 KVICs with the general commits (Section 2.1). Figures 6–8 show the Cumulative Distribution Function (CDF) plot of this comparison, where the blue line with circular markers represents the KVICs and the orange line with triangle markers represents all the other commits. Figures 10–13, 18, and 19 follow the same format.

Fig. 6.

Fig. 7.

Fig. 8.

Fig. 9.

Fig. 10.

Fig. 11.

Fig. 12.

Fig. 13.

Fig. 14.

Fig. 15.

Fig. 16.

Fig. 17.

Fig. 18.

Fig. 19.

According to Figure 6, KVICs add and delete more lines compared with the general commits. Apart from the total lines of code, we also compare the changed conditional statements (if or loop). According to Figures 7 and 8, KVICs modify more conditional statements than general commits. In particular, there is a much larger gap in the case of the number of added conditional statements, compared with that of the deleted ones. We suspect that this is because the addition of conditional statements may not be fully tested, as they may introduce new logical flows with flaws that could evade existing tests. On the other hand, deleting existing statements does not drastically reduce the range of effectiveness of existing tests.

Furthermore, the total number of modified lines of code can introduce confounding effects. To mitigate this effect, we present the plots for the number of modified conditional statements, encompassing both if and loop statements, in Figure 9(a) and (b). These figures demonstrate that KVICs exhibit a higher proportion of modifications to both if and loop statements when normalized per 10,000 lines of modified code. This finding further strengthens our previous observations.
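The counting and normalization described above can be illustrated with a short Python sketch over a unified diff. It is an approximation under stated assumptions, not the measurement procedure of this study: the function name `count_conditionals` is hypothetical, conditional statements are recognized only through the keywords `if`, `for`, and `while` followed by an opening parenthesis, and kernel iteration macros (e.g., `list_for_each_entry`) are deliberately not counted.

```python
import re

# Assumption: a conditional statement is an `if`, `for`, or `while` keyword
# followed by "(". Comments and string literals are not filtered out.
COND_RE = re.compile(r"\b(?:if|for|while)\s*\(")


def count_conditionals(diff_text):
    """Count added/deleted lines and conditional statements in a unified diff."""
    stats = {"added_lines": 0, "deleted_lines": 0,
             "added_cond": 0, "deleted_cond": 0}
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):  # file headers, not code
            continue
        if line.startswith("+"):
            stats["added_lines"] += 1
            stats["added_cond"] += len(COND_RE.findall(line))
        elif line.startswith("-"):
            stats["deleted_lines"] += 1
            stats["deleted_cond"] += len(COND_RE.findall(line))
    modified = stats["added_lines"] + stats["deleted_lines"]
    cond = stats["added_cond"] + stats["deleted_cond"]
    # Normalize per 10,000 modified lines to remove the effect of commit size.
    stats["cond_per_10k"] = 10_000 * cond / modified if modified else 0.0
    return stats
```

Dividing by the number of modified lines is what allows a small commit that touches one `if` statement to be compared with a large commit that touches many.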

Furthermore, we also examine the difference between KVICs and the general commits along the time scale. We plot the number of modified lines on a yearly basis for both KVICs (Figure 11) and general commits (Figure 10). Over the past 10 years, KVICs have tended to modify fewer lines, while general commits exhibit nearly no fluctuation. At the same time, the Linux kernel has grown in lines of code and become more complex. In addition, the number of modified lines of KVICs is consistently larger than that of the general commits. Since the Linux kernel is getting more complicated, it is becoming more likely for smaller and simpler commits to induce vulnerabilities.

Finding 1: KVICs change more lines of code, especially the conditional statements, than general commits. Meanwhile, the size of KVICs has been getting smaller over the years.

4.2 Modified Files’ Complexity of KVICs

Apart from the modified size, the original size of changed files can also influence the possibility of inducing vulnerabilities [28]. The intuition is that the complexity of a file is usually associated with the number of lines of code. The more complex a file is, the wider the attack surface will be, resulting in a higher chance for the authors or reviewers to neglect potential vulnerabilities. In this case, for each KVIC, we first locate the modified files. We then revert them to the version before each KVIC, and count the total number of lines of code for all the modified files. Since reverting the code for each commit is time-consuming, we randomly selected samples of sizes 1,000, 3,000, 5,000, 10,000, and 30,000 from the general commits. We calculated the original lines of the changed files of both KVICs and the sampled commits. Figure 12 shows the CDF plot. We noticed that the samples converge around the same curve regardless of the sample size, which shows that our samples are representative of the general commits and do not involve drastic variance. The CDF plots of the samples are clearly different from that of the KVICs, which shows that the files changed by KVICs contain more lines of code compared with general commits. In particular, 50% of the KVICs changed files consisting of more than 2,899 lines of code. This number drops to 1,597 for the samples on average. This indicates that changing complex files is associated with being more likely to induce vulnerabilities.

Finding 2: Changing complex files is associated with being more likely to induce vulnerabilities.

4.3 Complexity of KVICs

Shannon’s entropy can be used to quantify the complexity of commits. As shown by previous work [17, 23], complex commits are usually defect-prone. This makes sense, as complex commits may change many files and make modifications across various subsystems, where the authors and reviewers may not have complete knowledge of all the modified code, resulting in faults. We therefore also study the impact of commit complexity on inducing new vulnerabilities. Formally, given a commit \(C\), we define the complexity of \(C\) with its entropy \(Entropy_{C}\) as:

\begin{align*}Entropy_{C}=-\sum_{i=1}^{n}(P_{i}\times\log_{2}P_{i}),\end{align*}

where \(n\) denotes the number of files modified by \(C\), and \(P_{i}\) denotes the proportion of the modification that falls in file \(i\), namely the number of modified lines of code in file \(i\) divided by the total number of modified lines of code across all files in \(C\). Figure 13 shows the CDF plot comparing the entropy of KVICs with that of the general commits. In particular, 19.9% of the general commits have an entropy greater than 1, while for KVICs this number is 37.1%. An entropy greater than 1 indicates that the changes are spread more widely across different files; for reference, a commit that splits its changes evenly between two files has an entropy of exactly 1. Therefore, our results suggest that, relative to general commits, KVICs exhibit a higher degree of dispersion in their changes across files. It is important to note that the entropy values themselves do not have fixed thresholds or specific interpretations. In our study, we use entropy only to compare KVICs with general commits.
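As an illustration of the definition above, the following minimal Python sketch computes the entropy of a commit from its per-file modified-line counts. The function name and the example line counts are hypothetical and are not taken from our dataset.

```python
import math

def commit_entropy(lines_per_file):
    """Shannon entropy (base 2) of a commit's changes across files.

    lines_per_file: modified-line counts, one entry per modified file.
    """
    total = sum(lines_per_file)
    if total == 0:
        return 0.0
    entropy = 0.0
    for lines in lines_per_file:
        if lines == 0:
            continue  # a file with no modified lines contributes nothing
        p = lines / total  # P_i: share of the commit's changes in this file
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical commits: one concentrated in a single file, one spread out.
concentrated = commit_entropy([98, 1, 1])
dispersed = commit_entropy([30, 30, 40])
```

In this sketch, the concentrated commit yields an entropy below 1 and the dispersed one an entropy above 1, which matches the interpretation used in the comparison.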

As a previous work pointed out [23], the varying number of files in a software system is another factor that needs to be accounted for when computing entropy, and the resulting entropy value is called “Normalized Static Entropy.” This is because the number of files affects the maximum possible entropy that a certain modification could obtain, where further dividing by the maximum possible entropy (i.e., \(\log_{2}n\)) brings the comparison down to the same scale. The Normalized Static Entropy, H, is computed as follows:

\begin{align*}H(P) & =\frac{1}{\text{Max Entropy for Distribution}}\times H_{n}(P) \\& =\frac{1}{\log_{2}n}\times H_{n}(P) \\& =-\frac{1}{\log_{2}n}\times\sum_{k=1}^{n}(p_{k}\times\log_{2}p_{k}) \\& =-\sum_{k=1}^{n}(p_{k}\times\log_{n}p_{k}).\end{align*}

We then compute the entropy using the above equations, following the same paradigm as in Figure 13. The result is shown in Figure 14. Figures 13 and 14 show that KVICs still have clearly higher entropy than the general commits under the metric of normalized static entropy, thus reinforcing the conclusion that changes made by KVICs are more widely distributed across different files.
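The normalization can be sketched in Python as follows. This is a minimal illustration that assumes \(n\) is the number of files with at least one modified line in the commit, and that a single-file commit is assigned an entropy of 0 (since \(\log_{2}1=0\) would otherwise cause a division by zero); both choices are assumptions of the sketch.

```python
import math

def normalized_entropy(lines_per_file):
    """Entropy of a commit divided by its maximum possible value, log2(n)."""
    counts = [c for c in lines_per_file if c > 0]
    n = len(counts)
    if n < 2:
        return 0.0  # assumption: single-file (or empty) commits score 0
    total = sum(counts)
    h = -sum((c / total) * math.log2(c / total) for c in counts)
    return h / math.log2(n)

# Hypothetical commits: evenly spread changes reach the maximum of 1.
even = normalized_entropy([10, 10, 10, 10])
skewed = normalized_entropy([98, 1, 1])
```

Because the result always lies in [0, 1], commits touching different numbers of files can be compared on the same scale.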

Finding 3: KVICs have larger entropy compared with the general commits, indicating that changes made by KVICs are more widely distributed across different files.

4.4 Modified Content of KVICs

The Linux kernel consists of various modules (e.g., file system, network), which are maintained by different developers. Since some functionalities vary across modules, modifications to some modules may be more likely to induce vulnerabilities compared with the others. Based on this insight, we examined the modified contents of each KVIC at three different granularities, which are the modified files, directories, and subsystems.

In the Linux kernel, one file is usually responsible for one particular functionality. Meanwhile, files under the same directory usually have similar functionalities, and they collectively form a rather complex module. As for the subsystems, we define them as the top-level directories of the Linux kernel (e.g., net, drivers), whose scope is much larger than that of the directories. For example, the subsystem net is responsible for network-related functionalities, while drivers contains different kinds of drivers. Figures 15–17 show three ranked plots for the top-ten files, directories, and subsystems that are most frequently modified by KVICs. We also show the frequency of modification for the general commits, normalized by dividing by the total number of general commits and multiplying by the number of KVICs. This normalization brings the comparison between the KVICs and the general commits to the same scale (i.e., the number of modifications per 1,240 commits).

Note that we introduced two adjustments. First, we observe that directories or subsystems may contain varying numbers of files, and each file can have a different number of lines of code. This introduces bias, as directories or subsystems with more files or lines of code are more likely to be modified. To address this bias, we normalize the number of modifications to a directory or subsystem by dividing it by the total number of lines of code across all files within that directory or subsystem. To enhance readability in the figures, we multiply the resulting numbers by 1,000; this scaling does not alter the relative order of the data. Second, some directories contain only a few files, resulting in a relatively high normalized value if any of their files is modified by KVICs. To address this, we exclude directories that consist of fewer than 10 files.
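The normalizations described above can be sketched as follows. This is a minimal Python illustration; the function names, directory names, and all numbers in the example are hypothetical and only show the arithmetic.

```python
def scale_to_kvic_count(general_mods, total_general_commits, kvic_commits=1240):
    """Rescale a general-commit modification count to 'per 1,240 commits'."""
    return general_mods / total_general_commits * kvic_commits

def size_normalized(mods_by_dir, loc_by_dir, files_by_dir,
                    min_files=10, scale=1000):
    """Modifications per line of code (times 1,000), skipping tiny directories."""
    result = {}
    for directory, mods in mods_by_dir.items():
        if files_by_dir[directory] < min_files:
            continue  # fewer than 10 files: excluded to avoid inflated values
        result[directory] = mods / loc_by_dir[directory] * scale
    return result

# Hypothetical inputs, not taken from the dataset.
mods = {"net/a": 6, "fs/b": 6, "tiny": 3}
loc = {"net/a": 2000, "fs/b": 12000, "tiny": 100}
files = {"net/a": 12, "fs/b": 40, "tiny": 3}
normalized = size_normalized(mods, loc, files)
```

In this toy input, the two retained directories are modified equally often, but the smaller one receives the higher normalized value, which is the bias correction the first adjustment is meant to provide.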

As shown in Figure 15, 25 KVICs modified kernel/bpf/verifier.c, the highest number among all files. Meanwhile, 21 KVICs modified arch/x86/kvm/x86.c, which suggests that the x86 KVM code is particularly vulnerability-prone. As for the directories, shown in Figure 16, we noticed that net/ipv6/netfilter, net/ipv4/netfilter, and arch/x86/entry are the top three directories modified by KVICs, indicating that developers should pay more attention to the security effects of changing files under these directories. In addition, six of the top vulnerable directories are under the network subsystem (i.e., net/core, net/xfrm, net/ax25, net/ipv6/netfilter, net/ipv4/netfilter, and net/rds), and one is under the file system (i.e., fs/proc). Figure 17 shows that init ranks second among the subsystems in terms of files changed by KVICs. This deserves attention, as KVICs that change files in init can influence the initialization process of the Linux kernel, resulting in a huge impact. Modules such as the file system and the network tend to have more dependencies, which makes them vulnerability-prone, as the author needs to be familiar with all the related modules to ensure security. Another reason is that these modules have been tested thoroughly by various fuzzers over the years, so more of their vulnerabilities have been detected. In the future, further testing on other modules will be needed.

Finding 4: Changing the code of network, file system, kvm, and init is more associated with inducing vulnerabilities. Apart from this, particular files (e.g., kernel/bpf/verifier.c) should also be reviewed carefully.

4.5 Type and Severity of KVICs

Different KVICs may induce vulnerabilities of different types and severities. In this article, we use the CWE [6] to categorize vulnerabilities and the CVSS version 3.0 [5] to evaluate the severity of a CVE. To accomplish this, we crawled the CWE and severity information from NVD [9]. Table 3 shows the result. In total, 9 CWEs correspond to more than 800 KVICs, which are listed separately in the table.

Table 3.

CWE Num.	# KVICs	A. Lines	D. Lines	C. Lines	Entropy	Severity
416	178	592.9	35.4	628.3	0.770	6.94
476	120	876.0	45.8	921.8	0.983	5.73
119	107	1,443.7	61.1	1,504.8	1.011	7.32
362	89	494.7	48.9	543.6	0.885	6.43
200	84	1,430.8	45.6	1,476.4	0.950	5.39
401	84	3,760.4	497.5	4,257.9	1.198	6.00
787	75	1,111.8	79.8	1,191.6	0.986	7.50
20	72	458.9	52.9	511.8	0.909	6.93
125	63	1,071.3	14.3	1,085.6	0.845	6.79
Others	584	603.9	65.2	669.1	0.916	6.39
Total/Average	1,456	667.0	68.4	735.4	0.920	6.41

Table 3. The Statistics of the CWEs That Contain More Than 50 KVICs and the Average Value of the Related Characteristics

In the last row, we show the total number for the “# KVICs” metric and the average number for the other metrics. Note that the total adds up to 1,456 KVICs, since one KVIC can introduce more than one vulnerability, which can map to multiple CWEs. “A. Lines” means added lines, “D. Lines” means deleted lines, and “C. Lines” means changed lines. The names of the CWE numbers are as follows. 416: Use After Free; 476: NULL Pointer Dereference; 119: Improper Restriction of Operations within the Bounds of a Memory Buffer; 362: Race Condition; 200: Exposure of Sensitive Information to an Unauthorized Actor; 401: Missing Release of Memory after Effective Lifetime; 787: Out-of-bounds Write; 20: Improper Input Validation; and 125: Out-of-bounds Read.

Among them, Use After Free corresponds to the highest number of KVICs (i.e., 178), and more than half (i.e., 416, 476, 119, 401, 787, and 125) of the CWEs are related to memory. CVEs belonging to these CWEs have relatively high severities. For example, CWE 787 has the highest average severity (i.e., 7.50) and CWE 119 has the second highest (i.e., 7.32). Furthermore, we observed that the KVICs belonging to CWE 401 (Missing Release of Memory after Effective Lifetime) add, delete, and change by far the largest number of lines among all CWEs. They also have the highest entropy. This is because the KVICs under this CWE may change many files, and the lifetime of the memory may span different files, resulting in a missing memory release. In addition, the large number of changed lines makes it harder for maintainers to find the issue.

Finding 5: Memory-related vulnerabilities, which usually have higher severities, are still the most common in the Linux kernel. Meanwhile, KVICs that trigger CWE 401 (Missing Release of Memory after Effective Lifetime) have the highest entropy and number of modified lines.

5 Purposes of KVIC (RQ2)

In this section, we aim to understand the purpose of KVICs. We classify purposes into the following categories based on previous studies [24, 40, 43]:

– Correction: To fix an implementation bug or a disclosed vulnerability. The commit messages usually contain keywords like “fix,” “patch,” and so forth. They may also contain the “Fixes:” tag followed by a commit ID.

– Feature Addition: To introduce new features or add support for a new entity (e.g., drivers, modules). The commit messages usually contain keywords such as “add,” “support,” “implementation,” “introduce,” “initial,” and so forth.

– Merging: To merge commits without adding new code.

– Documentation: To add new technical documentation, code annotations, or “README” files. These commits usually do not introduce any new functionality.

– Optimization: To optimize the current kernel code. The optimizations are usually done by refactoring or cleaning up code, changing configurations, or rewriting existing APIs. The commit messages usually contain keywords like “better,” “rework,” “refactor,” “cleanup,” and so forth.

– Testing: To add (unit) testing code that evaluates the functionality, performance, and robustness of the code.

We labeled the 1,240 KVICs and classified them into the above-mentioned categories in three steps. First, we check the title of each commit, as it usually provides enough information for us to determine its category. For example, the commit 6f78193ee9ea has the title “HID: corsair: Add Corsair Vengeance K90 driver,” which we can easily label as “feature addition.” Second, if the title is not clear enough, we read the commit message carefully to understand the functionality of the commit. We are able to label the purposes of most commits with these two steps. Third, if we are still unsure of the purpose of the commit, we check the commit code in detail. In this way, we are able to label the purposes of all KVICs. We also sampled 1,240 commits for comparison, since labeling the purpose of all the general commits is time-consuming. Our previous evaluation (Section 4.2) shows that the sampled commits are representative. Table 4 shows the result. We notice that the KVICs fall into three categories only. This makes sense, as merging does not introduce new lines of code, while documentation affects non-functional content and should not induce new vulnerabilities. Testing code serves testing purposes and is not likely to induce vulnerabilities either. In particular, we notice that feature addition accounts for 50.5% (626/1,240) of the KVICs but only 19.2% (238/1,240) of the sampled commits, which is a huge difference. This indicates that adding new features or modules is associated with a higher chance of inducing vulnerabilities and deserves maintainers’ attention. We also found that 19.0% (236/1,240) of the KVICs aim to fix bugs. Although the reasons why fixes can induce bugs have been well studied previously [52], this problem still exists in the Linux kernel and can induce vulnerabilities.
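The title-based first step can be approximated by a keyword heuristic. The following minimal Python sketch is only an illustration built from the keywords listed in the categories above; it is not the manual labeling procedure we followed, and the priority order among categories is an assumption. Titles without a keyword return None and would still require reading the commit message or the code.

```python
import re

# Keywords taken from the category descriptions; checked in this order
# (the order is an assumption of this sketch).
KEYWORDS = {
    "correction": ("fix", "patch"),
    "feature addition": ("add", "support", "implementation",
                         "introduce", "initial"),
    "optimization": ("better", "rework", "refactor", "cleanup"),
}

def guess_purpose(title):
    """Return a first-pass category for a commit title, or None if unclear."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    for category, keywords in KEYWORDS.items():
        if words & set(keywords):
            return category
    return None  # fall back to reading the message and the code

label = guess_purpose("HID: corsair: Add Corsair Vengeance K90 driver")
```

Such a heuristic only matches exact keywords (e.g., it misses “fixes” or “added”), so its output should be treated as a hint for the manual steps rather than a final label.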

Table 4.

Category	Optimization	Correction	Feature Addition
All KVICs	378 \(\|\) 30.5%	236 \(\|\) 19.0%	626 \(\|\) 50.5%
Sampled Commits	592 \(\|\) 47.7%	345 \(\|\) 27.8%	238 \(\|\) 19.2%

Table 4. The Purposes of KVICs Compared with Sampled Commits

1,175 of the 1,240 sampled commits fall under Optimization, Correction, or Feature Addition. The remaining 65 commits belong to the other three purposes (i.e., Merging, Documentation, and Testing).

Finding 6: About half of the KVICs aim to add new features. Meanwhile, attempting to fix bugs or vulnerabilities can still further induce vulnerabilities.

6 Human Factors of KVICs (RQ3)

6.1 Author and Maintainer Experience

The author and the maintainer play a critical part. For a commit, the author is the one who submitted the initial commit, while the maintainer is the one who accepted it and got it merged. We calculated the average number of KVICs each author and maintainer induced per 100 commits with the formula \(\frac{V_{p}}{T_{p}}\times 100\). For a person \(p\), \(V_{p}\) denotes the number of induced vulnerabilities, while \(T_{p}\) is the total number of commits authored or maintained. According to Tables 5 and 6, people with rich experience (i.e., people who have been actively authoring and maintaining, as reflected by the number of times they have contributed as an author or maintainer) can still induce vulnerabilities with a relatively high frequency and severity. For example, M1 is very experienced and has maintained 1,286 commits. However, M1 still induces KVICs with a high frequency (i.e., 1.17 per 100 commits and 22.2 per 100K LoC) and relatively high severity (i.e., 6.61 on average). We also observed that experienced authors and maintainers are less likely to introduce vulnerabilities while fixing bugs than while making optimizations or implementing new features. For instance, approximately half of the top 10 authors and maintainers did not introduce any vulnerabilities while addressing bugs, whereas 9 out of the top 10 authors and 9 out of the top 10 maintainers introduced vulnerabilities when adding new features.
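As a worked example of the per-100-commits rate, the following short Python sketch reproduces the value reported for M1 in Table 5 from its \(V\) and \(T\) columns. The function name is hypothetical.

```python
def kvics_per_100_commits(v_p, t_p):
    """Average number of induced vulnerabilities per 100 commits: V_p / T_p * 100."""
    if t_p <= 0:
        raise ValueError("t_p must be positive")
    return v_p / t_p * 100

# M1 in Table 5: V = 15 induced vulnerabilities, T = 1,286 maintained commits.
m1_rate = round(kvics_per_100_commits(15, 1286), 2)
# A0 in Table 5: V = 3, T = 101.
a0_rate = round(kvics_per_100_commits(3, 101), 2)
```

Both rounded values agree with the “A” column of Table 5 (1.17 for M1 and 2.97 for A0).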

Table 5.

Author	\(V\)	\(P\)	\(V_{norm}\)	\(T\)	\(A\)	\(S\)	Maintainer	\(V\)	\(P\)	\(V_{norm}\)	\(T\)	\(A\)	\(S\)
A0	3	(2, 0, 1)	47.78	101	2.97	6.25	M0	2	(1, 0, 1)	36.49	107	1.87	5.50
A1	3	(1, 0, 2)	13.78	106	2.83	7.48	M1	15	(10, 1, 4)	22.20	1286	1.17	6.61
A2	4	(1, 1, 2)	70.56	144	2.78	6.90	M2	4	(2, 0, 2)	1.62	461	0.87	7.35
A3	4	(3, 1, 0)	19.76	160	2.5	7.65	M3	2	(0, 0, 2)	10.18	240	0.83	7.50
A4	4	(2, 1, 1)	40.27	171	2.34	5.50	M4	2	(0, 1, 1)	4.90	252	0.79	5.50
A5	5	(2, 0, 3)	36.61	230	2.17	6.87	M5	6	(0, 5, 1)	17.22	781	0.77	6.96
A6	4	(0, 1, 3)	22.66	196	2.04	5.82	M6	7	(2, 2, 3)	0.12	913	0.77	6.99
A7	2	(0, 0, 2)	22.18	101	1.98	3.30	M7	3	(3, 0, 0)	7.10	444	0.68	7.00
A8	6	(3, 0, 3)	36.55	306	1.96	6.56	M8	12	(6, 3, 3)	2.27	1867	0.64	6.67
A9	2	(1, 0, 1)	21.25	108	1.85	5.50	M9	1	(0, 0, 1)	0.03	156	0.64	7.10

Table 5. The Table of Top 10 Authors and Maintainers That Induced Vulnerabilities

“V” is the number of induced vulnerabilities; “P” is a tuple that breaks down the purposes of the induced vulnerabilities, in the order (optimization, correction, feature addition); “\(V_{norm}\)” is the number of induced vulnerabilities per 100,000 modified lines (i.e., normalized); “T” is the total number of commits authored or maintained; “A” is the average number of vulnerabilities per 100 commits; and “S” is the average severity per KVIC. We anonymized names for ethical reasons and removed people with \(T < 100\) for statistical significance.

Table 6.

Author	\(V\)	\(P\)	\(V_{norm}\)	\(T\)	\(A\)	\(S\)	Maintainer	\(V\)	\(P\)	\(V_{norm}\)	\(T\)	\(A\)	\(S\)
A0	2	(2, 0, 0)	122.17	204	0.98	6.05	M0	2	(1, 0, 1)	36.49	107	1.87	5.50
A1	4	(1, 1, 2)	70.56	144	2.78	6.90	M1	15	(10, 1, 4)	22.20	1286	1.17	6.61
A2	3	(2, 0, 1)	47.78	101	2.97	6.25	M2	6	(0, 5, 1)	17.22	781	0.77	6.96
A3	3	(2, 0, 1)	41.11	211	1.42	8.37	M3	16	(3, 0, 13)	13.11	3155	0.51	5.86
A4	2	(1, 0, 1)	40.91	161	1.24	6.00	M4	3	(2, 1, 0)	11.30	642	0.47	7.65
A5	4	(2, 1, 1)	40.27	171	2.34	5.50	M5	2	(0, 0, 2)	10.18	240	0.83	7.50
A6	3	(0, 0, 3)	40.04	306	0.98	7.50	M6	1	(0, 0, 1)	9.96	221	0.45	9.80
A7	6	(0, 1, 5)	37.40	386	1.55	5.96	M7	24	(9, 7, 8)	9.29	4148	0.58	5.85
A8	5	(2, 0, 3)	36.61	230	2.17	6.87	M8	15	(0, 4, 11)	8.77	3892	0.39	6.17
A9	6	(3, 0, 3)	36.55	306	1.96	6.56	M9	6	(0, 1, 5)	8.46	1507	0.40	6.23

Table 6. This Table Follows the Same Format as Table 5, with the Only Difference That the Rows are Now Ranked by \(V_{norm}\)

Finding 7: Experienced authors and maintainers can still induce vulnerabilities with high frequency and severity, especially while adding optimization or implementing new features.

6.2 Commit Knowledge

In the Linux kernel, each submitted patch has to be merged by assigned maintainers. Though the maintainers are very experienced, they may not be familiar with specific commits that make changes to specific files. Based on previous work [52], we propose the concept of commit knowledge to quantify how familiar a person is with a specific commit. Our intuition is that the more often a person has contributed to a file, the more knowledge they have about a commit that changes this file. Formally, given a commit \(C\) and a person \(P\), we denote the commit knowledge of \(P\) toward \(C\) as \(Kn(C)_{P}\), defined by formula (1) below:

\begin{align}Kn(C)_{P}=\sum_{i=1}^{n}\frac{Lines(F_{i})}{Lines(C)}\times Kn(F_{i})_{P}, \tag{1}\end{align}

where \(n\) is the total number of files changed by commit \(C\), \(Lines(F_{i})\) is the number of changed lines in file \(F_{i}\), \(Lines(C)\) denotes the total number of changed lines in commit \(C\), and \(Kn(F_{i})_{P}\) denotes the knowledge of the person \(P\) toward the specific file \(F_{i}\). In particular, \(Kn(F)_{P}\) is defined as \(N\), where \(N\) denotes the number of times the person \(P\) has contributed to the file \(F\) as an author. This means that a person accumulates one unit of knowledge with each contribution as an author.
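Formula (1) is a weighted average of per-file knowledge, with weights given by each file's share of the changed lines. A minimal Python sketch, with hypothetical file names and counts, is:

```python
def commit_knowledge(changed_lines, file_knowledge):
    """Kn(C)_P: line-weighted average of a person's per-file knowledge.

    changed_lines: {path: number of lines changed by the commit in that file}
    file_knowledge: {path: Kn(F)_P, here the number of past authored
                     contributions of the person to that file}
    """
    total = sum(changed_lines.values())
    if total == 0:
        return 0.0
    return sum(
        lines / total * file_knowledge.get(path, 0)  # unknown file: 0 knowledge
        for path, lines in changed_lines.items()
    )

# Hypothetical commit: 80 changed lines in a.c, 20 in b.c. The person has
# authored 10 earlier changes to a.c and none to b.c.
kn = commit_knowledge({"a.c": 80, "b.c": 20}, {"a.c": 10})
```

Here the result is 0.8 × 10 + 0.2 × 0 = 8, so a person's knowledge counts most for the files where the commit changes the most lines.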

Figure 18 shows the CDF plot of the maintainers’ knowledge and the ideal knowledge for the 1,240 KVICs we collected. Note that the ideal knowledge is the maximum knowledge among all developers toward one specific commit. The person with the maximum knowledge is more familiar with the commit than anyone else and is considered the ideal maintainer. We notice that the ideal knowledge is much higher than the actual maintainers’ knowledge, indicating that the maintainers who commit the patch are not always the most experienced ones. Specifically, the maintainers’ knowledge is 34.94 on average while the ideal one is 86.75, which is more than twice the actual value.

Considering that a person can also accumulate experience by contributing as a maintainer, we extend the definition of commit knowledge to include a person’s past experience as a maintainer. In particular, we revise \(Kn(F)_{P}\) as \(N+0.5\times M\), where \(M\) denotes how many times the person \(P\) has contributed to the file \(F\) as a maintainer. Note that 0.5 is an empirically chosen weight, reflecting that a maintainer who reviewed the code was less involved than its author. Figure 19 shows the result when we consider the maintainers’ experience. There is still a gap between the knowledge of the actual maintainers and the ideal ones. On average, the maintainers’ knowledge is 80.92 while the ideal one is 138.73. It is important to note that the knowledge values do not have fixed thresholds; they are used only to evaluate the familiarity of a person with a specific commit.
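The extended definition and the notion of ideal knowledge can be sketched together as follows. This is a minimal Python illustration; the people, files, and contribution counts are hypothetical.

```python
def file_knowledge(n_authored, n_maintained, maintainer_weight=0.5):
    """Kn(F)_P = N + 0.5 * M."""
    return n_authored + maintainer_weight * n_maintained

def commit_knowledge(changed_lines, person_history):
    """Line-weighted knowledge of one person toward one commit.

    person_history: {path: (times authored, times maintained)}
    """
    total = sum(changed_lines.values())
    if total == 0:
        return 0.0
    score = 0.0
    for path, lines in changed_lines.items():
        n_authored, n_maintained = person_history.get(path, (0, 0))
        score += lines / total * file_knowledge(n_authored, n_maintained)
    return score

def ideal_knowledge(changed_lines, history):
    """Person with the maximum commit knowledge, and that maximum."""
    best = max(history, key=lambda p: commit_knowledge(changed_lines, history[p]))
    return best, commit_knowledge(changed_lines, history[best])

# Hypothetical commit and contribution history.
commit = {"a.c": 60, "b.c": 40}
history = {
    "alice": {"a.c": (4, 2)},                 # 0.6 * (4 + 0.5 * 2) = 3.0
    "bob": {"a.c": (1, 0), "b.c": (10, 0)},   # 0.6 * 1 + 0.4 * 10 = 4.6
}
ideal_person, ideal_value = ideal_knowledge(commit, history)
```

In this toy example, bob would be the ideal maintainer even though alice has more history on the most heavily changed file, because the score aggregates over all files touched by the commit.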

It is worth noting that time can be a confounding factor in this analysis. To address this, we updated our formula to include a time-decay factor. Specifically, instead of 1, we use the value max(1 \(-\) \(\alpha\), 0) as the experience a person accumulates from making a contribution. Here, \(\alpha\) captures the decay of knowledge and is computed as the number of days between the contribution and the commit date, divided by 3,650 days (around 10 years). We use 3,650 days because we mainly focus on the KVICs of the past ten years. A contribution made just before the commit has no decay (i.e., \(\alpha=0\)), while a contribution made 10 years earlier is not counted as experience (i.e., \(\alpha=1\)). For example, if a person made a contribution 9 years ago, then the corresponding experience would be max(1 \(-\) 3,285/3,650, 0) \(=\) 0.1, which is rather small as it has decayed with time. In addition, the knowledge of a person toward a file (i.e., \(Kn(F_{i})_{P}\)) is adjusted to be the sum of the experience values from all of their past contributions. After considering the time-decay factor, we find that the ideal knowledge for KVICs is 10.18, while the actual knowledge of the selected maintainers is 5.26. This result is consistent with our previous conclusion, suggesting that the knowledge of the ideal maintainers is approximately twice as high as that of the selected maintainers. Meanwhile, we sampled 10,000 general commits and found that their commit knowledge is 7.29, which is higher than that of the KVICs. This further indicates that the induction of vulnerabilities is associated with lower commit knowledge.

Finding 8: The maintainers of KVICs may not have the highest commit knowledge, which means they may not be the people most familiar with the submitted changes.
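The time-decay factor can be sketched in a few lines of Python. The function names are hypothetical, and the list of contribution ages in the example is made up for illustration.

```python
def decayed_experience(days_since_contribution, horizon_days=3650):
    """Experience from one contribution: max(1 - alpha, 0), alpha = days / 3,650."""
    alpha = days_since_contribution / horizon_days
    return max(1 - alpha, 0)

def decayed_file_knowledge(contribution_ages_in_days):
    """Kn(F)_P with decay: sum of decayed experience over past contributions."""
    return sum(decayed_experience(days) for days in contribution_ages_in_days)

# The example from the text: a contribution made 9 years (3,285 days) ago.
nine_years = decayed_experience(3285)

# Hypothetical history: contributions made 0, 1,825, 3,650, and 5,000 days ago.
knowledge = decayed_file_knowledge([0, 1825, 3650, 5000])
```

The first value evaluates to about 0.1, as in the text, and the hypothetical history sums to 1 + 0.5 + 0 + 0 = 1.5, showing that contributions older than the 10-year horizon no longer add knowledge.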

6.3 Roles of People Involved in KVICs

Apart from authors and maintainers, many other roles may be involved in the patching process. They are “Acked-by,” “Cc,” “Tested-by,” “Reviewed-by,” “Reported-by,” and “Suggested-by.” “Acked-by” denotes that a person has reviewed the commit and accepted the change. For example, “it looks good to me” is usually converted to “Acked-by.” “Cc” means that the listed people were notified of the commit; they may not have provided any comments. “Reviewed-by” is similar to “Acked-by” but is more formal and stronger: “Reviewed-by” means the reviewer is satisfied with the commit, while “Acked-by” may only acknowledge part of the commit. The remaining three roles (i.e., “Tested-by,” “Reported-by,” “Suggested-by”) are self-explanatory. In summary, people in these roles can join the discussion of the commit and influence the final decision.

Table 7 shows the average number of people in each role per commit for KVICs compared with the general commits. Among all the roles, we notice that “Reviewed-by” and “Tested-by” involve fewer people on average in KVICs. In particular, the gap for the “Reviewed-by” role is the largest, where KVICs have 41% (i.e., \(\frac{0.29-0.17}{0.29}\)) fewer reviewers than the general commits. To further investigate the effect of reviewers, we also examined how the number of reviewers changes over time. Figure 20 shows the result. For the general commits, the number of reviewers steadily increases from 0.099 in 2011 to 0.481 in 2022, which is a 385% growth. In contrast, there is no such clear trend for KVICs, whose number of reviewers oscillates back and forth. These observations indicate that having fewer reviewers is noticeably associated with the induction of vulnerabilities. This suggests that one simple yet neglected reason behind vulnerabilities may be the lack of reviewing.
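The per-commit role averages can be obtained by counting the corresponding tags in commit messages. The following minimal Python sketch shows one way to do so; the commit messages and e-mail addresses are invented for illustration, and the parsing rule (a line of the form “Tag: value”) is an assumption of the sketch rather than a description of our tooling.

```python
from collections import Counter

ROLES = ("Acked-by", "Cc", "Tested-by", "Reviewed-by",
         "Reported-by", "Suggested-by")

def count_roles(message):
    """Count role tags of the form 'Tag: value' in one commit message."""
    canonical = {role.lower(): role for role in ROLES}
    counts = Counter()
    for line in message.splitlines():
        tag, sep, value = line.strip().partition(":")
        role = canonical.get(tag.lower())
        if role and sep and value.strip():
            counts[role] += 1
    return counts

def average_roles(messages):
    """Average number of people per role per commit."""
    totals = Counter()
    for message in messages:
        totals.update(count_roles(message))
    n = len(messages)
    return {role: (totals[role] / n if n else 0.0) for role in ROLES}

def relative_gap(general, kvic):
    """Relative shortfall of KVICs versus general commits."""
    return (general - kvic) / general

# Invented commit messages for illustration only.
messages = [
    "net: example change\n\nBody text.\n\n"
    "Reviewed-by: Reviewer One <r1@example.org>\n"
    "Cc: list@example.org\n"
    "Signed-off-by: Author <a@example.org>\n",
    "fs: another example\n\nAcked-by: Maintainer <m@example.org>\n",
]
averages = average_roles(messages)
reviewed_gap = round(relative_gap(0.29, 0.17), 2)  # values from Table 7
```

Applied to the “Reviewed” column of Table 7, the relative gap evaluates to 0.41, which is the 41% reported above.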

Table 7.

Category	Acked	Cc	Reported	Reviewed	Suggested	Tested
KVICs	0.17	0.94	0.06	0.17	0.02	0.04
General Commits	0.17	0.59	0.05	0.29	0.01	0.06

Table 7. The Average Number of Specific Roles

Fig. 20.

Finding 9: KVICs have fewer reviewers than commits in general, even though the Linux kernel as a whole has been involving more and more reviewers over the years.

7 Suggestions

Understanding the KVICs can help to build better defect prediction tools and reduce potential human errors. Based on the insights and findings that we derived, we propose several suggestions.

First, commits with specific characteristics (i.e., commits that change many conditional statements or large files, or that have large entropy) are more likely to induce vulnerabilities and should be carefully reviewed.

Second, pay more attention to commits that add new features, as KVICs are characterized by having a much higher proportion of the “feature addition” category.

Third, involve more reviewers in commits, as a major contrast between KVICs and the general commits in terms of human factors is that KVICs have far fewer reviewers. Admittedly, there are many commits, and the number of reviewers is low relative to the number of commits; however, it is promising to see a clear trend of more reviewers being involved in commits over the past decade. On the other hand, we do not see a similar trend for KVICs, which means that KVICs still lack extensive reviewing, and this could be one reason behind the introduction of vulnerabilities.

Fourth, commits that change particular modules (i.e., init, net, kvm, mm, and fs) should be more carefully reviewed, since we find that such commits have a much higher chance of inducing defects than others.

Fifth, involve more people with higher commit knowledge in the patch process. We notice that the maintainer of a commit may not always have the highest commit knowledge. In this case, authors or maintainers could consult people (i.e., reviewers) with higher commit knowledge for feedback. While it might be unrealistic to expect the most knowledgeable maintainer to always be available for involvement, our intention is not to guarantee their constant availability. Rather, our proposed commit knowledge scoring system can recommend maintainers with higher levels of knowledge compared to random assignment. By considering the commit knowledge, we anticipate an improvement in the selection of maintainers, increasing the likelihood of involving individuals who possess greater knowledge and expertise in the patch process, reducing the induction of vulnerabilities.

Sixth, implement a stricter commit messaging policy, where developers should use particular tags (e.g., “Fixes,” “Reviewed-by”) for categorizing commits and refer to the CVE ID when patching vulnerabilities. Although including fix tags in commit messages is already part of the Linux kernel’s patch submission guidance, our analysis revealed that a significant number of KVFCs did not adhere to this practice and lacked fix tags. Specifically, out of the 2,290 identified KVFCs, only 312 followed this suggestion. This observation underscores the need to explore alternative methods to raise awareness and emphasize the importance of using fix tags among Linux kernel maintainers. Doing so would provide rich structural information about the commits for future research and commit reviews. It is worth noting that this suggestion has also been proposed in an existing study [31].

Seventh, researchers are encouraged to develop new algorithms for identifying KVICs automatically. This would help to collect more data and contribute to a deeper understanding of vulnerabilities. Our study of KVICs provides both a comprehensive dataset and useful metrics that can be adopted by related work in the future.

8 Discussion

Threats to Validity. Threats to validity mainly lie in the identified KVICs. First, our data can have false negatives, as some KVICs may not be referred to in the commits or listed on the vendors’ sites. However, this does not constitute a significant threat, as our major goal is to study the general characteristics of KVICs. Since we already collected many KVICs without any bias and have arrived at statistically significant conclusions, a few missing KVICs will not drastically overturn our results. Admittedly, a more relaxed methodology could find a higher number of KVFCs and KVICs, but it could in turn produce more false positives. Since our main goal is to examine the general characteristics of KVICs, we wanted to reduce the number of false positives, as they could bias the result of our study. Second, our data may have false positives. To reduce this threat, we employed various data collection methods and checked for conflicts among them. We only adopted the data that received unanimous agreement from all methods (including an existing dataset that was curated by other researchers), and used a two-author manual verification process, which requires minimal manual effort, to resolve conflicts. This greatly reduces false positives and makes our analysis of KVICs more robust. Third, we discuss the quality of our manual collection process. Manual effort was involved for only 21% (260/1,240) of the KVICs and 23% (538/2,290) of the KVFCs; most of the collection process was automated. For the manual part, two authors conducted the labeling independently, an approach used by previous studies [20]. The final label was decided only upon both authors’ mutual agreement.

Generalizability of Our Findings to Other Open-source Software (OSS) Projects. We focus on the Linux kernel as it is one of the largest and most popular open-source software projects, with a well-organized structure and maintenance process. However, our analysis framework, derived insights, and proposed suggestions can be applied to other open-source software projects that follow a development process similar to that of the Linux kernel. We believe that our findings are inherent to this development process rather than to a particular software project. For example, our finding that KVICs in the Linux kernel lack reviewing should raise concerns for developers of other projects, as the significance of commit reviewing is project-independent. Similarly, our study of other aspects of KVICs (e.g., complexity, entropy, purpose, and human factors) can also provide valuable insights to other OSS projects.

Novelty of Our Work. First, almost all the previous studies focused on bugs, while we examine KVICs. This makes our work fundamentally different from previous ones. Unlike bugs, which are just functional deviations from the expectation, vulnerabilities can be exploited by attackers with serious consequences. Second, we propose a systematic methodology for collecting KVICs and construct the most complete dataset. Furthermore, we will publish the whole dataset to the community to motivate future research. Third, we evaluate the characteristics, purposes, and human factors of KVICs from many different perspectives with different metrics. Some of them (e.g., commit knowledge) are newly proposed by this article. Though not all of the findings are surprising, we support them with solid, quantitative data. We therefore believe that our findings and the proposed dataset are valuable to the whole community.

Applicability of Our Findings to Linux Kernel Maintenance. In this article, we proposed many insights for the Linux kernel, which can be applied in future kernel development and maintenance. First, we found that many commits did not follow the suggestion of the kernel patch guide [8] that specific tags (e.g., “fixes”) should be provided. If this missing structural information is supplied, future work on vulnerability analysis and timely detection will be easier to carry out. This finding urges the kernel maintainers to develop a scheme that enforces stronger structural information in commit messages. Second, the dataset that we release will help kernel maintainers and researchers with further vulnerability detection and analysis. In particular, our insights will help with feature selection for tools such as defect predictors. Third, our proposed metric of commit knowledge can be adopted by the kernel maintainers to build a reviewer assignment system that keeps track of reviewers’ knowledge and pairs each commit with the most knowledgeable reviewers. Our finding of a noticeable reviewer knowledge gap also calls for the development of such a tool.
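As an illustration of what enforcing structural information could look like, the following is a minimal sketch of a commit-message check for the “Fixes” tag. It assumes the conventional form of the tag, an abbreviated commit ID of at least 12 hexadecimal characters followed by the quoted one-line summary of the referenced commit; the names `FIXES_RE` and `extract_fixes` are hypothetical, and a real enforcement scheme would also need to verify that the referenced commit exists.

```python
import re
from typing import List

# Assumed tag shape: Fixes: <12 to 40 hex chars> ("<one-line summary>")
FIXES_RE = re.compile(
    r'^Fixes:[ \t]+([0-9a-f]{12,40})[ \t]+\("(.+)"\)[ \t]*$',
    re.MULTILINE,
)


def extract_fixes(message: str) -> List[str]:
    """Return the commit IDs named by well-formed Fixes tags in `message`."""
    return [match.group(1) for match in FIXES_RE.finditer(message)]


def has_wellformed_fixes(message: str) -> bool:
    """True if the commit message carries at least one well-formed Fixes tag."""
    return bool(extract_fixes(message))
```

A check of this kind could run when a patch is submitted, so that a fixing commit without a machine-readable link to the commit it repairs is flagged before it is merged.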

9 Related Work

Vulnerability Empirical Study. Many empirical studies have been conducted on software defects. For example, Yin et al. [52] studied how fixes become bugs. They analyzed many bugs and found that the fixer’s knowledge is one of the reasons for inducing bugs. Our study focuses on vulnerabilities instead of bugs and proposes the concept of commit knowledge, which is more specific. Bugs in different systems have also been widely studied (e.g., TensorFlow [53], autonomous driving software [20], GPU [51]). Researchers have also studied other types of bugs [15, 20, 22, 38, 47, 53]. Unlike a bug, which is a functional deviation from the expected behavior, a vulnerability can be exploited by attackers, with serious consequences. Alexopoulos et al. [12] studied vulnerabilities with a focus on their lifetime. Other empirical studies [29, 41] focus on security patches across various platforms, whereas we mainly look at KVICs rather than patches.

Vulnerability Dataset. The research community has proposed multiple datasets for different OSS systems. Alexopoulos et al. [12] built a dataset that maps CVEs to their fixing commits and estimated vulnerable commits via a “git blame”-based approach. Nikitopoulos et al. [36] built a dataset targeting machine learning applications, which maps KVFCs only to the modified files, not to commits. Zhou et al. [54] used two deep neural networks to identify KVFCs and found many patches in real-world projects. However, these works mainly focus on building datasets of patches that fix vulnerabilities. Our work complements the existing ones by presenting a large dataset of KVICs, coupled with their corresponding fixing commits and CVE IDs.

Software Defect Prediction. Many works study how to detect software defects [19, 21, 25, 26, 30, 32, 35, 37, 44, 46, 48, 54]. Most of them use machine learning algorithms, including decision trees, SVMs, and deep neural networks. Our empirical study contributes to feature-based approaches (e.g., decision trees) by providing valuable insights into the characteristics of KVICs, and to data-driven approaches (e.g., deep neural networks) by building a large dataset of KVICs with their associated CVEs and KVFCs. In addition, our proposed analysis framework and the associated metrics provide insights into evaluating and improving existing vulnerability prediction tools.

Vulnerability Origin Detection. Many works [16, 18, 27, 39, 42, 50] have been proposed to locate the origin of vulnerabilities or bugs. However, most of them have been shown not to work well in practice [49]. Meanwhile, tools like V0Finder [50] aim to locate the affected software instead of vulnerability inducing commits. Recently, SZZUnleashed [14] open-sourced an implementation of the SZZ algorithm, while the state-of-the-art tool V-SZZ [13] focuses on identifying vulnerability inducing commits based on the previous model [39]. However, we found that neither SZZUnleashed nor V-SZZ works well at identifying KVICs. Specifically, we fed the KVFCs of our 1,240 identified KVICs to these tools. SZZUnleashed and V-SZZ identified only 201 (16.2%) and 696 (56.1%) of them, respectively. We also randomly selected 20 samples from the remaining KVICs that these tools did not identify and manually verified that all of them were indeed missed. Our investigation shows that existing vulnerability origin detection tools still have considerable room for improvement. Our identified KVICs can serve as a dataset to speed up further research.
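The comparison above amounts to measuring how many known KVICs a tool recovers when given the corresponding KVFCs. A minimal sketch of that bookkeeping follows; the function names and the dictionary layout (fixing commit ID mapped to a set of inducing commit IDs) are assumptions for illustration and do not describe the actual output format of SZZUnleashed or V-SZZ. Note that the sketch counts (KVFC, KVIC) pairs, so a KVIC linked to several KVFCs would be counted once per link.

```python
from typing import Dict, Set


def count_identified(ground_truth: Dict[str, Set[str]],
                     tool_output: Dict[str, Set[str]]) -> int:
    """Count ground-truth inducing commits that the tool also reports
    for the same fixing commit."""
    hits = 0
    for kvfc, kvics in ground_truth.items():
        # Intersect the known KVICs with whatever the tool reported for this KVFC.
        hits += len(kvics & tool_output.get(kvfc, set()))
    return hits


def identified_share(identified: int, total: int) -> float:
    """Percentage of identified KVICs, rounded to one decimal place."""
    return round(100.0 * identified / total, 1)
```

With the counts reported above, `identified_share(201, 1240)` gives 16.2 and `identified_share(696, 1240)` gives 56.1, matching the stated percentages.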

10 Conclusion

We conducted the first comprehensive study of 1,240 KVICs, which were identified by combining 10 different methods. Specifically, we extracted various characteristics of the KVICs from a diverse range of perspectives, including the modified size, conditional statements, content, complexity, CVE types, and CVE severity. In addition, we studied the purposes of these KVICs and the human factors involved. Our study yields many interesting findings, and we propose several suggestions based on these findings. The dataset of this paper is available at https://github.com/jinan789/Understanding-Vulnerability-Inducing-Commits-of-the-Linux-Kernel.

Acknowledgement

We would like to thank the anonymous reviewers and editors for their comments that greatly helped improve the presentation of this paper.

References

[1] 2024. Android Security Bulletin. Retrieved from https://source.android.com/security/bulletin
