DOI: 10.1145/3663533.3664036
research-article
Open access

MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery

Published: 10 July 2024

Abstract

Vulnerability datasets have become an important instrument in software security research, where they are used to develop automated, machine learning-based approaches to vulnerability detection and patching. Yet any limitations of these datasets may translate into inadequate performance of the resulting solutions. For example, the limited size of a vulnerability dataset may restrict the applicability of deep learning techniques. In this work, we design and implement a novel workflow that uses several heuristics to combine state-of-the-art methods for gathering CVE fix commits. As a result of these improvements, we have gathered the largest programming language-independent real-world dataset of CVE vulnerabilities with their associated fix commits. Our dataset, containing 26,617 unique CVEs from 6,945 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings about a 397% increase in CVEs, a 295% increase in covered open-source projects, and a 480% increase in fix commits. Our larger dataset thus substantially improves on current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We release to the community a 14GB PostgreSQL database that contains information on CVEs up to January 24, 2024, the CWEs of each CVE, the files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we also make our dataset collection tool available to the community.
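To put the reported percentages in perspective, the short sketch below back-computes the prior-work sizes they imply. It assumes each increase is measured as (new - old) / old against a single earlier dataset, which the abstract does not state explicitly; the helper name `implied_baseline` is hypothetical, and the results are rough estimates rather than figures from the paper.

```python
def implied_baseline(new_count: int, pct_increase: float) -> float:
    """Earlier dataset size implied by a percentage increase.

    Assumes pct_increase = (new - old) / old * 100, so old = new / (1 + pct / 100).
    """
    return new_count / (1 + pct_increase / 100)


# Counts and percentage increases as reported in the abstract.
reported = {
    "CVEs": (26_617, 397),
    "projects": (6_945, 295),
    "fix commits": (31_883, 480),
}

for name, (count, pct) in reported.items():
    baseline = implied_baseline(count, pct)
    print(f"{name}: {count} now, roughly {baseline:.0f} implied for prior work")

# Average number of fix commits linked to each CVE in the dataset.
commits_per_cve = 31_883 / 26_617
print(f"fix commits per CVE: {commits_per_cve:.2f}")
```

Because the percentages in the abstract are rounded, the implied baselines are only accurate to within a few dozen entries.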

    Information

    Published In

    PROMISE 2024: Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering
    July 2024
    65 pages
    ISBN: 9798400706752
    DOI: 10.1145/3663533
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CVE
    2. Vulnerability dataset
    3. dataset
    4. open-source
    5. real-world vulnerability dataset
    6. software repository mining

    Qualifiers

    • Research-article

    Funding Sources

    • Dutch Research Council (NWO)

    Conference

    PROMISE '24
    Acceptance Rates

    Overall Acceptance Rate 98 of 213 submissions, 46%
