research-article

Open access

MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery

Authors:

Jafar Akhoundali,

Sajad Rahim Nouri,

Kristian Rietveld,

Olga GadyatskayaAuthors Info & Claims

PROMISE 2024: Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering

Pages 42 - 51

https://doi.org/10.1145/3663533.3664036

Published: 10 July 2024 Publication History

PDF eReader

Abstract

Vulnerability datasets have become an important instrument in software security research, being used to develop automated, machine learning-based vulnerability detection and patching approaches. Yet, any limitations of these datasets may translate into inadequate performance of the developed solutions. For example, the limited size of a vulnerability dataset may restrict the applicability of deep learning techniques. In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commits gathering. As a consequence of our improvements, we have been able to gather the largest programming language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset containing 26,617 unique CVEs coming from 6,945 unique GitHub projects is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings about a 397% increase in CVEs, a 295% increase in covered open-source projects, and a 480% increase in commit fixes. Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We release to the community a 14GB PostgreSQL database that contains information on CVEs up to January 24, 2024, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.

References

[1]

2022. Machine Learning for Source Code Vulnerability Detection: What Works and What Isn’t There Yet. IEEE Security & Privacy, 20, 5 (2022), 60–76.

Abstract

References

Cited By

Index Terms

Recommendations

CVEfixes: automated collection of vulnerabilities and their fixes from open-source software

IPOD: A Large-scale Industrial and Professional Occupation Dataset

IncluSet: A Data Surfacing Repository for Accessibility Datasets

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Badges

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations