
There is a newer version of the record available.

Published May 17, 2024 | Version v1
Dataset Open

MoreFixes: Largest CVE dataset with fixes

  • 1. Leiden University
  • 2. Islamic Azad University, Ramsar

Description

In this work, we designed and implemented a novel workflow that uses several heuristics to combine state-of-the-art methods for gathering CVE fix commits. As a result, we were able to gather the largest programming-language-independent, real-world dataset of CVE vulnerabilities with their associated fix commits.
  Our dataset contains 26,617 unique CVEs from 6,945 unique GitHub projects and is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings about a 397% increase in CVEs, a 295% increase in covered open-source projects, and a 480% increase in commit fixes.
  Our larger dataset thus substantially improves on current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources for our pipeline.

We release to the community a 14GB PostgreSQL database that contains information on CVEs up to January 24, 2024, CWEs of each CVE, files and methods changed by each commit, and repository metadata.
  Additionally, the patch files related to the fix commits are available as a separate package. We also make our dataset collection tool available to the community.

The `cvedataset-patches.zip` file contains the fix patches, and `dump_morefixes_27-03-2024_19_52_58.sql.zip` contains a PostgreSQL dump of the fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, and changed methods.

The MoreFixes data-storage strategy follows that of CVEFixes for storing CVE fix commits from open-source repositories, and it uses a modified version of Prospector (part of ProjectKB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", which will be published at the PROMISE conference (2024).

For more information about usage and sample queries, visit the GitHub repository: https://github.com/JafarAkhondali/Morefixes

If you use this dataset, please be aware that the repositories we mined are under different licenses, and you are responsible for handling any licensing issues. The same applies to CVEFixes.

This product uses the NVD API but is not endorsed or certified by the NVD.

This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).

 

To restore the dataset, you can use the docker-compose file available in the GitHub repository. Default credentials for the dataset after restoring the dump:

POSTGRES_USER=postgrescvedumper
POSTGRES_DB=postgrescvedumper
POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
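As a minimal sketch of how these credentials might be used from a client, the snippet below parses the KEY=VALUE lines and assembles a libpq-style connection URI. The host `localhost` and port `5432` are assumptions (they depend on how the docker-compose file in the repository exposes the database), and the helper names `parse_env` and `build_dsn` are hypothetical, not part of the MoreFixes tooling.

```python
from urllib.parse import quote

# Credentials exactly as listed in the record description.
CREDENTIALS = """\
POSTGRES_USER=postgrescvedumper
POSTGRES_DB=postgrescvedumper
POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
"""


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank and comment lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def build_dsn(env, host="localhost", port=5432):
    """Build a postgresql:// URI. host and port are assumed defaults."""
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


dsn = build_dsn(parse_env(CREDENTIALS))
print(dsn)
```

The resulting URI can be passed to any PostgreSQL client that accepts libpq connection URIs, once the dump has been restored and the server is reachable at the assumed host and port.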

 

Files

cvedataset-patches.zip

Files (5.1 GB)

Checksum Size
md5:d1ec5c32b24e0f0393fba5eacc3ba8aa 1.9 GB
md5:10673aaa111ba637e658915e48a8812c 3.2 GB
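The MD5 checksums listed for the files can be used to check a download for corruption. The following is a minimal sketch using only Python's standard library; the function names are hypothetical, and since the listing does not say which checksum belongs to which archive, the caller supplies both the local path and the expected value.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Return True if the file's MD5 matches `expected`.

    `expected` may be written as 'md5:<hex>' (as on the record page)
    or as the bare hex digest.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("cvedataset-patches.zip", "md5:...")` would be called with whichever listed checksum corresponds to that archive on the record page.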

Additional details

Related works

Is compiled by
Software: 10.5281/zenodo.11110595 (DOI)
Is documented by
Conference paper: 10.1145/3663533.3664036 (DOI)

Software

Repository URL
https://github.com/JafarAkhondali/Morefixes
Programming language
Python
Development Status
Active