DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Authors:

David Wagner

RAID '23: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses

Pages 654 - 668

https://doi.org/10.1145/3607199.3607242

Published: 16 October 2023

Abstract

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined.

Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, owing to high false positive rates, low F1 scores, and the difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might help improve their ability to generalize to unseen projects.
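To make the two metrics named above concrete, the following minimal sketch computes the false positive rate and the F1 score from confusion-matrix counts. The function names and the "always non-vulnerable" baseline are illustrative assumptions and are not taken from the paper; only the two dataset sizes come from the abstract.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): share of non-vulnerable functions flagged as vulnerable."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Dataset sizes reported in the abstract.
N_VULNERABLE = 18_945
N_NON_VULNERABLE = 330_492

# Hypothetical baseline (an assumption for illustration): a classifier that
# labels every function as non-vulnerable.
tp, fp = 0, 0
fn, tn = N_VULNERABLE, N_NON_VULNERABLE

accuracy = (tp + tn) / (tp + fp + fn + tn)
baseline_f1 = f1_score(tp, fp, fn)
baseline_fpr = false_positive_rate(fp, tn)

print(f"accuracy={accuracy:.4f}  F1={baseline_f1:.4f}  FPR={baseline_fpr:.4f}")
```

Because vulnerable functions are a small minority of the dataset, this trivial baseline reaches a high accuracy while its F1 score is zero. That is one plausible reason to evaluate detectors by F1 and false positive rate rather than by accuracy alone.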

We also identify future research directions. We demonstrate that large language models (LLMs) are a promising direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source-code-specific pre-training objectives is a promising way to improve vulnerability detection performance.

Information

Published In

RAID '23: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses

October 2023

769 pages

ISBN: 9798400707650

DOI: 10.1145/3607199

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2023

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

RAID 2023

RAID 2023: The 26th International Symposium on Research in Attacks, Intrusions and Defenses

October 16 - 18, 2023

Hong Kong, China
