Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)

Published: 13 July 2023

Abstract

The complexity of software systems and the diversity of security vulnerabilities are plausible sources of the persistent challenges in software vulnerability research. Applying deep learning to automatic vulnerability detection has proven an effective complement to traditional detection approaches. Unfortunately, the lack of well-qualified benchmark datasets can critically restrict the effectiveness of deep learning-based vulnerability detection techniques. In particular, the long-standing presence of erroneous labels in existing vulnerability datasets may lead to inaccurate, biased, and even flawed results.
In this paper, we aim to obtain an in-depth understanding and explanation of the causes of label errors. To this end, we systematically analyze the diverse datasets used by state-of-the-art learning-based vulnerability detection approaches, and examine the techniques they use to collect vulnerable source code. We find that label errors heavily impact mainstream vulnerability detection models, with a worst-case average F1 drop of 20.7%. As mitigation, we introduce two dataset denoising approaches, which enhance model performance by an average of 10.4%. Leveraging these denoising methods, we provide a feasible solution for obtaining high-quality labeled datasets.




Published In

ISSTA 2023: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2023
1554 pages
ISBN:9798400702211
DOI:10.1145/3597926

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. denoising
  3. vulnerability detection

Qualifiers

  • Research-article

Conference

ISSTA '23
Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Cited By

  • (2024) Beyond Fidelity: Explaining Vulnerability Localization of Learning-Based Detectors. ACM Transactions on Software Engineering and Methodology, 33(5), 1–33. https://doi.org/10.1145/3641543. Online publication date: 4-Jun-2024.
  • (2024) On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology, 33(5), 1–26. https://doi.org/10.1145/3641540. Online publication date: 3-Jun-2024.
  • (2024) An Empirical Study on Noisy Label Learning for Program Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–12. https://doi.org/10.1145/3597503.3639217. Online publication date: 20-May-2024.
  • (2024) A vulnerability detection framework by focusing on critical execution paths. Information and Software Technology, 174, 107517. https://doi.org/10.1016/j.infsof.2024.107517. Online publication date: Oct-2024.
  • (2024) A comprehensive analysis on software vulnerability detection datasets: trends, challenges, and road ahead. International Journal of Information Security. https://doi.org/10.1007/s10207-024-00888-y. Online publication date: 23-Jul-2024.
  • (2023) MalWuKong: Towards Fast, Accurate, and Multilingual Detection of Malicious Code Poisoning in OSS Supply Chains. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 1993–2005. https://doi.org/10.1109/ASE56229.2023.00073. Online publication date: 11-Sep-2023.
