DOI: 10.1145/3597503.3639116
research-article
Open access

VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses

Published: 12 April 2024

Abstract

The success of learning-based defensive software vulnerability analyses has been accompanied by a lack of large, high-quality sets of labeled vulnerable program samples, which impedes further advancement of those defenses. Existing automated sample generation approaches have shown potential yet still fall short of practical expectations because of the high noise in the generated samples. This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets. Given a normal program, VGX identifies the code contexts in which vulnerabilities can be injected, using a customized Transformer that features a new value-flow-based position encoding and is pre-trained against new objectives designed specifically for learning code structure and context. VGX then materializes vulnerability-injection code edits in the identified contexts using patterns of such edits obtained from both historical fixes and human knowledge about real-world vulnerabilities.
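The two steps described above (locate an injectable context, then apply an edit pattern there) can be illustrated with a toy sketch. This is not VGX's implementation: a regular expression stands in for the Transformer-based context identification, and the `EditPattern` class, the `remove-null-check` pattern, and the sample C function are all hypothetical names and data introduced only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EditPattern:
    """Hypothetical edit pattern: where it applies and how it rewrites the line."""
    name: str
    context: re.Pattern
    rewrite: Callable[[re.Match], str]


# Illustrative pattern (an assumption, not taken from the paper):
# deleting a NULL-pointer guard can reintroduce a NULL dereference.
REMOVE_NULL_CHECK = EditPattern(
    name="remove-null-check",
    context=re.compile(r"^\s*if\s*\(\s*!\s*\w+\s*\)\s*return\b[^;]*;\s*$"),
    rewrite=lambda m: "",  # drop the guard line entirely
)


def locate_context(lines: List[str], pattern: EditPattern) -> Optional[int]:
    """Step 1 stand-in: index of the first line the pattern applies to."""
    for i, line in enumerate(lines):
        if pattern.context.match(line):
            return i
    return None


def inject(source: str, pattern: EditPattern) -> Optional[str]:
    """Step 2: apply the edit at the located context; None if no context exists."""
    lines = source.splitlines()
    idx = locate_context(lines, pattern)
    if idx is None:
        return None
    edited = pattern.rewrite(pattern.context.match(lines[idx]))
    new_lines = lines[:idx] + ([edited] if edited else []) + lines[idx + 1:]
    return "\n".join(new_lines)


# Hypothetical "normal" (non-vulnerable) input program.
NORMAL = """int length(struct buf *b) {
    if (!b) return -1;
    return b->len;
}"""

if __name__ == "__main__":
    print(inject(NORMAL, REMOVE_NULL_CHECK))
```

Separating "where to edit" from "what edit to apply" mirrors the structure the abstract describes; in this sketch both halves are deliberately simplistic, and `inject` returns `None` when no suitable context is found rather than forcing an edit.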
Compared to four state-of-the-art (SOTA) baselines (pattern-, Transformer-, GNN-, and pattern+Transformer-based), VGX achieved 99.09–890.06% higher F1 and 22.45–328.47% higher label accuracy. For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair. Our results show that SOTA techniques for these three application tasks achieved 19.15–330.80% higher F1, 12.86–19.31% higher top-10 accuracy, and 85.02–99.30% higher top-50 accuracy, respectively, after those samples were added to their original training data. These samples also helped a SOTA vulnerability detector discover 13 more real-world vulnerabilities (CVEs) in critical systems (e.g., Linux kernel) that the original model would have missed.
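As a reading aid for the percentages above, the sketch below assumes that "X% higher" denotes relative improvement over a baseline score, which is the only reading under which gains above 100% are possible. The function names and the example scores are hypothetical; the abstract does not report the underlying absolute scores or the exact size of the 10% subset.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def subset_size(total: int, fraction: float) -> int:
    """Size of a random subset, rounded to the nearest whole sample."""
    return round(total * fraction)


if __name__ == "__main__":
    # Hypothetical scores, chosen only to illustrate the formula.
    print(relative_gain_percent(0.20, 0.40))
    # 10% of the 150,392 generated samples (rounding is an assumption).
    print(subset_size(150_392, 0.10))
```

Under this reading, a 10% random subset of the 150,392 generated samples is about 15,039 samples.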

    Published In

    ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
    May 2024
    2942 pages
    ISBN:9798400702174
    DOI:10.1145/3597503
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    In-Cooperation

    • Faculty of Engineering of University of Porto

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 April 2024

    Author Tags

    1. vulnerability dataset
    2. vulnerability injection
    3. data quality
    4. vulnerability analysis
    5. deep learning
    6. program generation

    Qualifiers

    • Research-article

    Conference

    ICSE '24

    Acceptance Rates

    Overall Acceptance Rate 276 of 1,856 submissions, 15%

    Upcoming Conference

    ICSE 2025

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 693
    • Downloads (last 6 weeks): 109
    Reflects downloads up to 01 Jan 2025

    Citations

    Cited By

    • (2024) Improving VulRepair’s Perfect Prediction by Leveraging the LION Optimizer. Applied Sciences 14:13 (5750). DOI: 10.3390/app14135750. Online publication date: 1-Jul-2024
    • (2024) VinJ: An Automated Tool for Large-Scale Software Vulnerability Data Generation. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (567-571). DOI: 10.1145/3663529.3663800. Online publication date: 10-Jul-2024
    • (2024) Learning to Detect and Localize Multilingual Bugs. Proceedings of the ACM on Software Engineering 1:FSE (2190-2213). DOI: 10.1145/3660804. Online publication date: 12-Jul-2024
    • (2024) Improving Long-Tail Vulnerability Detection Through Data Augmentation Based on Large Language Models. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME) (262-274). DOI: 10.1109/ICSME58944.2024.00033. Online publication date: 6-Oct-2024
