DOI: 10.1145/3691620.3695013
research-article

Effective Vulnerable Function Identification based on CVE Description Empowered by Large Language Models

Published: 27 October 2024

Abstract

Open-source software (OSS) has profoundly transformed the software development paradigm by facilitating effortless code reuse. In recent years, however, the number of disclosed vulnerabilities in OSS has risen alarmingly, posing significant security risks to downstream users. Analyzing existing vulnerabilities and precisely assessing their threat to downstream applications has therefore become pivotal. Considerable effort has recently been devoted to this problem, for example through vulnerability reachability analysis and vulnerability reproduction. The key to these tasks is identifying the vulnerable function (i.e., the function where the root cause of a vulnerability resides). However, public vulnerability datasets (e.g., NVD) rarely include this information, because pinpointing the exact vulnerable functions remains a longstanding challenge.
Existing methods mainly detect vulnerable functions based on vulnerability patches or Proof-of-Concept (PoC) exploits. Such methods are limited by data availability and require extensive manual effort, which hinders scalability. To address this issue, we propose VFFinder, a novel approach that uses Large Language Models (LLMs) to localize vulnerable functions based on Common Vulnerabilities and Exposures (CVE) descriptions and the corresponding source code. Specifically, VFFinder adopts a customized in-context learning (ICL) approach based on CVE description patterns to enable the LLM to extract key entities. It then performs priority matching against the source code to localize vulnerable functions. We assess the performance of VFFinder on 75 large open-source projects. The results demonstrate that VFFinder significantly surpasses existing baselines. Notably, the Top-1 and MRR metrics improve substantially, with average improvements of 4.25X and 2.37X, respectively. We also integrate VFFinder with Software Composition Analysis (SCA) tools, and the results show that our tool can significantly reduce the false positive rates of existing SCA tools.
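
For readers unfamiliar with the two ranking metrics named above, the following is a minimal sketch of how Top-1 accuracy and Mean Reciprocal Rank (MRR) are conventionally computed for a function-localization task. It uses the standard textbook definitions; the exact evaluation protocol of the paper is not given on this page, and the function names and the sample rankings below are hypothetical, chosen only for illustration.

```python
from typing import AbstractSet, List, Optional, Sequence


def first_hit_rank(ranked: Sequence[str], truth: AbstractSet[str]) -> Optional[int]:
    """Return the 1-based rank of the first ground-truth function, or None on a miss."""
    for rank, name in enumerate(ranked, start=1):
        if name in truth:
            return rank
    return None


def top_k_accuracy(ranks: Sequence[Optional[int]], k: int = 1) -> float:
    """Fraction of cases whose first correct function appears within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all cases; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical data: (ranked candidate functions, ground-truth vulnerable functions).
cases = [
    (["parse", "decode", "read"], {"parse"}),  # correct function at rank 1
    (["init", "decode", "read"], {"read"}),    # correct function at rank 3
    (["open", "close"], {"flush"}),            # correct function not ranked
    (["load", "store"], {"store"}),            # correct function at rank 2
]

ranks: List[Optional[int]] = [first_hit_rank(r, t) for r, t in cases]
top1 = top_k_accuracy(ranks, k=1)
mrr = mean_reciprocal_rank(ranks)
print(f"Top-1 = {top1:.3f}, MRR = {mrr:.3f}")
```

Top-1 rewards only exact first-place hits, whereas MRR also gives partial credit when the correct function is ranked lower, which is why the two are usually reported together.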

Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
October 2024
2587 pages
ISBN:9798400712487
DOI:10.1145/3691620
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. vulnerability analysis
  2. vulnerable function
  3. large language model

Qualifiers

  • Research-article

Funding Sources

  • Major Program (JD) of Hubei Province

Conference

ASE '24

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 315
  • Downloads (Last 12 months): 315
  • Downloads (Last 6 weeks): 67
Reflects downloads up to 10 Feb 2025