Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3643916.3646556acmconferencesArticle/Chapter ViewAbstractPublication PagesicseConference Proceedingsconference-collections
research-article

SICode: Embedding-Based Subgraph Isomorphism Identification for Bug Detection

Published: 13 June 2024 Publication History

Abstract

Given a known buggy code snippet, searching for similar patterns in a target project to detect unknown bugs is a reasonable approach. In practice, a search unit, such as a function, may appear quite different from the buggy snippet but actually contains a similar buggy substructure. Utilizing subgraph isomorphism identification can effectively hunt potential bugs by checking whether an approximate copy of the buggy subgraph exists within the target code graphs. Regrettably, subgraph isomorphism identification is an NP-complete problem.
In this paper, we propose an embedding-based method, SICode, to efficiently perform subgraph isomorphism identification for code graphs. We train a graph embedding model and the subgraph isomorphism relationship between two graphs can be measured by comparing their embedding vectors. In this manner, we can efficiently identify potential buggy code graphs via vector arithmetic without solving an NP-complete problem. A cascading loss scheme is presented to ensure the identification performance.
SICode exhibits greater scalability than classic subgraph isomorphism algorithms, such as VF2, and maintains high precision and recall. Experiments also demonstrate that SICode offers advantages in detecting sub-structurally similar bugs. Our approach spotted 20 previously-unknown bugs in real-world projects, among which, 18 bugs were confirmed by their developers and ranked within the top ten results of retrieval. This result is very encouraging for detecting subtle sub-structurally similar bugs.

References

[1]
2019. fuzzyc2cpg: A fuzzy parser for C/C++ that creates semantic code property graphs. https://github.com/ShiftLeftSecurity/fuzzyc2cpg.
[2]
2021. NeuroMatch. http://snap.stanford.edu/subgraph-matching.
[3]
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 (2017).
[4]
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14--18, 2009, Vol. 382. ACM, 41--48.
[5]
Pan Bian, Bin Liang, Wenchang Shi, Jianjun Huang, and Yan Cai. 2018. Narminer: Discovering negative association rules from code for bug detection. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 411--422.
[6]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135--146.
[7]
Benjamin Bowman and H Howie Huang. 2020. VGRAPH: A robust vulnerable code clone detection system using code property triplets. In 2020 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 53--69.
[8]
Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2021. InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22--30 May 2021. IEEE, 1186--1197.
[9]
Luigi Pietro Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. 2001. An improved algorithm for matching large graphs. In 3rd IAPR-TC15 workshop on graph-based representations in pattern recognition. Citeseer, 149--159.
[10]
Heming Cui, Gang Hu, Jingyue Wu, and Junfeng Yang. 2013. Verifying systems rules using rule-directed symbolic execution. In ACM SIGPLAN Notices, Vol. 48. ACM, 329--342.
[11]
Yaniv David and Eran Yahav. 2014. Tracelet-based code search in executables. Acm Sigplan Notices 49, 6 (2014), 349--360.
[12]
Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. 2018. Path-based function embedding and its application to error-handling specification mining. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ES-EC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04--09, 2018, Gary T. Leavens, Alessandro Garcia, and Corina S. Pasareanu (Eds.). ACM, 423--433.
[13]
Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code. In NDSS.
[14]
Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 480--491.
[15]
Jianjun Huang, Songming Han, Wei You, Wenchang Shi, Bin Liang, Jingzheng Wu, and Yanjun Wu. 2021. Hunting Vulnerable Smart Contracts via Graph Embedding Based Bytecode Matching. IEEE Transactions on Information Forensics and Security 16 (2021), 2144--2156.
[16]
Jiyong Jang, Maverick Woo, and David Brumley. 2012. ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions. login Usenix Mag. 37, 6 (2012).
[17]
Yuan Kang, Baishakhi Ray, and Suman Jana. 2016. APEx: Automated inference of error specifications for C APIs. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 472--482.
[18]
Seulbae Kim, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. 2017. VUDDY: A scalable approach for vulnerable code clone discovery. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 595--614.
[19]
Zhenmin Li and Yuanyuan Zhou. 2005. PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code. In ACM SIGSOFT Software Engineering Notes, Vol. 30. ACM, 306--315.
[20]
Bin Liang, Pan Bian, Yan Zhang, Wenchang Shi, Wei You, and Yan Cai. 2016. AntMiner: mining more bugs by reducing noise interference. In Proceedings of the 38th International Conference on Software Engineering. 333--344.
[21]
Zhaoyu Lou, Jiaxuan You, Chengtao Wen, Arquimedes Canedo, Jure Leskovec, et al. 2020. Neural Subgraph Matching. arXiv preprint arXiv:2007.03092 (2020).
[22]
Tasuku Nakagawa, Yoshiki Higo, and Shinji Kusumoto. 2021. NIL: large-scale detection of large-variance clones. In ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 830--841.
[23]
NetworkX. 2022. VF2 algorithm. https://networkx.org/documentation/stable/reference/algorithms/isomorphism.vf2.html.
[24]
Henning Perl, Sergej Dechand, Matthew Smith, Daniel Arp, Fabian Yamaguchi, Konrad Rieck, Sascha Fahl, and Yasemin Acar. 2015. Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 426--437.
[25]
Siegfried Rasthofer, Steven Arzt, and Eric Bodden. 2014. A Machine-learning Approach for Classifying and Categorizing Android Sources and Sinks. In NDSS.
[26]
Haichuan Shang, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. 2008. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 1, 1 (2008), 364--375.
[27]
Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011).
[28]
Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. 2007. /* iComment: Bugs or bad comments?*. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 145--158.
[29]
Lin Tan, Xiaolan Zhang, Xiao Ma, Weiwei Xiong, and Yuanyuan Zhou. 2008. AutoISES: Automatically Inferring Security Specification and Detecting Violations. In USENIX Security Symposium. 379--394.
[30]
Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. 2020. Lambdanet: Probabilistic type inference using graph neural networks. arXiv preprint arXiv:2005.02161 (2020).
[31]
Yang Xiao, Bihuan Chen, Chendong Yu, Zhengzi Xu, Zimu Yuan, Feng Li, Binghong Liu, Yang Liu, Wei Huo, Wei Zou, et al. 2020. MVP: Detecting vulnerabilities using patch-enhanced vulnerability signatures. In 29th USENIX Security Symposium (USENIX Security 20). 1165--1182.
[32]
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net.
[33]
Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363--376.
[34]
Xi Xu, Qinghua Zheng, Zheng Yan, Ming Fan, Ang Jia, and Ting Liu. 2021. Interpretation-enabled Software Reuse Detection Based on a Multi-Level Birthmark Model. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 873--884.
[35]
Yifei Xu, Zhengzi Xu, Bihuan Chen, Fu Song, Yang Liu, and Ting Liu. 2020. Patch based vulnerability matching for binary programs. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 376--387.
[36]
Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590--604.
[37]
Fabian Yamaguchi, Felix Lindner, and Konrad Rieck. 2011. Vulnerability extrapolation: Assisted discovery of vulnerabilities using machine learning. In Proceedings of the 5th USENIX conference on Offensive technologies. 13--13.
[38]
Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. 2012. Generalized vulnerability extrapolation using abstract syntax trees. In Proceedings of the 28th Annual Computer Security Applications Conference. 359--368.
[39]
Insu Yun, Changwoo Min, Xujie Si, Yeongjin Jang, Taesoo Kim, and Mayur Naik. 2016. APISan: Sanitizing API Usages through Semantic Cross-Checking. In 25th USENIX Security Symposium (USENIX Security 16). 363--378.
[40]
Xiaohui Zhang, Yuanjun Gong, Bin Liang, Jianjun Huang, Wei You, Wenchang Shi, and Jian Zhang. 2022. Hunting bugs with accelerated optimal graph vertex matching. In ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18 - 22, 2022, Sukyoung Ryu and Yannis Smaragdakis (Eds.). ACM, 64--76.
[41]
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. arXiv preprint arXiv:1909.03496 (2019).
[42]
Yue Zou, Bihuan Ban, Yinxing Xue, and Yun Xu. 2020. CCGraph: a PDG-based code clone detector with approximate graph matching. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 931--942.

Index Terms

  1. SICode: Embedding-Based Subgraph Isomorphism Identification for Bug Detection

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      ICPC '24: Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension
      April 2024
      487 pages
      ISBN:9798400705861
      DOI:10.1145/3643916
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 June 2024

      Check for updates

      Author Tags

      1. subgraph isomorphism
      2. bug detection
      3. cascading loss
      4. graph embedding

      Qualifiers

      • Research-article

      Conference

      ICPC '24
      Sponsor:

      Upcoming Conference

      ICSE 2025

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 75
        Total Downloads
      • Downloads (Last 12 months)75
      • Downloads (Last 6 weeks)15
      Reflects downloads up to 14 Jan 2025

      Other Metrics

      Citations

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media