DOI:10.1145/3510003.3510627

ModX: binary level partially imported third-party library detection via program modularization and semantic matching

Published: 05 July 2022

Abstract

With the rapid growth of software, the use of third-party libraries (TPLs) has become increasingly popular. This wealth of libraries gives software engineers many ways to speed up program development. Unfortunately, it also poses great challenges, because the large volume of libraries becomes much more difficult to manage. Many approaches have been proposed to detect and understand the TPLs in software. However, most existing approaches rely on syntactic features, which are not robust when these features are changed or deliberately hidden by adversarial parties. Moreover, these approaches typically model each imported library as a whole and therefore cannot be applied to scenarios where the host software uses only part of the library code.
To detect both fully and partially imported TPLs at the semantic level, we propose ModX, a framework that leverages novel program modularization techniques to decompose a program into fine-grained, functionality-based modules. By extracting both syntactic and semantic features, it measures the distance between modules to detect reuse of similar library modules in the program. Experimental results show that ModX outperforms other modularization tools, distinguishing more coherent program modules with 353% higher module quality scores, and outperforms other TPL detection tools with, on average, 17% higher precision and 8% higher recall.
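
The module-level matching idea can be illustrated with a minimal sketch. It is not ModX's actual algorithm: it assumes each module has already been reduced to a set of feature tokens, and it swaps in plain Jaccard distance for the paper's syntactic and semantic distance. The names `Module`, `jaccard_distance`, `match_modules`, and the `max_distance` threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: a module is the set of feature tokens
# (for example, constants or callee names) extracted from its functions.
Module = FrozenSet[str]


def jaccard_distance(a: Module, b: Module) -> float:
    """Return 1 - |a & b| / |a | b|; two empty modules count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def match_modules(
    program: Dict[str, Module],
    library: Dict[str, Module],
    max_distance: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Pair each program module with its closest library module.

    A pair is reported only if its distance is at most max_distance, so a
    library can be detected even when only some of its modules are present.
    """
    matches: List[Tuple[str, str, float]] = []
    for prog_name, prog_feats in program.items():
        best_name, best_dist = None, float("inf")
        for lib_name, lib_feats in library.items():
            dist = jaccard_distance(prog_feats, lib_feats)
            if dist < best_dist:
                best_name, best_dist = lib_name, dist
        if best_name is not None and best_dist <= max_distance:
            matches.append((prog_name, best_name, best_dist))
    return matches


if __name__ == "__main__":
    # Toy data: the program reuses only the checksum part of a library.
    program = {
        "mod_a": frozenset({"inflate", "crc32", "adler32"}),
        "mod_b": frozenset({"socket", "connect"}),
    }
    library = {
        "lib.checksum": frozenset({"crc32", "adler32"}),
        "lib.inflate": frozenset({"inflate", "inflateInit", "crc32"}),
    }
    for prog_name, lib_name, dist in match_modules(program, library):
        print(f"{prog_name} ~ {lib_name} (distance {dist:.2f})")
```

In the toy data only `mod_a` is close enough to a library module to be reported, which mirrors the partial-import scenario: matching at module granularity does not require the whole library to be present in the host program.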

Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022
2508 pages
ISBN:9781450392211
DOI:10.1145/3510003
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. program modularization
  2. semantic matching
  3. third-party library detection

Qualifiers

  • Research-article

Funding Sources

  • Chinese Academy of Sciences
  • Natural Science Foundation of China
  • Ministry of Education, Singapore
  • NTU-DESAY SV Research Program
  • National Research Foundation, Singapore

Conference

ICSE '22
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
