DeLink: Source File Information Recovery in Binaries

Authors:

Limin Sun

ISSTA 2024: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 1009 - 1021

https://doi.org/10.1145/3650212.3680338

Published: 11 September 2024

Abstract

Program comprehension helps analysts understand the primary behavior of a binary and makes reverse engineering more efficient. Existing work focuses on instruction translation and function name prediction, but these techniques offer limited help in understanding a program as a whole. Recovered source file information can provide insight into the primary behavior of a binary, serving as a high-level program summary. However, files recovered by function clustering-based approaches contain binary functions that are distributed discontinuously, resulting in low accuracy. In addition, no existing research predicts the names of these recovered files. To this end, we propose DeLink, a framework for source file information recovery in binaries. The framework first uses a file structure recovery approach based on boundary location to recognize files within a binary. It then uses an encoder-decoder model to predict the names of these files. Experimental results show that our file structure recovery approach achieves an average improvement of 14% across six evaluation metrics and requires an average of only 16.74 seconds, outperforming the state-of-the-art work in both recovery quality and efficiency. Our file name prediction model achieves 70.09% precision and 63.91% recall. Finally, we demonstrate the effective application of DeLink to malware homology analysis.
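The contrast the abstract draws between clustering and boundary location can be illustrated with a minimal sketch. It is not DeLink's algorithm: it only assumes that functions are listed in address order and that some boundary indices are already known, and it shows why cutting at boundaries always yields contiguous groups while an arbitrary cluster labeling may not. The function names (`split_at_boundaries`, `is_contiguous`) and the sample data are hypothetical.

```python
from typing import List, Sequence


def split_at_boundaries(functions: Sequence[str],
                        boundaries: Sequence[int]) -> List[List[str]]:
    """Partition address-ordered functions into contiguous groups.

    `boundaries` holds the indices at which a new (hypothetical) file starts;
    index 0 is implied and out-of-range indices are ignored.
    """
    cuts = sorted({b for b in boundaries if 0 < b < len(functions)})
    starts = [0] + cuts
    ends = cuts + [len(functions)]
    return [list(functions[s:e]) for s, e in zip(starts, ends)]


def is_contiguous(labels: Sequence[int]) -> bool:
    """Return True if every label occupies one unbroken run in address order."""
    seen = set()
    previous = None
    for label in labels:
        if label != previous:
            if label in seen:
                # The label reappears after another group interrupted it.
                return False
            seen.add(label)
            previous = label
    return True


# Hypothetical function names, listed in ascending address order.
funcs = ["ev_init", "ev_run", "parse_cfg", "load_cfg", "main"]
files = split_at_boundaries(funcs, [2, 4])

# A boundary-based split is contiguous by construction; a cluster labeling
# such as [0, 1, 0, 1, 2] interleaves two groups and is not.
boundary_labels = [i for i, group in enumerate(files) for _ in group]
cluster_labels = [0, 1, 0, 1, 2]
```

Here `is_contiguous(boundary_labels)` holds for any choice of boundaries, whereas `is_contiguous(cluster_labels)` fails, which is the kind of discontinuous distribution the abstract attributes to clustering-based recovery.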

Index Terms

DeLink: Source File Information Recovery in Binaries
1. Security and privacy
  1. Software and application security
    1. Software reverse engineering

Published In

ISSTA 2024: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis

September 2024

1928 pages

ISBN:9798400706127

DOI:10.1145/3650212

General Chair: Maria Christakis, TU Wien, Austria

Program Chair: Michael Pradel, University of Stuttgart, Germany

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISSTA '24

Sponsor:

SIGSOFT

ISSTA '24: 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis

September 16 - 20, 2024

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%
