DOI: 10.1145/3650212.3680338
research-article
Open access

DeLink: Source File Information Recovery in Binaries

Published: 11 September 2024

Abstract

Program comprehension helps analysts understand the primary behavior of a binary and makes reverse engineering more efficient. Existing work focuses on instruction translation and function name prediction, but these techniques offer limited insight into the program as a whole. Recovered source file information can serve as a high-level program summary that reveals the primary behavior of a binary. However, the files recovered by function clustering-based approaches contain binary functions with discontinuous distributions, that is, functions scattered across the binary rather than laid out contiguously, which results in low accuracy. In addition, no existing research addresses predicting the names of the recovered files. To this end, we propose DeLink, a framework for source file information recovery in binaries. DeLink first applies a file structure recovery approach based on boundary location to recognize the files within a binary. It then uses an encoder-decoder model to predict the names of these files. Experimental results show that our file structure recovery approach achieves an average improvement of 14% across six evaluation metrics and requires only 16.74 seconds on average, outperforming the state-of-the-art work in both recovery quality and efficiency. Our file name prediction model achieves 70.09% precision and 63.91% recall. We also demonstrate the effective application of DeLink in malware homology analysis.
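
To make the idea of boundary-based recovery concrete, the sketch below partitions address-sorted functions into files by cutting at boundary positions, so every recovered file is a contiguous run of functions by construction. It is a minimal illustration under stated assumptions, not DeLink's implementation: the function names, addresses, and boundary indices are hypothetical, and the abstract does not describe how DeLink locates the boundaries.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of a binary function: (start address, name).
Function = Tuple[int, str]


def split_at_boundaries(
    functions: Sequence[Function], boundaries: Sequence[int]
) -> List[List[str]]:
    """Partition address-sorted functions into contiguous groups ("files").

    `boundaries` holds indices into the address-sorted list at which a new
    file is assumed to start. Index 0 is always treated as a file start.
    """
    ordered = sorted(functions)  # sort by start address
    starts = sorted(set(boundaries) | {0})
    starts = [s for s in starts if 0 <= s < len(ordered)]

    files: List[List[str]] = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(ordered)
        files.append([name for _, name in ordered[start:end]])
    return files


# Hypothetical input: four functions and one assumed boundary before index 2.
example_functions = [
    (0x2000, "parse_args"),
    (0x1000, "ev_init"),
    (0x2080, "main"),
    (0x1040, "ev_run"),
]
example_files = split_at_boundaries(example_functions, [2])
```

Here `example_files` groups `ev_init` and `ev_run` into one file and `parse_args` and `main` into another. A clustering-based approach has no such constraint and may place non-adjacent functions in the same file.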

    Published In

    ISSTA 2024: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis
    September 2024
    1928 pages
    ISBN:9798400706127
    DOI:10.1145/3650212
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Program Comprehension
    2. Source File Information Recovery

    Qualifiers

    • Research-article

    Conference

    ISSTA '24
    Acceptance Rates

    Overall Acceptance Rate 58 of 213 submissions, 27%
