Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3465413.3488575acmconferencesArticle/Chapter ViewAbstractPublication PagesccsConference Proceedingsconference-collections
research-article
Open access

PERFUME: Programmatic Extraction and Refinement for Usability of Mathematical Expression

Published: 15 November 2021 Publication History
  • Get Citation Alerts
  • Abstract

    Algorithmic identification is the crux for several binary analysis applications, including malware analysis, vulnerability discovery, and embedded firmware reverse engineering. However, data-driven and signature-based approaches often break down when encountering outlier realizations of a particular algorithm. Moreover, reverse engineering of domain-specific binaries often requires collaborative analysis between reverse engineers and domain experts. Communicating the behavior of an unidentified binary program to non-reverse engineers necessitates the recovery of algorithmic semantics in a human-digestible form. This paper presents PERFUME, a framework that extracts symbolic math expressions from low-level binary representations of an algorithm. PERFUME works by translating a symbolic output representation of a binary function to a high-level mathematical expression. In particular, we detail how source and target representations are generated for training a machine translation model. We integrate PERFUME as a plug-in for Ghidra--an open-source reverse engineering framework. We present our preliminary findings for domain-specific use cases and formalize open challenges in mathematical expression extraction from algorithmic implementations.

    Supplementary Material

    MP4 File (check069-weidemanA.mp4)
    In this video, we present the paper "PERFUME: Programmatic Extraction and Refinement For Usability of Mathematical Expressions", on extracting mathematical expressions from binary programs. We discuss the pipeline of PERFUME, namely, symbolic execution, data synthesis and machine translation and explain each step with examples. We motivate the effectiveness of PERFUME with our experimental results and discuss remaining challenges and future work.

    References

    [1]
    angr. 2021. The Angr binary analysis platform. http://angr.io.
    [2]
    ArduPilot. 2009. ArduPilot. https://github.com/ArduPilot/ardupilot.
    [3]
    Roberto Baldoni, Emilio Coppa, Daniele Cono D'elia, Camil Demetrescu, and Irene Finocchi. 2018. A survey of symbolic execution techniques. ACM Computing Surveys (CSUR) 51, 3 (2018), 1--39.
    [4]
    Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. 2014. BYTEWEIGHT: Learning to recognize functions in binary code. In 23rd USENIX Security Symposium (USENIX Security 14). 845--860.
    [5]
    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, AmandaAskell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877--1901. https://proceedings. neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
    [6]
    Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. In 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Trento, Italy. https://aclanthology.org/E06-1032
    [7]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186. https://doi.org/10.18653/v1/N19-1423
    [8]
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
    [9]
    Chris Eagle. 2011. The IDA pro book. no starch press.
    [10]
    Javid Ebrahimi, Dhruv Gelda, and Wei Zhang. 2020. How Can Self-Attention Networks Recognize Dyck-n Languages?. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4301--4306. https://doi.org/10.18653/v1/2020.findings-emnlp.384
    [11]
    Michael Hahn. 2020. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics 8 (2020), 156--171. https://doi.org/10.1162/tacl_a_00306
    [12]
    Alexander Heinricher, Ryan Williams, Ava Klingbeil, and Alex Jordan. 2021. Weldr: fusing binaries for simplified analysis. In Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis. 25--30.
    [13]
    Anastasis Keliris and Michail Maniatakos. 2019. ICSREF: A Framework for Automated Reverse Engineering of Industrial Control Systems Binaries. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society. https://www.ndss-symposium.org/ndss-paper/icsref-a-framework-forautomated-reverse-engineering-of-industrial-control-systems-binaries/
    [14]
    Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, and Yongdae Kim. 2020. Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned. arXiv preprint arXiv:2011.10749 (2020).
    [15]
    Guillaume Lample and François Charton. 2019. Deep Learning for Symbolic Mathematics. CoRR abs/1912.01412 (2019). arXiv:1912.01412 http://arxiv.org/abs/1912.01412
    [16]
    Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, ?těpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. 2017. SymPy: symbolic computing in Python. PeerJ Computer Science 3 (Jan. 2017), e103. https://doi.org/10.7717/peerj-cs.103
    [17]
    NSA. 2021. Ghidra. https://ghidra-sre.org/.
    [18]
    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
    [19]
    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311--318. https: //doi.org/10.3115/1073083.1073135
    [20]
    Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems 56 (2016), 157--173. https://doi.org/10. 1016/j.is.2015.08.004
    [21]
    Pengfei Sun, Luis Garcia, Gabriel Salles-Loustau, and Saman Zonouz. 2020. Hybrid firmware analysis for known mobile and iot security vulnerabilities. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 373--384.
    [22]
    Pengfei Sun, Luis Garcia, and Saman Zonouz. 2019. Tell me more than just assembly! reversing cyber-physical execution semantics of embedded iot controller software binaries. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 349--361.
    [23]
    Sepehr Taghdisian. 2018. sx. https://github.com/septag/sx.
    [24]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.
    [25]
    Martin Weigel. [n. d.]. MartinWeigel/Quaternion. https://github.com/ MartinWeigel/Quaternion/blob/master/Quaternion.c
    [26]
    Dongpeng Xu, Jiang Ming, and Dinghao Wu. 2017. Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 921--937.

    Cited By

    View all
    • (2022)AutoCPS: Control Software Dataset Generation for Semantic Reverse Engineering2022 IEEE Security and Privacy Workshops (SPW)10.1109/SPW54247.2022.9833887(236-242)Online publication date: May-2022

    Index Terms

    1. PERFUME: Programmatic Extraction and Refinement for Usability of Mathematical Expression

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      Checkmate '21: Proceedings of the 2021 Research on offensive and defensive techniques in the Context of Man At The End (MATE) Attacks
      November 2021
      76 pages
      ISBN:9781450385527
      DOI:10.1145/3465413
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 November 2021

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. binary analysis
      2. reverse engineering

      Qualifiers

      • Research-article

      Funding Sources

      Conference

      CCS '21
      Sponsor:

      Upcoming Conference

      CCS '24
      ACM SIGSAC Conference on Computer and Communications Security
      October 14 - 18, 2024
      Salt Lake City , UT , USA

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)118
      • Downloads (Last 6 weeks)14
      Reflects downloads up to 27 Jul 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2022)AutoCPS: Control Software Dataset Generation for Semantic Reverse Engineering2022 IEEE Security and Privacy Workshops (SPW)10.1109/SPW54247.2022.9833887(236-242)Online publication date: May-2022

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Get Access

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media