Open access

Neural reverse engineering of stripped binaries using augmented control flow graphs

Authors:

Yaniv David, Uri Alon, Eran Yahav

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA

Article No.: 225, Pages 1 - 28

https://doi.org/10.1145/3428293

Published: 13 November 2020

Abstract

We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations. We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures. Our evaluation shows that our models produce predictions that are difficult and time consuming for humans, while improving on existing methods by 28% and by 100% over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p526-p-video.mp4)

This is a presentation video of our talk at OOPSLA'20.
