How Far Have We Gone in Binary Code Understanding Using Large Language Models

Shang, Xiuwei; Cheng, Shaoyin; Chen, Guoqiang; Zhang, Yanming; Hu, Li; Yu, Xiao; Li, Gangyang; Zhang, Weiming; Yu, Nenghai

arXiv:2404.09836 (cs)

[Submitted on 15 Apr 2024 (v1), last revised 24 Oct 2024 (this version, v3)]

Abstract:Binary code analysis plays a pivotal role in various software security applications, such as software maintenance, malware detection, software vulnerability discovery, and patch analysis. However, unlike source code, binary code is challenging for reverse engineers to understand because it lacks semantic information. Automated tools are therefore needed to assist human analysts in interpreting binary code. In recent years, two groups of technologies have shown promise: (1) deep learning-based technologies have demonstrated competitive results in tasks related to binary code understanding, and (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This raises the question of how well LLMs can understand binary code.
In this work, we propose a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios. The benchmark covers two key binary code understanding tasks: function name recovery and binary code summarization. Through extensive evaluations of popular LLMs on this benchmark, we gain valuable insights into their capabilities and limitations. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of LLMs in advancing the field of binary code understanding.

Comments:	12 pages, 8 figures, to be published in ICSME 2024
Subjects:	Software Engineering (cs.SE); Cryptography and Security (cs.CR)
Cite as:	arXiv:2404.09836 [cs.SE]
	(or arXiv:2404.09836v3 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2404.09836

Submission history

From: Xiuwei Shang
[v1] Mon, 15 Apr 2024 14:44:08 UTC (884 KB)
[v2] Tue, 16 Apr 2024 08:15:22 UTC (884 KB)
[v3] Thu, 24 Oct 2024 14:35:43 UTC (885 KB)
