LLM4Decompile: Open-source Large Language Models for Decompilation with Emphasis on Code Executability and Recompilability

Decompilation plays a crucial role in software reverse engineering, enabling the analysis and understanding of binary executables when their source code is inaccessible. This is particularly valuable for software security analysis, bug detection, and the recovery of legacy code. However, traditional decompilation techniques often struggle to produce human-readable and semantically accurate source code, posing a significant challenge.

Research in decompilation has traditionally relied on various tools and methods to translate binary code back into source code, albeit with varying degrees of success. Tools like Ghidra and IDA Pro excel in specific scenarios but often fail to restore code to a form easily understandable by humans. This challenge is compounded by the inherent difficulty of accurately reconstructing the finer details of source code, such as variable names and the original structure, including loops and conditional statements, which are typically lost during compilation.

Researchers from the Southern University of Science and Technology and the Hong Kong Polytechnic University introduced LLM4Decompile, which stands out for its unique approach. It utilizes LLMs pre-trained on vast amounts of C source code and corresponding assembly code, aiming to leverage their predictive capabilities to reconstruct accurate and syntactically correct source code from binary executables. Unlike existing tools, LLM4Decompile prioritizes code executability, a key measure of functional correctness.

The team compiled a dataset of 4 billion tokens, encompassing a wide range of C and assembly code pairs, to train models of varying sizes from 1B to 33B parameters. This extensive pre-training aims to imbue the models with a deep understanding of code structure and semantics. Unlike previous tools, which often generated code that was either non-functional or difficult for humans to parse, LLM4Decompile strives to produce code that resembles the source in syntax and retains its executable essence.
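A compile-then-disassemble pipeline is the usual way to produce such assembly-and-source training pairs: compile each C function at different optimization levels, disassemble the object file, and strip away addresses and raw bytes so the model sees compact assembly text. The sketch below is a minimal illustration of that idea, not the authors' actual tooling; the function names and the `objdump` invocation shown are assumptions.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def compile_and_disassemble(c_source: str, opt: str = "-O0") -> str:
    """Compile a C snippet and return its objdump disassembly.

    Requires gcc and objdump on PATH; `opt` is an optimization
    level such as -O0 through -O3, the kind of variation a
    training corpus would cover.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "func.c"
        obj = Path(tmp) / "func.o"
        src.write_text(c_source)
        subprocess.run(["gcc", opt, "-c", str(src), "-o", str(obj)], check=True)
        out = subprocess.run(
            ["objdump", "-d", str(obj)],
            capture_output=True, text=True, check=True,
        )
        return out.stdout


def clean_asm(objdump_output: str) -> str:
    """Strip addresses and raw instruction bytes from objdump output,
    keeping only the mnemonics and operands."""
    kept = []
    for line in objdump_output.splitlines():
        # Instruction lines look like: "   4:\t55    \tpush   %rbp"
        m = re.match(r"\s*[0-9a-f]+:\s*(?:[0-9a-f]{2}\s)+\s*\t?(.*)", line)
        if m and m.group(1):
            kept.append(m.group(1).strip())
    return "\n".join(kept)
```

Pairing each cleaned disassembly with its original C source yields one training example per function per optimization level.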

The evaluation of LLM4Decompile’s efficacy is equally meticulous, utilizing the newly introduced Decompile-Eval benchmark. This benchmark assesses decompiled code on two crucial fronts: re-compilability and re-executability. Together, these metrics capture both the model’s ability to generate syntactically correct code and its grasp of code semantics. LLM4Decompile achieved a significant milestone: its 6B model reached a 90% re-compilability rate and a 21% re-executability rate, a 50% improvement in decompilation performance over GPT-4, underscoring the leaps in decompilation accuracy and utility.
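The two metrics can be understood operationally: re-compilability asks whether the decompiled source compiles at all, while re-executability asks whether the compiled result passes the benchmark’s test assertions. A rough sketch of how such checks might look is below; the function names are illustrative and are not taken from the benchmark’s actual code.

```python
import subprocess
import tempfile
from pathlib import Path


def recompiles(decompiled_c: str) -> bool:
    """Re-compilability: does the decompiled source compile? Requires gcc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "pred.c"
        src.write_text(decompiled_c)
        r = subprocess.run(
            ["gcc", "-c", str(src), "-o", str(Path(tmp) / "pred.o")],
            capture_output=True,
        )
        return r.returncode == 0


def reexecutes(decompiled_c: str, test_harness_c: str) -> bool:
    """Re-executability: link the decompiled function against a test
    harness full of assertions and check that it runs successfully."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "combined.c"
        exe = Path(tmp) / "a.out"
        src.write_text(decompiled_c + "\n" + test_harness_c)
        if subprocess.run(["gcc", str(src), "-o", str(exe)],
                          capture_output=True).returncode != 0:
            return False
        return subprocess.run([str(exe)], capture_output=True).returncode == 0


def rate(flags: list) -> float:
    """Aggregate per-sample pass/fail booleans into a benchmark percentage."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Under this reading, a 90% re-compilability rate with 21% re-executability means most outputs are valid C, but preserving the original program’s behavior remains the harder problem.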

In conclusion, the introduction of LLM4Decompile is a game-changer in software engineering. The work not only addresses the longstanding challenges inherent in decompilation but also paves the way for new avenues of research and development. With its advanced methodology and impressive performance, LLM4Decompile heralds a future where decompilation can be as nuanced and refined as the code it seeks to unravel.


Check out the Paper and Github. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.




